Knowledge Officer is a young platform for content management and data filtration. We organize the huge amount of content on the internet and translate it into defined learning paths that help users reach their career goals. Today, I’ll share some of the technical challenges we have faced since the early days of this long journey, and how we reshaped our architecture from a one-month MVP into a scalable and maintainable product.
Natural Language Processing and Machine Learning are our two main tools. We use them to manage and deliver quality content to people who are hungry for knowledge. On a day-to-day basis, we face many challenges, some of which are:
- content mining
- content cleaning
- text processing and information retrieval
- text labeling and classification
- personalized recommendations for our users
Before diving into each challenge, I want to clarify that with startups, especially in the early days, EVERYTHING is unexpected and must be measured carefully from different points of view. There are 3 main factors to consider when making any decision in a startup’s early days: your team’s level of experience, the time required for actual implementation, and the current and future challenges you are likely to face.
So, instead of jumping right into implementation mode, we first searched for existing solutions and best practices. It can be very challenging, especially in the early days, to come up with a solution that solves both your current problems and problems you cannot yet anticipate.
My team and I were responsible for implementing the initial version of Knowledge Officer: a platform that aggregates noise-free content from different sources in one place.
When drafting the initial system design, we became aware that software architecture is a very widely used term and that millions of articles on the web talk about it, with each author fighting to validate their own point of view. We had to read as much as possible and learn from their experiences.
And so, we realized that the 3 main questions we had to answer were:
- Which web technology will we use?
- How can we store/represent our data?
- And, what is the cost of deploying our application for production?
Which web technology will we use?
The Knowledge Officer team already had good knowledge of Ruby on Rails, and RoR makes it very fast to set up an application from scratch. Time was critical and validating our idea was our number one priority, so we chose RoR for the backend. For the front end, we decided to learn ReactJS. We regarded this as a small investment we had to make at the time.
We realized that we needed to challenge ourselves by learning new technologies; how are we supposed to encourage others to pursue continuous learning if we don’t start ourselves?!
How can we store/represent our data?
We stored our data in an RDBMS, PostgreSQL. You may wonder why we chose an SQL database for an application like that.
At the beginning, SQL was not bad at all. It integrated with Rails very quickly, whereas the gems for NoSQL databases had a lot of problems working with Rails. The only open question was how to store the data so that it could be reused in the future.
What is the cost of deploying our application to production?
Honestly, we didn’t care much about this. At this stage, you usually don’t have time to set up machines, and it’s better to deploy to a cloud service instead. It could be a little more costly, but you can start with the minimum requirements. In our case, we had an old machine already set up that was ready for deployment after only minimal modifications.
Reaching your ideal case from day one is impossible. Start quickly and validate quickly and don’t waste your time.
Within a month, we published the first version of the website. Our crawlers ran every day to fetch fresh data. The engines responsible for processing the crawled data were rake tasks in our Rails application that ran periodically. The main application, the management portal and the engines were all hitting the same database.
After 2 months, the data had grown rapidly and our queries became very slow. We went with a basic quick fix: sharding our database by isolating the newly crawled articles from the live, reviewed ones. Then, we added indexes on the columns used by our most frequent queries.
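To make that quick fix concrete, here is a minimal sketch of the idea rather than our actual schema. It uses Python's built-in sqlite3 instead of PostgreSQL, and it models the isolation as two separate tables in one database. All table, column and index names are assumptions made up for illustration.

```python
import sqlite3

# Hypothetical query that the user-facing application runs most often.
LIVE_QUERY = (
    "SELECT title FROM live_articles "
    "WHERE topic = ? ORDER BY published_at DESC LIMIT 10"
)

def build_db():
    """Create an in-memory database with crawled and live articles kept apart."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Freshly crawled, unreviewed articles (written by the crawlers).
        CREATE TABLE crawled_articles (
            id INTEGER PRIMARY KEY, source TEXT, title TEXT, crawled_at TEXT
        );
        -- Reviewed articles served to end users.
        CREATE TABLE live_articles (
            id INTEGER PRIMARY KEY, source TEXT, title TEXT,
            topic TEXT, published_at TEXT
        );
        -- Index on the columns the frequent query filters and sorts by.
        CREATE INDEX idx_live_topic_published
            ON live_articles (topic, published_at);
    """)
    return conn

def promote(conn, article_id, topic, published_at):
    """Move one reviewed article from the crawled table to the live table."""
    row = conn.execute(
        "SELECT source, title FROM crawled_articles WHERE id = ?", (article_id,)
    ).fetchone()
    if row is None:
        raise KeyError(article_id)
    with conn:  # both statements commit together or not at all
        conn.execute(
            "INSERT INTO live_articles (source, title, topic, published_at) "
            "VALUES (?, ?, ?, ?)",
            (row[0], row[1], topic, published_at),
        )
        conn.execute("DELETE FROM crawled_articles WHERE id = ?", (article_id,))

def query_plan(conn, sql, params=()):
    """Return SQLite's textual query plan, to check whether an index is used."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " ".join(row[-1] for row in rows)
```

Calling `query_plan(conn, LIVE_QUERY, ("nlp",))` should mention `idx_live_topic_published`, which is a quick way to confirm that the frequent query no longer scans the whole table.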
Go with basic quick fixes. Don’t over-engineer things or waste time on things that are not a top priority.
So, our MVP achieved its target and we got our idea validated. It was now time to recognize that everything can be built easily, except DATA.
Now, we were faced with what one might call technical debt:
- All our data, both live articles and training data, was in one place.
- The end-user application interacted with the same database that our engines used to store new data, i.e. a single point of failure.
- An RDBMS wasn’t suitable for unstructured data, such as different types of articles from different sources.
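The last point is easier to see with a small example. The sketch below is only an illustration: the two items and their field names are hypothetical, not our real data. It counts how many columns a single wide table would need to hold both shapes, and how many cells would stay empty, while each item serializes to a JSON document as-is.

```python
import json

# Two hypothetical crawled items from different sources, with different fields.
blog_post = {
    "source": "blog",
    "title": "Intro to NLP",
    "body": "Some text",
    "tags": ["nlp"],
}
video = {
    "source": "video",
    "title": "ML in 10 minutes",
    "duration_seconds": 600,
    "transcript": "Some transcript",
}

def columns_needed(docs):
    """Union of all fields: the columns one wide relational table would need."""
    cols = set()
    for doc in docs:
        cols.update(doc)
    return sorted(cols)

def null_cells(docs):
    """Number of cells that would be NULL if every doc went into that table."""
    cols = columns_needed(docs)
    return sum(1 for doc in docs for col in cols if col not in doc)

def as_documents(docs):
    """Serialize each item unchanged, as a document store would keep it."""
    return [json.dumps(doc, sort_keys=True) for doc in docs]
```

Every new source type adds more columns and more empty cells to the wide table, whereas the document form stays the same size as the item itself.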
To pay off this technical debt, it was time to introduce a new architecture. The Knowledge Officer team had grown and become more familiar with different technologies, so we focused our attention on other factors like scalability, performance and quality.
First, we thought about protecting our end-user application flow and isolating its data, in order to avoid unwanted data corruption. We developed our API as an independent application that handles requests from our mobile and web applications. It can read and write data on an RDBMS once new data is in the ready state.
We built 3 engines: the Content Mining Engine, the Processing Engine and the Classification Engine.
Each engine runs independently. All 3 engines interact with a NoSQL, document-based database. This keeps data moderation isolated from the user-facing data and makes it more secure.
And so, we now have 3 isolated components, each with a specific mission and its own owner in the team. This architecture will help us maintain these components in the future.
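The separation between the engines can be sketched roughly as follows. This is a toy model under stated assumptions, not our production code: the document database is replaced by an in-memory class, the status values (`crawled`, `processed`, `ready`) are hypothetical names, and the keyword rule stands in for a real classifier. The point it illustrates is that each engine only reads and writes documents in the store and never calls another engine directly.

```python
import re

class DocumentStore:
    """In-memory stand-in for a document database (an assumption for this sketch)."""

    def __init__(self):
        self._docs = {}
        self._next_id = 1

    def insert(self, doc):
        doc_id = self._next_id
        self._next_id += 1
        self._docs[doc_id] = dict(doc, _id=doc_id)
        return doc_id

    def find(self, status):
        # Returns a new list, so callers may update docs while iterating.
        return [d for d in self._docs.values() if d["status"] == status]

    def update(self, doc_id, **fields):
        self._docs[doc_id].update(fields)

def mining_engine(store, raw_pages):
    """Content mining: save fetched pages as 'crawled' documents."""
    for page in raw_pages:
        store.insert({"url": page["url"], "html": page["html"], "status": "crawled"})

def processing_engine(store):
    """Processing: strip tags and normalize whitespace for crawled documents."""
    for doc in store.find("crawled"):
        text = re.sub(r"<[^>]+>", " ", doc["html"])
        store.update(doc["_id"], text=" ".join(text.split()), status="processed")

def classification_engine(store):
    """Classification: a placeholder keyword rule, not a real model."""
    for doc in store.find("processed"):
        label = "machine-learning" if "learning" in doc["text"].lower() else "other"
        store.update(doc["_id"], label=label, status="ready")
```

Because the only shared contract is the document and its status, each engine can be run, scheduled, or rewritten on its own, and only documents returned by `store.find("ready")` are candidates for the user-facing database.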
Our next challenge was to decide how and when to migrate the data from the document-based DB to the production DB (RDBMS), the user-facing database.
We researched 2 approaches: Data-Driven vs Event-Driven architecture.
In my next article, I will discuss in detail each approach and how we managed to pick one.
Knowledge Officer is a learning platform for professionals. Our mission is to empower a generation of lifelong learners and to help people, however busy, learn something new and relevant every day and achieve their career goals.
And we’d love to hear your thoughts! Send them to us at firstname.lastname@example.org