Let’s get one thing out of the way before we start: this post is not an attempt to disparage HBase. HBase is an extremely powerful tool; applied appropriately and skillfully under the right scenarios, it can move mountains. This post is about the evolution of Traackr’s data storage needs and how MongoDB ended up satisfying them. It’s also a tip of the hat at the MongoDB team and 10gen and the tremendous work they have done.
Back in late 2009 early 2010, Traackr was designing the foundations of its’ search engine and hunting for an appropriate datastore to back it up. Some of the requirements were:
- Built-in support for storing terabytes of text: that meant that we shouldn’t have to use or modify the software in an unconventional fashion beyond its’ original design to get it to store and retrieve the quantities of data we wanted.
- Flexible schema: Traackr deals with heterogenous data sources from the web, constantly discovering new content and new properties that characterize that content. The database had to allow us to model those properties across all stored content without requiring extensive schema migrations that would take the system offline.
- Ability to batch process the data: Traackr’s scoring algorithms take into account statistical measurements derived from our entire active data set. Those computations need to be run at least once a week to account for the continuous growth and shifts in data samples. We therefore needed a system that allowed performing computations on the whole data set in a “reasonable” amount of time. We didn’t have an exact number for “reasonable” but our goal was to keep those processes running under 6 hours early on Saturdays and leave enough time for other batch jobs that depended on these refreshed computations to run the remainder of the weekend hours.
Some of the contenders in our product selection matrix were:
- Traditional enterprise packages such as Oracle: ruled out because they were way out of our budget.
- MySQL: our content sizes vary from 140 character tweets to multi-page articles and using one size fits all BLOBs would be a tremendous waste of space. Granted, storage is cheap or one could design the schema to split content to different tables accordingto size but that would add needless complexity when there were other solutions that didn’t. Also, the flexibleschema requirement would not be fulfilled as every new column added or modified in a multi-million row table would require a time consuming migration. There are ways to mitigate this by creating attribute tables but those tables become very large on their own (think a dozen attributes per post times millions of posts). And while MySQL can be sharded to split large tables and data sets it requires the code the be cognizant about it while other solutions don’t. So we passed not only on MySQL, but also on most other open-source RDBMS solutions for the same reasons. Enter NoSQL.
- Cassandra: It fit the bill in terms of schema flexibility and storage capacity. The model of having the same deployable for each cluster node was very enticing; it made for much easier setup and maintenance for a small team like ours. But in late 2009 / early 2010, it still lacked batch processing options like MapReduce (those were introduced later on in 2010). It also seemed that there had been some period of inactivity around the project afterits’ initial 2008 release, so at the time, we were uncertain about its’ future adoption.
- MongoDB: it was still new at the time, so we had concerns about its’ stability and adoption. The document-based schema flexibility looked great but auto-sharding was still not available (came out mid-2010) and there were no out-of-the-box options for batch processing. So we made a note of it and kept looking.
- Riak: it was a serious contender for us; most of our requirements were being met and it presented the same promise of ease of use and deployment as Cassandra did. The team was an impressive bunch from Akamai that really seemed to know what they were doing. To top it all off, they were local to Boston and startup-friendly. Despite all of this, we ended up shying away due to questions of adoption. It was still too young of a project for us.
- HBase: back then, it was one of the most polished solutions with quite a bit of traction. The requirements were all there: ability to grow with large data sets, flexible schema, built-in batch processing with MapReduce, healthy community for support.We had our pick. It also provided “out-of-the box” secondary indexing through a contrib package that came with the source. This allowed us to avoid writing our own app layer indexing code or so we thought. Those secondary indexes ended up being a lot more critical in the longer run than we originally anticipated.
So we picked HBase and started running with it. We had to deal the learning curve of its’ setup and the various components and configurations. This pretty much took most of my time which unfortunately detracted from working on features. But we eventually got there and were able to get it humming (most of the time). Our weekend batch scoring requirements were met as expected: we were able to re-score our entire database in less than 30 minutes. We even contributed back to the code base (HBASE-2438 and HBASE-2426). Things looked good.
Then came the upgrade from HBase 0.20.x to 0.89.x/0.90. The code base was changing fast and we wanted to keep up with the latest speed and stability fixes. But there was one problem: the secondary indexing contrib packages were moved out of the main code base and as a result, our HBASE-2426 customizations were becoming stale. This was also signaling that indexing was in fact about to fall even further behind in priority instead of making it to the core source. Bad news for us since we depended on it; we had no choice but to keep our customizations up to speed. We eventually ended up dropping the contrib packages all together and completely re-wrote our secondary indexes using a more generic approach to avoid running on unsupported 3rd-party code. Even then, we still knew that app layer indexing was going to be slower and more brittle than any DB layer solution. This became even more apparent when we needed to evolve our domain model.
The basis of our data is built on 3 core entities: influencers, the channels on which they post content and the posts themselves. We had originally de-normalized the relationship between influencers and their affiliated channels, admittedly to be more inline with how our NoSQL datastore was intended to be used. While we knew that a given channel could be shared across multiple authors, we chose to repeat some channel data across influencers to simplify runtime random access. While this decision simplified queries, it ended up putting more strain and complexity on our content attribution logic. With more complexity come more bugs and those started rearing their ugly heads in the form of some mis-attributed content. The issue was compounded when we cranked our content tracking up a notch and introduced our daily monitor feature in mid-2011. To add to this challenge, we had also discovered after months in production that we needed a better approach for modeling the details of the relationship between an influencer and a channel as each influencer interacted with a given channel in their own way. So after a year and half of running on our original assumptions it was becoming clear that our model needed revisiting.
All of this was happening around the time we created an opening for a big-data engineer. The 2011 Santa Clara Hadoop Summit was also being held. We ended up attending the conference hoping to meet some talent. We even showed up at the HBase Contributor meetup where we mingled with some of the great minds behind the scenes. The trip was both exciting and revealing. The two major take aways for us were:
- Big-data engineers came at a premium: good luck competing with some of the big pocket books in the valley.
- Experienced HBase engineers were even more rare and the good ones where all strategically positioned around firms in the valley, so we would be better off either hiring on the East Coast or developing the capabilities in-house.
For a shop our size, there is a limit to how many of your resources you can take away from feature development to dedicate to infrastructural concerns. We had already spent a lot of dev time on the datastore infrastructure and our impending model changes were about to call for some more. Adding this up with the seemingly slim odds of us attracting an experienced HBase developer, it became apparent that continuing down the HBase route would be akin to trying to fit a square peg in a round hole. It was time to move on and look for a more appropriate solution. The tool was just not built for what we were trying to do and we could not afford to try to get it there. So we went back to re-evaluating NoSQL solutions. This time, solid support for secondary indexes was added to the requirements. The round two contenders were:
- Neo4j: a very powerful graph database capable of efficiently traversing complex relationships; too much for what we needed and primarily designed to be used in an embedded JVM mode, we would have to make some significant changes to our system to integrate it.
- MongoDB: it had matured by leaps and bounds since the last time we looked at it, with increased adoption from many shops and great support from 10gen. It came with advanced indexing out-of-the-box as well as some batch processing options (http://www.mongodb.org/display/DOCS/MapReduce, https://github.com/mongodb/mongo-hadoop). To top it all off, it was a breeze to use, well documented and fit into our existing code base very nicely.
- Cassandra: it had matured as well and now had support for secondary indexes but those seemed more restrictive than Mongo’s and Mongo still had an edge over it in terms of developer friendliness.
- Riak: still a strong contender and supported secondary indexes since release 1.0 but still lacked the traction that MongoDB had.
This time, the choice was much more straight forward. Many of the solutions had come a long way to meeting our needs, so we were able to make a selection not only based on specs but also based on ease of adoption. MongoDB was hands down the most approachable solution for us.
Having worked with it now, it’s no wonder why MongoDB is currently enjoying such growth. While the migration from HBase took us about three months the integration with MongoDB itself was achieved in just a couple weeks since we already had a DAO layer abstracted from the rest of our applications. The rest of the time was spent tweaking our new model and re-writing our content acquisition and attribution services. And at every step of that refactoring, we found that MongoDB was making things easier for us:
- Normalizing channels and influencers became a straight forward exercise as we were now able to model the associations as influencer collection sub-documents.
- Creating indexes for fast random access queries no longer required specialized code. We still had to be mindful of the memory implications but the implementation was much cleaner and easier to maintain.
- Basic things like backups became a breeze again. While HDFS has built-in redundancy within a cluster, replicating data in a backup cluster is still advisable but becomes expensive from a hardware and network point of view. For the longest time, our approach consisted in exporting the HBase tables to S3 on a regular basis, which took a lot of time and far from guaranteed data consistency on restore. With MongoDB, all our data currently fits on a single instance with a hot replica that we can switch to if the master goes down and a third backup machine whose EC2 EBS drive we snapshot on a regular basis after freezing its' XFS file system to ensure data consistency. While such a setup is of course possible with HDFS/HBase, support for it comes out-of-the-box with MongoDB and it can be done more affordably with a lot less hardware.
- The documentation was fantastic. Leaps and bounds ahead of the other solutions we evaluated (although the Riak folks are doing a great job as well), it was very well organized and the disqus comment integration meant that the dev community could easily pitch in if there were specific gaps.
- The speed was really impressive. We found that we were able to replace our weekly influencer re-scoring MapReduce jobs with straight MongoDB cursor iterators and still get the final results faster than before. Our data may of course outgrow this approach at some point but we are confident that we will be able to adapt using the available hadoop integration if we need to.
- The community was within reach and the feedback was consistant. We attended Mongo Boston and we were pleased to see that the size of the crowd confirmed what we were hearing and reading about the adoption of MongoDB. The sessions were super informative with great tips about how to optimize one's setup and what the major gotchas are. What's more, the suggestions and advise were consistant with what we were reading on line from separate sources, which was a refreshing change from some of the blind trial and error experience we had with our previous setup.
Looking forward, we think that we have finally nailed the solution that fits out needs for the foreseeable future. We no longer feel that we are fighting our datastore every step of the way. On the contrary, all our developers have given nothing but positive feedback on their MongoDB experience thus far and report being a lot more productive with it. This is a testament to the thoughtfulness the MongoDB engineering crew and 10gen have put behind the solution and we are looking forward to working with them in the months and years to come.