May 16, 2012

Goodbye MongoDB

Over the last two or three years we have been using MongoDB in some mid-size projects.

Now it is time to say goodbye to MongoDB for a variety of technical reasons:

  • the current memory model of MongoDB, based on memory-mapped files, is brain-dead. Leaving memory management to the operating system is a nice idea - in reality it does not scale and does not play well with other services on the same machine. There is no way to control the memory usage with system tools, short of running mongod instances on dedicated virtual machines with no other services. There are numerous complaints from people about this stupid architectural decision, and 10gen is doing nothing to change this brain-dead memory model.
  • Locking: a global server lock is a no-go for a scalable database solution - especially since MongoDB supports only atomic operations. There is some relief in the making, with more granular locking and with temporary yielding of the lock during long-running write operations. But this is more a workaround than a solid and scalable solution.
  • Query engine: the query engine of MongoDB can still use only one index per query. How insane is this? There is no obvious reason why this limitation exists. The index model of MongoDB is very similar to that of relational databases - in fact, it borrows lots of ideas from relational databases. Having worked on indexes and search engines myself for more than a decade, I cannot see any particular reason why the query engine cannot use multiple indexes per query - the query engine appears poorly implemented.
  • Query language: using JSON as a query language was a bad decision. The current JSON query language works for standard queries, but the functionality of the operators is limited. It is still not possible to express arbitrary queries in JSON the way you can in SQL. One could argue: not needed - but in reality there are always cases where you need more complex queries. The only way around this is to implement something client-side or to use server-side JS code execution (single-threaded, slow). There is no option to perform an operation comparable to UPDATE table SET foo=bar WHERE.... (which is possibly low-hanging fruit). There are various odds and ends with the query language and its implementation. E.g. why don't you get an error message when using the $and operator with a MongoDB version that does not support it? Why does MongoDB not complain about inappropriate usage of operators? Look at the mailing list and you will discover such flaws all day long in various postings. Silently discarding errors is the worst thing to do. If there is a problem, then raise the issue and don't sweep it under the carpet.
  • Map-Reduce: Map-reduce in MongoDB feels like a useless appendix bolted on at some point. Same problem as with server-side code execution: it blocks. Now, instead of fixing a bad implementation or the underlying architectural issues, 10gen seems to address the MR limitations by supporting Hadoop for the MR part - either they don't trust their own MR implementation or they won't/can't fix it. No, we do not need more tools for doing map-reduce - there are already too many moving parts in a setup for scalable applications. Either fix MR inside MongoDB or throw it out completely.
  • Sharding: yet another misfeature of MongoDB. Going from a single-server installation to a partitioned setup is a *huge* step. You need at least two replica sets for the shards, three config servers and the load balancers. That's like building a skyscraper beside a small town-house.
  • Data-center awareness: yet another feature that has been tinkered together. Replica sets only support one primary with multiple secondaries. Running a replica set across multiple data centers is doable, but writes can only go to the one primary in one data center. Assume you have a replica set with nodes in Europe, US and Asia, with the current master located in the US: all writes from Europe and Asia need to be performed against the master in the US and are then replicated back to the secondaries in Europe and Asia - insane and not scalable.
  • The "safe" mode is off by default: who made this idiotic decision? There have been many reports from people about data loss - just because "safe" is off by default. Although this is documented here and there: does such a decision build trust in MongoDB? Safe mode must be enabled by default - people should be able to turn it off for performance reasons, with the understanding that turning it off may lead to data loss unless they perform explicit error checking client-side.
  • Journaling: MongoDB pre-allocates 3 GB of data for journaling - independent of the actual database size(s) - insane for small installations.
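
To illustrate the query engine point from the list above: a minimal Python sketch of how two single-field indexes could be combined for one query by intersecting their sorted document-id lists. This is a toy model under stated assumptions, not MongoDB's implementation; the names build_index, intersect and find and the sample documents are made up for illustration.

```python
from collections import defaultdict


def build_index(docs, field):
    """Map each value of `field` to the list of ids of documents holding it.

    Ids are appended in increasing order, so every list is already sorted.
    """
    index = defaultdict(list)
    for doc_id, doc in enumerate(docs):
        if field in doc:
            index[doc[field]].append(doc_id)
    return index


def intersect(a, b):
    """Merge-intersect two sorted id lists in O(len(a) + len(b))."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


# Hypothetical sample data, one index per field.
docs = [
    {"city": "Berlin", "status": "active"},
    {"city": "Berlin", "status": "inactive"},
    {"city": "Paris", "status": "active"},
    {"city": "Berlin", "status": "active"},
]
by_city = build_index(docs, "city")
by_status = build_index(docs, "status")


def find(city, status):
    """Answer an AND query on both fields using both indexes."""
    ids = intersect(by_city.get(city, []), by_status.get(status, []))
    return [docs[i] for i in ids]
```

Calling find("Berlin", "active") touches only the two id lists and never scans the documents themselves. A real query planner would also have to decide when intersecting is cheaper than using a single index and filtering, which this sketch ignores.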

MongoDB currently gets more marketing and hype than it deserves. The primary activity of 10gen right now is touring the world in order to tell everyone how cool MongoDB is. The reason is clear: 10gen is trying to outplay all other databases in the same market with the funding they received from their investors. That is a legitimate goal for 10gen, but the technical foundation is shaky. Many things, like the query language and the query processor, have been half-baked since MongoDB 1.2 (the first version I used) - and no significant improvements have been made since. Many people said that MongoDB 2.0 should have been 1.0 - and I agree with that. Yes, MongoDB is an emerging technology (with potential), but MongoDB is hyped by 10gen as a new enterprise-level database (and perhaps 10gen wants to position MongoDB against Oracle & friends). The truth is that many things are half-baked or need some more iterations before they are fit for public consumption.