Along with CouchDB/Couchbase, MongoDB was one of the top examples I had in mind when I wrote about document-oriented NoSQL. Created by 10gen, MongoDB is an open-source, schemaless DBMS, which makes it well suited to very quick development cycles. Accordingly, there are a lot of MongoDB users who build small things quickly. But MongoDB has heftier uses as well, and naturally I’m focused more on those.
MongoDB’s data model is based on BSON (Binary JSON), which amounts to JSON-on-steroids. In particular:
- You just bang things into single BSON objects managed by MongoDB; there is nothing like a foreign key to relate objects. However …
- … there are fields, datatypes, and so on within MongoDB BSON objects. Fields can be indexed (only _id is indexed automatically).
- There’s a multi-value/nested-data-structure flavor to MongoDB; for example, a BSON object might store multiple addresses in an array.
- You can’t do joins in MongoDB. Instead, you are encouraged to put what would be related records in a relational database into a single MongoDB object (see the sketch after this list). If that doesn’t suffice, use client-side logic to do the equivalent of joins. If that doesn’t suffice either, you’re not looking at a good MongoDB use case.
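To make that concrete, here’s a minimal sketch in Python with the pymongo driver. The database, collection, and field names are all hypothetical, invented for illustration; the point is that one document embeds what a relational schema would spread across two tables, and that secondary indexes are declared per field.

```python
from pymongo import MongoClient, ASCENDING

client = MongoClient("localhost", 27017)
db = client.crm  # hypothetical database name

# One BSON document holds what a relational schema might split across
# a customers table and a separate addresses table.
db.customers.insert_one({
    "name": "Acme Corp",
    "segment": "enterprise",
    "addresses": [  # nested data: an array of sub-documents
        {"type": "billing", "city": "Boston"},
        {"type": "shipping", "city": "Chicago"},
    ],
})

# Only _id is indexed automatically; other fields are indexable on request.
db.customers.create_index([("segment", ASCENDING)])

# No join needed: one lookup returns the customer plus all its addresses.
doc = db.customers.find_one({"name": "Acme Corp"})
print(doc["addresses"])
```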
MongoDB has integrated MapReduce. Natural uses include:
- The usual kinds of transformations one might do via MapReduce.
- Aggregations one might otherwise do in SQL (e.g., GROUP BY kinds of things are an obvious MapReduce fit; see the sketch below).
Improved aggregation/MapReduce performance is a roadmap item.
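As an illustration of the GROUP BY point above, here’s a pymongo sketch that runs MongoDB’s server-side mapReduce command. The orders collection and its region/amount fields are made up; the map and reduce functions are JavaScript that executes on the server. It’s roughly SELECT region, SUM(amount) FROM orders GROUP BY region:

```python
from pymongo import MongoClient
from bson.code import Code

client = MongoClient()
db = client.analytics  # hypothetical database name

# JavaScript functions shipped to the server: map runs per document,
# reduce runs per group of emitted values.
mapper = Code("function () { emit(this.region, this.amount); }")
reducer = Code("function (key, values) { return Array.sum(values); }")

# Run the mapReduce command; out={'inline': 1} returns results
# directly instead of writing them to an output collection.
result = db.command("mapReduce", "orders",
                    map=mapper, reduce=reducer, out={"inline": 1})
for row in result["results"]:
    print(row["_id"], row["value"])  # region, total amount
```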
However, Dwight Merriman said MongoDB has excellent performance at simple real-time reporting, for example updating a counter 10,000 times per second. When I asked him why, his reasons included:
- Memory-mapped storage (data files are mapped straight into RAM).
- Deferred writes — a write might take a couple of seconds to actually persist.
- Optimism — you don’t have to wait for an acknowledgement if you write something to the database.
- “Upsert in place” – update in place without first checking whether you’re doing an update or an insert (see the sketch after this list).
- General lack of overhead.
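Putting a couple of those together, here’s a sketch of the fast-counter case: an unacknowledged write (the “optimism” point) combined with an upsert that increments in place. Again, the database, collection, and key names are hypothetical:

```python
from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

client = MongoClient()
db = client.metrics  # hypothetical database name

# w=0 asks for no acknowledgement: the client fires the write and
# moves on, which is the "optimism" described above.
counters = db.get_collection("counters",
                             write_concern=WriteConcern(w=0))

# Upsert in place: one operation increments the counter document if it
# exists, or creates it if it doesn't; no read-before-write.
counters.update_one(
    {"_id": "page:/home"},   # hypothetical counter key
    {"$inc": {"hits": 1}},
    upsert=True,
)
```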
Inspired in part by Schooner’s internal benchmarks, I’ve come to think that, apples-to-apples, even the simplest key-value store will have less than a 3X single-node performance advantage over well-implemented MySQL. (Read, write, or blended.) The 10gen guys don’t dispute that. However, they point out that a single MongoDB request can process the equivalent of many relational rows, opening the possibility of much greater performance gains. In particular, there seem to be some Drupal implementations enjoying huge MongoDB-based speed-ups.
On a heavy-duty server (8-12 cores, 16-64 gigabytes of RAM), MongoDB can apparently do 20,000-30,000 writes or 100,000 reads per second. Improved concurrency and mixed read/write performance are coming, although obviously 10gen would think that MongoDB does pretty well in those areas already. The largest known MongoDB system does about 1 million reads/second. I would imagine that those figures require the database to fit into RAM, given:
- MongoDB’s memory-mapped architecture.
- This post mortem of the MongoDB Foursquare outage.
I’ve gotten conflicting signals as to whether there are any multi-hundred-node MongoDB deployments. But there are “lots” of 20-40 node ones, as well as lots under 10 nodes. Note that 1 million reads/second naively sounds as if it could be achieved on 10 MongoDB nodes (10 nodes × 100,000 reads/second each) — but I’d guess that’s not really the configuration.
MongoDB’s scale-out story starts with transparent sharding, although I think transparent sharding was a new feature in MongoDB 1.6 last summer. So far as I understand, what MongoDB sacrifices in the CAP Theorem is the A rather than the P — partitions happen, and when they do you might not be able to write to the shard you want to; availability suffers while consistency is preserved. A MongoDB shard can have multiple nodes, in a master-slave set-up, and there’s some flexibility as to how intra-shard consistency is handled.
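For what that set-up looks like from a client, here’s a sketch using MongoDB’s enableSharding and shardCollection admin commands via pymongo. The host, database, collection, and shard key are hypothetical:

```python
from pymongo import MongoClient

# Connect to a mongos query router fronting the sharded cluster
# (hostname is hypothetical).
client = MongoClient("mongos.example.com", 27017)

# Mark a database as sharded, then shard one of its collections
# on a chosen shard key.
client.admin.command("enableSharding", "appdb")
client.admin.command("shardCollection", "appdb.events",
                     key={"user_id": 1})

# From here on, mongos routes each read or write to the shard that
# owns the relevant user_id range; application code is unchanged.
```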
MongoDB functionality futures include full-text search and extensions to the general query language.