Hmm. I probably should have broken this out as three posts rather than one after all. Sorry about that.
Discussions of DBMS performance are always odd, for starters because:
- Workloads and use cases vary greatly.
- As a result, benchmarks such as YCSB or TPC-H aren’t very helpful.
- It’s common for databases, or at least working sets, to be entirely in RAM, but that isn’t always required.
- Consistency and durability models vary. What’s more, in some systems — e.g. MongoDB — there’s considerable flexibility as to which model you use.
- In particular, there’s an increasingly common choice in which data is written synchronously to RAM on 2 or more servers, then asynchronously to disk on each of them. Performance in these cases can be quite different from when all writes need to be committed to disk. Of course, you need sufficient disk I/O to keep up, so SSDs (Solid-State Drives) can come in handy.
- Many workloads are inherently single node (replication aside). Others are not.
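To make the "synchronous to RAM, asynchronous to disk" point concrete, here is a toy sketch in Python. It is not any vendor's implementation; the `Replica` and `Cluster` classes, the `durable` flag and the `disk_writes_on_ack_path` counter are all names I made up for illustration. The only thing it tries to show is where the disk work sits relative to the acknowledgement.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One hypothetical server: an in-memory buffer plus a simulated disk."""
    memory: list = field(default_factory=list)
    disk: list = field(default_factory=list)

    def accept(self, record):
        # Synchronous step: the record lands in RAM only.
        self.memory.append(record)

    def flush(self):
        # Move everything buffered in RAM to the simulated disk.
        flushed = len(self.memory)
        self.disk.extend(self.memory)
        self.memory.clear()
        return flushed


class Cluster:
    """Hypothetical cluster that replicates every write to all replicas."""

    def __init__(self, replica_count):
        self.replicas = [Replica() for _ in range(replica_count)]
        # Counts disk flushes the client had to wait for.
        self.disk_writes_on_ack_path = 0

    def write(self, record, durable=False):
        """Acknowledge once every replica holds the record in RAM.

        With durable=True each replica also flushes before the ack,
        modelling the stricter "commit to disk first" policy.
        """
        for replica in self.replicas:
            replica.accept(record)
            if durable:
                replica.flush()
                self.disk_writes_on_ack_path += 1
        return True  # the acknowledgement

    def background_flush(self):
        # A real system would run this on a timer; here it is called by hand.
        return sum(replica.flush() for replica in self.replicas)


if __name__ == "__main__":
    cluster = Cluster(replica_count=2)
    cluster.write({"id": 1})                # acked with nothing on disk yet
    cluster.background_flush()              # disk catches up later
    cluster.write({"id": 2}, durable=True)  # client waits for both disks
    print(cluster.disk_writes_on_ack_path)
```

In the non-durable case the client never waits on a disk, which is why performance can look so different, and also why the background flush needs enough I/O to keep the in-memory buffers from growing without bound.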
MongoDB and 10gen
I caught up with Ron Avnur at 10gen. Technical highlights included:
- MongoDB’s tunable consistency seems really interesting, with numerous choices available at the program-statement level.
- All rumored performance problems notwithstanding, Ron asserts that MongoDB often “kicks butt” in actual proof-of-concept (POC) bake-offs.
- Ron cites “12 different language bindings” as a key example of developer functionality giving 10gen an advantage vs. Ron’s previous employer MarkLogic.
- 10gen is working hard on management tools, security, and so on.
- Ron claims that the “MongoDB loses data” knock is a relic of the distant — i.e. 1-2 years ago — past.
- We had the same “Who needs joins?” discussion that I used to have with MarkLogic, a position MarkLogic has since disavowed.
- There’s nothing special about MongoDB’s B-tree indexes. (I mention that because Tokutek thinks it offers a faster MongoDB indexing option.)
While this wasn’t a numbers-oriented conversation, business highlights included:
- A lot of MongoDB’s competition is RDBMS — Oracle, SQL Server, MySQL, etc.
- MongoDB’s top NoSQL competitor is Cassandra. 10gen sees less Couchbase than before, and also less HBase than Cassandra.
- There’s yet another favorable MongoDB soft metric — 50,000 registrants for free online education, 2/3 outside the US.
I can add that anecdotal evidence from other industry participants suggests there’s a lot of MongoDB mindshare.
Specific traditional-enterprise use cases we discussed focused on combining data from heterogeneous systems. Specifically mentioned were:
- Reference data/360-degree customer view.
- Reference data about securities.
- Aggregation of analytic results from various analytic systems across an enterprise (for risk management).
DBAs’ roles in development
A lot of marketing boils down to “We don’t need no stinking DBAs!!!” I’m thinking in particular of:
- Hadoop and/or exploratory BI* messaging that positions against the alleged badness of “traditional data warehousing”.
*See in particular the comments to that post.
The worst-case data warehousing scenario is indeed pretty bad. It could feature:
- Much internal discussion and politicking to determine the One True Way to view various data fields, with …
- … lots of ongoing bureaucratic safeguards in the area of data governance.
- Long additional efforts in the area of performance tuning.
- Data integration projects up the wazoo.
But if the goal is just to grab some data from an existing data warehouse, perhaps add in some additional data from the outside, and start analyzing it — well, then there are many attempted solutions to that problem, including from within the analytic RDBMS world. The question is whether the data warehouse administrators try to help — which usually means “Here’s your data; now go away and stop bothering me!” — or whether they focus on “business prevention”.
Meanwhile, on the NoSQL side:
- The smart folks at WibiData felt the need for schema-definition tools over HBase.
- Per Ron Avnur, MongoDB users are clamoring for consistency-rule specification via an administrative (rather than programmatic) UI.
It’s the old loose-/tight-coupling trade-off. Traditional relational practices offer a clean interface between database and code, but bundle the database characteristics for different applications tightly together. NoSQL tends to tie the database for any one app tightly to that app, at the cost of difficulties if multiple applications later try to use the same data. Either can make sense, depending on (for example):
- How it seems natural to organize your development and data administration talent.
- Whether the app is likely to survive long enough that you’ll want to run many other applications against the same database.
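A small Python sketch may help show the trade-off. It is only an illustration under my own assumptions: the `customer` table, the email rule, and the `app_a_save` / `app_b_save` functions are hypothetical, SQLite stands in for "a relational DBMS", and a plain list of JSON strings stands in for "a document store".

```python
import json
import sqlite3


def relational_store():
    """The rule lives in the database, so every application is bound by it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customer ("
        " id INTEGER PRIMARY KEY,"
        " email TEXT NOT NULL CHECK (email LIKE '%@%'))"
    )
    return conn


def try_relational_insert(conn, email):
    """Return True if the database accepted the row, False if it refused."""
    try:
        conn.execute("INSERT INTO customer (email) VALUES (?)", (email,))
        return True
    except sqlite3.IntegrityError:
        return False


def app_a_save(docs, customer):
    """App A (hypothetical) enforces the rule in its own code."""
    email = customer.get("email") or ""
    if "@" not in email:
        raise ValueError("app A rejects customers without a valid email")
    docs.append(json.dumps(customer))


def app_b_save(docs, customer):
    """App B (hypothetical), written later, knows nothing of app A's rule."""
    docs.append(json.dumps(customer))


if __name__ == "__main__":
    conn = relational_store()
    print(try_relational_insert(conn, "a@example.com"))
    print(try_relational_insert(conn, "not-an-email"))

    docs = []
    app_a_save(docs, {"email": "a@example.com"})
    app_b_save(docs, {"email": "not-an-email"})  # slips past app A's rule
    print(len(docs))
```

Neither half is wrong. The relational version pays for its shared rule with coordination among everybody who touches the table; the document version is quick for app A alone, and the cost only shows up once app B arrives.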