When I talked with MarkLogic’s Ken Chestnut about MarkLogic 4.2, I was surprised to learn that MarkLogic really, truly doesn’t do anything like a join. Unlike some other non-SQL DBMS, MarkLogic has no SQL interface, no ODBC or JDBC. Nothing, nada. (MarkLogic has a Java interface for Xquery, but not for anything like SQL.)
Since MarkLogic and other XML DBMS are used in applications for brokerage trades and the like, I used that area as my example for a challenge question: What happens when one brokerage firm buys another? (Similar challenges could be made about medical records or consumer profiling.) The answer was that you just have to update or augment each existing record with the new firm’s information. And by the way, if you choose to augment, then you have the new and old information side-by-side, both of which could conceivably come in handy.
Document-oriented NoSQL DBMS such as CouchDB and MongoDB face similar challenges, of course. I didn’t pursue the matter in depth in either case, but:
- If I understood Damien Katz correctly, CouchDB has a view capability that provides some kind of workaround. (A quick web search turned up this page on a kind of entity-relationship modeling in CouchDB and the associated querying.)
- Dwight Merriman suggested to me that in MongoDB, you can work around the lack of joins via client-side logic, or by embedding lots of data in each document (e.g., all the line items for an order) and extracting what you need via MapReduce jobs.
I’m not totally sure what I think about joinlessness, but one way of looking at it could be:
- The reason we have joins is because we normalize. If it’s OK to be highly denormalized, then it’s less important to have joins.
- When normalization is good and denormalization is bad, one or both of two reasons are commonly in play:
- The logical burden of keeping straight all the different places you’d have to update the same data is too great for the poor, overburdened programmers.
- The performance burden of doing all that updating is too great for the poor, overburdened hardware.
- For the logical reason to have great force, there has to be a pretty complex schema, or else a frequently changing one. But when schemas change frequently, relational designs have their own problems.
- The physical reason automatically has great force if you have huge update volumes and keep many copies of the same data. Otherwise, its strength has a lot to do with the specific architecture of the DBMS. E.g., if it’s a lot cheaper to update a small record than a big one, short rows are better. But otherwise, denormalization may not have that much effect on performance.
Putting all that together, I’m inclined to think that for many applications, it’s OK to denormalize, or to have such a simple schema that normalization is moot. But even so, I’d be a lot more comfortable if a DBMS offered at least some way of doing a join.
All this raises a related question: What are transactions like in document-oriented DBMS? I’ve never pushed the point with MarkLogic, but when they talk of their ACID compliance they sound as if they are using the phrase in the usual way. MongoDB only lets you do transactions in single documents. I’ve never asked the question about CouchDB, but I do note with interest CouchDB’s “crash-only” architecture, which boils down to:
- CouchDB shutdown is “instantaneous.”
- You can only shut down CouchDB by crashing it.
- There’s no way to shut down or crash CouchDB that causes data to be inconsistent.