There’s a growing trend for DBMS to beef up their support for multiple data manipulation languages (DMLs) or APIs — and there’s a special boom in JSON support, MongoDB-compatible or otherwise. So I talked earlier tonight with IBM’s Bobbie Cochrane about how JSON is managed in DB2.
For starters, let’s note that there are at least four strategies IBM could have used.
- Store JSON in a BLOB (Binary Large OBject) or similar existing datatype. That’s what IBM actually chose.
- Store JSON in a custom datatype, using the datatype extensibility features DB2 has had since the 1990s. IBM is not doing this, and doesn’t see a need to at this time.
- Use DB2 pureXML, along with some kind of JSON/XML translator. DB2 managed JSON this way in the past, via UDFs (User-Defined Functions), but that implementation is superseded by the new BLOB-based approach, which offers better performance in ingest and query alike.
- Shred — to use a term from XML days — JSON into a bunch of relational columns. IBM experimented with this approach, but ultimately rejected it. In dismissing shredding, Bobbie also disdained any immediate support for schema-on-need.
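The trade-off behind the first and fourth strategies can be illustrated with a toy sketch. This is my own illustration, not IBM's code: it stores one nested JSON document two ways, and shows why shredding forces every nested array into its own child table (the hidden "side tables" discussed below).

```python
import json

# One nested JSON document, MongoDB-style. Names are invented for illustration.
doc = {
    "_id": 1,
    "name": "Acme",
    "contacts": [
        {"type": "email", "value": "info@acme.example"},
        {"type": "phone", "value": "555-0100"},
    ],
}

# Strategy 1 (what IBM chose): store the whole document opaquely,
# as DB2 does with a BLOB. One table, one column, no shredding.
blob = json.dumps(doc).encode("utf-8")

# Strategy 4 (rejected): "shred" the document into flat relational rows.
# The top level fits one table, but each nested array needs a child table
# keyed back to the parent -- which is where the extra tables multiply.
customers = [{"_id": doc["_id"], "name": doc["name"]}]
contacts = [{"customer_id": doc["_id"], **c} for c in doc["contacts"]]

print(len(customers), len(contacts))
```

The blob round-trips losslessly with a single parse, while the shredded form needs a join (and more tables) to reassemble the original document.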
IBM’s technology choices are of course influenced by its use case focus. It’s reasonable to divide MongoDB use cases into two large buckets:
- Hardcore internet and/or machine-generated data, for example from a website.
- Enterprise data aggregation, for example a “360-degree customer view.”
IBM’s DB2 JSON features are targeted at the latter bucket. Also, I suspect that IBM is generally looking for a way to please users who want to keep exercising and building on their MongoDB skills.

More technically speaking, IBM is assuming somewhat simpler JSON than might be stored in MongoDB. Notes on that include:
- IBM doesn’t encourage some of the extreme denormalization that goes on in joinless MongoDB.
- If more complex JSON becomes a big deal in DB2, IBM could reconsider its decision not to optimize via a custom datatype.
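To make "simpler JSON" concrete, here is a hedged illustration (my example, not IBM's): a deeply denormalized MongoDB-style document that embeds repeated sub-documents, next to the flatter shape DB2's JSON features seem aimed at.

```python
# Invented example data -- the shapes, not the values, are the point.

# Extreme denormalization, joinless-MongoDB style: orders embedded in the
# customer, with product details duplicated into every line item.
denormalized = {
    "customer": "Acme",
    "orders": [
        {"id": 17, "items": [{"sku": "A-1", "name": "Widget", "price": 9.99}]},
        {"id": 18, "items": [{"sku": "A-1", "name": "Widget", "price": 9.99}]},
    ],
}

# The simpler shape: values at or near the top level, where leaf extraction
# and expression-based indexing are straightforward.
simpler = {"customer": "Acme", "segment": "manufacturing", "region": "EMEA"}

flat = all(not isinstance(v, (dict, list)) for v in simpler.values())
print(flat)
```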
Specifics of IBM’s JSON read/write story in DB2 include:
- Right now the main API seems to be MongoDB compatibility, which is fairly complete except for a couple of what IBM is sure are just “edge cases”.
- IBM plans to embed some JSON syntax into SQL, so that JSON fields can be referenced much as regular columns are.
- Indexes are via a feature in the recent DB2 10.5 that lets you index on expressions.
- Updates write to the whole BLOB.
- You can extract JSON leaf values directly. This capability is implemented via a fenced* UDF. For more complex extractions you have to bring back the whole BLOB. (IBM could fix that in a subsequent release via an unfenced* UDF, if it chooses.)
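The leaf-extraction point above can be sketched in a few lines. This is a toy stand-in for the kind of fenced-UDF leaf extraction described, written in Python for illustration; the function name, dotted-path convention, and NULL-for-missing semantics are my assumptions, not DB2's actual API.

```python
import json

def json_leaf(blob: bytes, path: str):
    """Extract a single scalar leaf from a JSON blob by dotted path.

    Illustrative stand-in for DB2's fenced-UDF leaf extraction --
    not IBM's function or its exact semantics.
    """
    node = json.loads(blob)
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None  # missing path -> SQL-style NULL
        node = node[key]
    # Only scalar leaves come back directly; for anything more complex
    # you have to bring back (and parse) the whole blob.
    return None if isinstance(node, (dict, list)) else node

blob = json.dumps({"name": "Acme", "address": {"city": "Armonk"}}).encode()
print(json_leaf(blob, "address.city"))  # -> Armonk
```

An expression such as `json_leaf(doc, "address.city")` is also roughly the kind of thing DB2 10.5's index-on-expression feature would let you index.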
DB2 also has some capability to traverse JSON, but I’m not clear as to where exactly that kicks in.
*The fenced vs. unfenced distinction for analytic platforms refers to out-of-process vs. in-process execution.
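A toy analogy for that footnote (my illustration, not DB2 internals): run the same computation in-process and then in a separate process, which mirrors the unfenced/fenced trade-off of speed versus isolation.

```python
import subprocess
import sys

def square(x):  # stands in for a user-defined function
    return x * x

# Unfenced-style: runs inside the calling process -- fast, but a bug
# here can crash or corrupt the host.
unfenced = square(12)

# Fenced-style: runs out-of-process -- a crash is contained, at the cost
# of process-launch and IPC overhead on every call.
result = subprocess.run(
    [sys.executable, "-c", "print(12 * 12)"],
    capture_output=True, text=True, check=True,
)
fenced = int(result.stdout)

print(unfenced, fenced)  # -> 144 144
```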
And finally, some of IBM’s choices are easier to understand in the context of a DB2 concept called “side tables”. As best I understand:
- DB2 side tables are tables that are somewhat hidden from users, DBAs, and DBA tools. Thus, accommodating side tables in, for example, backup procedures is inconvenient, albeit not totally impossible.
- Shredding makes heavy use of side tables, which is a big part of why IBM disfavors it.
- The actual DB2 JSON implementation involves a single side table — I think to track the indexes — which is one more side table than IBM thinks should be involved long term.