Two years ago I wrote about how Zynga managed analytic data:
Data is divided into two parts. One part has a pretty ordinary schema; the other is just stored as a huge list of name-value pairs. (This is much like eBay‘s approach with its Teradata-based Singularity, except that eBay puts the name-value pairs into long character strings.) … Zynga adds data into the real schema when it’s clear it will be needed for a while.
What was then the province of a few huge web companies is now poised to be a broader trend. Specifically:
- Relational DBMS are adding or enhancing their support for complex datatypes, to accommodate various kinds of machine-generated data.
- MongoDB-compatible JSON is the flavor of the day on the short-request side, but alternatives include other JSON, XML, other key-value, or text strings.
- It is often possible to index on individual attributes inside the complex datatype.
- The individual attributes inside the complex datatypes amount to virtual columns, which can play similar roles in SQL statements as physical columns do.
- Over time, the DBA may choose to materialize virtual columns as additional physical columns, to boost query performance.
That migration from virtual to physical columns is what I’m calling “schema-on-need”. Thus, schema-on-need is what you invoke when schema-on-read no longer gets the job done.
For years people have been putting data into DBMS (usually but not exclusively relational ones), building some indexes immediately, then adding more indexes to improve performance later as requirements are discovered. Materialized views play a similar role. Schema-on-need is a continuation of the same idea, but targeted at poly-structured data.
Why not just materialize all possible columns immediately? There are two main reasons:
- The result would be unwieldy and sparse. In some nested data structure cases, it seems that billions of columns could be needed. (I haven’t checked that estimate on my own, but it sounds plausible.)
- While materializing a column helps query performance, it slows down writes. (Indexing generally has the same tradeoffs.)
Hadapt’s approach to schema-on-need
Hadapt has a particularly interesting approach to schema-on need. As you may recall, Hadapt adds software to an ordinary Hadoop cluster so that it also functions as a decent analytic DBMS. One of those additions is a copy of PostgreSQL on every node, and in its next patch, Hadapt will incorporate modifications to PostgreSQL that are designed to support schema-on-need. More precisely, Hadapt will offer transparent support for physical schema-on-need, so that logically you can have a fairly tabular schema in place from the get-go.
The key elements of Hadapt’s approach are:
- A custom datatype to accommodate and serialize sets of key-value pairs. Each key equates to a logical column.
- A redirector that points query execution steps at physical or logical columns as the case may be.
Indeed, a Hadapt column can even be part physical, part logical. The main use for this is in cases when most but not all values in a column have the same datatype. The data of majority datatype then goes into the physical column, while the rest is stored in the catch-all key-value column.
In a naive version of this strategy, queries would involve retrieving a lot of rows that each have a wide field, then scanning the wide field to extract a particular keyed value. That could be slow and tedious. Hadapt’s hacks to mitigate that performance problem include:
- The keys are encoded together at the start of the field to be fixed-length. Hadapt says this allows for fast binary search.
- Stored along with the keys are offsets to indicate where the (variable-length) values can be found.
And yes — Hadapt originally looked into using PostgreSQL’s JSON and key-value datatypes, but determined they were much too slow for these purposes.
This is new technology, so of course there are various rough edges.
- While the key-value pairs could come from JSON, XML or other sources, Hadapt doesn’t currently offer much support for JSON- or XML-based data manipulation languages, basic ingest perhaps excepted. Rather, the data is just sets of key-value pairs, which may or may not have interesting naming conventions for the keys.
- The PostgreSQL optimizer and query planner are in the dark; they just assume all columns are physical. Ditto the global Hadapt query engine.
- Transitioning columns from virtual to physical status is still a very manual process.
- If you try to do something on a multi-datatype column that could get you into trouble, there isn’t much protection in place against the resulting misfortunes.
Whether the missing features are soon added will depend in part on whether Hadapt commits to a strategy such as “analytic DBMS for multi-structured data”. I think they will, but it’s too early to be certain.
What about short-request schema-on-need?
As I’ve described it, schema-on-need seems focused on analytic/query-mainly use cases. Still, the “real-time” analytics boom suggests it may be relevant on the short-request side too. Stay tuned as I try to figure out just what is and isn’t happening in the relational+JSON world, for example in products such as MemSQL or DB2.
- Hadapt’s architecture (June, 2013, with links to earlier posts)
- DBMS with multiple DMLs (September, 2013)
- The relational/non-tabular tradeoff (January, 2013)
- Dynamic schemas (July, 2011)