September 21, 2013

Schema-on-need

Two years ago I wrote about how Zynga managed analytic data:

Data is divided into two parts. One part has a pretty ordinary schema; the other is just stored as a huge list of name-value pairs. (This is much like eBay's approach with its Teradata-based Singularity, except that eBay puts the name-value pairs into long character strings.) … Zynga adds data into the real schema when it's clear it will be needed for a while.

What was then the province of a few huge web companies is now poised to be a broader trend. Specifically:

That migration from virtual to physical columns is what I’m calling “schema-on-need”. Thus, schema-on-need is what you invoke when schema-on-read no longer gets the job done. 😉

For years people have been putting data into DBMSs (usually but not exclusively relational ones), building some indexes immediately, then adding more indexes later to improve performance as requirements are discovered. Materialized views play a similar role. Schema-on-need is a continuation of the same idea, but targeted at poly-structured data.
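The index-later pattern described above can be sketched in a few lines. This is an illustrative toy using SQLite, not any particular vendor's mechanism; the table and column names are invented:

```python
import sqlite3

# Load data first; decide on indexes later, once query patterns are known.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, action TEXT, ts TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(1, "login", "2013-09-01"), (2, "purchase", "2013-09-02")],
)

# Months later, filtering on `action` turns out to be common, so an
# index is added after the fact -- no reload of the data required.
conn.execute("CREATE INDEX idx_events_action ON events (action)")
rows = conn.execute(
    "SELECT user_id FROM events WHERE action = 'purchase'"
).fetchall()
print(rows)
```

Schema-on-need applies the same "pay the physical-design cost only when justified" logic to columns rather than indexes.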

Why not just materialize all possible columns immediately? There are two main reasons:

Hadapt’s approach to schema-on-need

Hadapt has a particularly interesting approach to schema-on-need. As you may recall, Hadapt adds software to an ordinary Hadoop cluster so that it also functions as a decent analytic DBMS. One of those additions is a copy of PostgreSQL on every node, and in its next patch, Hadapt will incorporate modifications to PostgreSQL that are designed to support schema-on-need. More precisely, Hadapt will offer transparent support for physical schema-on-need, so that logically you can have a fairly tabular schema in place from the get-go.
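To make the virtual-to-physical migration concrete, here is a minimal sketch of the general idea: rows carry a catch-all field of name-value pairs, and a key that proves query-worthy gets promoted to a real column. This is a hypothetical illustration built on SQLite with a JSON catch-all; it is not Hadapt's actual implementation, and all names are invented:

```python
import json
import sqlite3

# Start with a catch-all column holding arbitrary name-value pairs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, extras TEXT)")
conn.executemany(
    "INSERT INTO events (id, extras) VALUES (?, ?)",
    [(1, json.dumps({"level": 7, "clan": "red"})),
     (2, json.dumps({"level": 9}))],
)

# Promotion: once `level` is clearly needed for a while, materialize it
# as a physical column and backfill it with one scan of the catch-all.
conn.execute("ALTER TABLE events ADD COLUMN level INTEGER")
for row_id, extras in conn.execute("SELECT id, extras FROM events").fetchall():
    conn.execute("UPDATE events SET level = ? WHERE id = ?",
                 (json.loads(extras).get("level"), row_id))

# Subsequent queries hit the physical column directly.
result = conn.execute("SELECT id FROM events WHERE level > 8").fetchall()
print(result)
```

The point of "transparent" support is that queries look the same before and after the promotion; only the physical execution changes.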

The key elements of Hadapt’s approach are:

Indeed, a Hadapt column can even be part physical, part logical. The main use for this is in cases when most but not all values in a column have the same datatype. The data of majority datatype then goes into the physical column, while the rest is stored in the catch-all key-value column.
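A part-physical, part-logical column can also be sketched: values of the majority datatype live in the physical column, the oddballs stay in the catch-all, and reads fall back from one to the other. Again a hypothetical SQLite/JSON toy with invented names, not Hadapt's code:

```python
import json
import sqlite3

# `score` is mostly integers (physical column); the rare string value
# stays behind in the key-value catch-all.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (id INTEGER PRIMARY KEY, score INTEGER, extras TEXT)"
)
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        (1, 42, "{}"),                            # majority datatype: physical
        (2, None, json.dumps({"score": "N/A"})),  # minority: stays key-value
    ],
)

def get_score(row_id):
    """Read the physical column first; fall back to the catch-all."""
    score, extras = conn.execute(
        "SELECT score, extras FROM t WHERE id = ?", (row_id,)
    ).fetchone()
    return score if score is not None else json.loads(extras).get("score")

print(get_score(1), get_score(2))
```

Most rows get fast typed-column access, while the minority-datatype stragglers still resolve correctly, just more slowly.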

In a naive version of this strategy, queries would involve retrieving a lot of rows that each have a wide field, then scanning the wide field to extract a particular keyed value. That could be slow and tedious. Hadapt’s hacks to mitigate that performance problem include:

And yes — Hadapt originally looked into using PostgreSQL’s JSON and key-value datatypes, but determined they were much too slow for these purposes.

This is new technology, so of course there are various rough edges.

Whether the missing features are soon added will depend in part on whether Hadapt commits to a strategy such as “analytic DBMS for multi-structured data”. I think they will, but it’s too early to be certain.

What about short-request schema-on-need?

As I’ve described it, schema-on-need seems focused on analytic/query-mainly use cases. Still, the “real-time” analytics boom suggests it may be relevant on the short-request side too. Stay tuned as I try to figure out just what is and isn’t happening in the relational+JSON world, for example in products such as MemSQL or DB2.

Related links

Comments

10 Responses to “Schema-on-need”

  1. Steve Florek on September 22nd, 2013 11:19 am

    Isn’t this the Return of the Object-Relational Database, this time starring JSON as the object representation? What seems to be popular today is “lazy impedance matching” -- only create relational projections of subsets of data as needed for analytical purposes, instead of OR’s grand unification approach targeted towards application data persistence.

  2. Curt Monash on September 22nd, 2013 12:10 pm

    Steve,

    The ORDBMS wound up just being a general framework for accommodating new datatypes. Schema-on-need is a particular way of making use of that framework.

    Separately, that framework isn’t an exact fit for these extensibility needs, so it wouldn’t be quite right to say that all that’s happening here is a use of the general extensibility framework.

  3. Hadoop startup Hadapt pushing ‘schema-less SQL’ — Tech News and Analysis on September 23rd, 2013 12:55 pm

    […] joint SQL-NoSQL support is likely to become a lot more normal for analytic databases. Curt Monash has a good technical breakdown of the new Hadapt feature. Related researchSubscriber Content? Subscriber Content comes from […]

  4. Phil Ayres on September 24th, 2013 4:10 am

    This sounds like an interesting approach. I would be interested to compare Hadapt to ElasticSearch, which is schema-less, sucking up JSON and providing great query performance on any nested attribute. I have started to see others use this type of DB for previously transactional database requirements.

    Definitely another tech I need to research more closely.

  5. JSON in Teradata | DBMS 2 : DataBase Management System Services on October 24th, 2013 7:42 am

    […] coined the term schema-on-need last month. More precisely, I coined it while being briefed on JSON-in-Teradata, which was […]

  6. Vertica 7 | DBMS 2 : DataBase Management System Services on December 5th, 2013 2:50 pm

    […] Zone, a schema-on-need technology very much like Hadapt’s (but of course with access to Vertica […]

  7. DBMS2 revisited | DBMS 2 : DataBase Management System Services on March 23rd, 2014 7:52 am

    […] Schema-on-read, and its smarter younger brother schema-on-need. […]

  8. Dynamic Late Binding Schemas on Need | Database Fog Blog on April 3rd, 2014 2:27 pm

    […] very much like Curt Monash’s posts on dynamic schemas and schema-on-need… here and here are two examples. They make me think… But I am missing something… I mean […]

  9. Thoughts on in-memory columnar add-ons | DBMS 2 : DataBase Management System Services on October 21st, 2014 12:59 am

    […] of this approach, I don’t believe in complete alternatives to analytic RDBMS. The rise of analytic DBMS oriented toward multi-structured data just strengthens that […]

  10. Notes on analytic technology, May 13, 2015 | DBMS 2 : DataBase Management System Services on May 13th, 2015 10:38 pm

    […] If I’m right that analytics is or at least should be modular and iterative, it’s easy to see why people hate multi-year data warehouse creation projects. Perhaps it’s also easy to see why I like the idea of schema-on-need. […]
