December 10, 2015

Readings in Database Systems

Mike Stonebraker and Larry Ellison have numerous things in common. If nothing else:

I mention the latter because there’s a new edition of Readings in Database Systems, aka the Red Book, available online, courtesy of Mike, Joe Hellerstein and Peter Bailis. Besides the recommended-reading academic papers themselves, there are 12 survey articles by the editors, and an occasional response where, for example, editors disagree. Whether or not one chooses to tackle the papers themselves — and I in fact have not dived into them — the commentary is of great interest.

But I would not take every word as the gospel truth, especially when academics describe what they see as commercial market realities. In particular, as per my quip in the first paragraph, the data warehouse market has not yet gone to the extremes that Mike suggests,* if indeed it ever will. And while Joe is close to correct when he says that the company Essbase was acquired by Oracle, what actually happened is that Arbor Software, which made Essbase, merged with Hyperion Software, and the latter was eventually indeed bought by the giant of Redwood Shores.**

*When it comes to data warehouse market assessment, Mike seems to often be ahead of the trend.

**Let me interrupt my tweaking of very smart people to confess that my own commentary on the Oracle/Hyperion deal was not, in retrospect, especially prescient.

Mike pretty much opened the discussion with a blistering attack against hierarchical data models such as JSON or XML. To a first approximation, his views might be summarized as: 

My responses start:

In no particular order, here are some other thoughts about or inspired by the survey articles in Readings in Database Systems, 5th Edition.

9 Responses to “Readings in Database Systems”

  1. Mark Callaghan on December 10th, 2015 4:16 pm

    I agree that most OLTP deployments can use an in-memory DBMS when “can” only considers the database size and RAM available on large DRAM servers and small clusters. But are new deployments choosing to do that?
    Maybe Amazon has data on that from EC2 customers.

    My perception from people making this claim is that it ignores whether customers:
    * are willing to pay for it
    * are willing to pay for the power for it
    * have large DRAM servers available to handle it

    It is also easier for new deployments to go in-memory. The question is whether they will still be in-memory several years later, when they have too much data.

  2. Curt Monash on December 10th, 2015 8:15 pm

    Hi Mark,

    There are several different issues mixed together in your skepticism, and rightly so. For starters:

    1. Existing OLTP systems work over existing OLTP DBMS. Migration, to borrow a phrase from Zork, is slow and tedious.

    2. For a long time, many systems have been configured so that most accesses only go to RAM. If memory serves, the figure for SAP a decade ago was 99 1/2%. (However, fraction of accesses and fraction of work in accessing may be two very different things …)

    3. If OLTP data is of the most classical kind — records of transactions engaged in by humans — then the databases and their growth are limited in size by the amount of actual business activity they track. Your challenge of “Won’t they soon outgrow RAM?” applies mainly to apps with a strong machine-generated aspect and/or to ones that capture interactions more than just actions.
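    A rough back-of-envelope check of that sizing argument (all figures here are illustrative assumptions, not numbers from this discussion):

    ```python
    # Illustrative sizing: does classical human-driven OLTP data fit in RAM?
    # Every number below is an assumption chosen for the sake of arithmetic.
    transactions_per_day = 1_000_000   # a busy, human-driven business
    bytes_per_record = 2_000           # a generously wide transaction record
    years_retained = 10

    total_bytes = transactions_per_day * 365 * years_retained * bytes_per_record
    print(total_bytes / 1e12)  # 7.3 (terabytes over a decade)
    ```

    Even under those generous assumptions, a decade of records lands in the single-digit terabytes, which is within reach of large DRAM servers or a small cluster; it is machine-generated data, not human business activity, that tends to blow past RAM.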

  3. clive boulton on December 11th, 2015 6:24 pm

    “Native XML and JSON stores are apt to have an index on every field. If you squint, that index looks a lot like a column store.”

  4. clive boulton on December 11th, 2015 8:32 pm

    “Nested data structures are more important than Mike’s discussion seems to suggest.”

    Permanodes provide modeling, storing, searching, sharing & syncing data in the post-PC era

  5. Abstract datatypes and extensible RDBMS | Software Memories on December 12th, 2015 6:52 am

    […] my recent Stonebraker-oriented post about database theory and practice over the decades, I […]

  6. David Gruzman on December 18th, 2015 2:42 am

    I think that JSON stores like MongoDB / ElasticSearch fill a very important need.
    Data in many cases is hierarchical by its nature. When it arrives in a large stream, there is no time to normalize it into tables. When some event arrives with a slightly non-standard structure, we cannot stop to redesign our schema. We cannot throw it away either…
    Regarding HDFS — I also do not agree that it is bad. I remember tons of critiques: it is slow, a single point of failure, read-only …
    I know of very few real-life problems with it. There are not many things with the same level of robustness. It solves the problem, after all.

    MapReduce and its generalizations, FlumeJava (for Googlers) and Spark (for the rest of us), give us the capability to build our own execution plans.
    We can also call them directed acyclic execution graphs of computations. I see the proliferation of hints in traditional databases as proof that the ability to write one's own execution plan is a serious need…
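    David's schema-on-read point can be sketched in a few lines (the event shapes and field names below are hypothetical, chosen only to illustrate the idea):

    ```python
    import json

    # Schema-on-read sketch: store raw events untouched, extract fields at
    # query time. An event with extra, unexpected nesting is kept, not rejected.
    raw_events = [
        '{"user": "a", "amount": 10}',
        '{"user": "b", "amount": 5, "coupon": {"code": "X1", "pct": 15}}',
    ]

    def amounts(events):
        # No upfront schema: unknown fields are simply ignored on read.
        return [json.loads(e).get("amount") for e in events]

    print(amounts(raw_events))  # [10, 5]
    ```

    The trade-off versus normalizing into tables up front is that every reader must tolerate structural variation, which is exactly the bargain JSON stores make.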

  7. Aaron on December 21st, 2015 11:24 am

    OLTP systems tend to evolve over time into hybrid “analytic” systems with both singleton read/write characteristics (e.g., order form entry) and more advanced things like lookup (e.g., view order history). Singleton access is completely solved for human interfaces: RDBMS with disk, RDBMS with SSD, and KVP all work. The IO stress *always* comes from the analytic parts, and there are loads of solutions. OLTP is now down to cost, support models, and analytics integration.

  8. Aaron on December 21st, 2015 5:30 pm

    There are strong use cases for M/R on the other side of the spectrum from RDBMS: roll your own. If your application is big enough, it will break the commercial options, and Hadoop solves a fundamental problem: schedule big batch work, and let the programmer drive. That is why it is the first option for someone betting a business on a complex new app (and why pros go nuts looking at those systems).

    The arrays, nested data structures, and hierarchical data feeds and stores mentioned above, together with schema-on-read, solve a couple of other issues. Some data fits poorly into sets: time series (a common use for arrays) and inconsistent hierarchies (triple stores are a common use). This fits David’s example — it’s really use of Hadoop *as a building block* for a customized, use-case-specific data management system.

    This is very similar to OODBMS back when the database was a toolset rather than a product (and you were responsible for locking and consistency), except now we have scale, cost, and data as leverage.

    Expect the world to divide into the SQL-on-Hadoop on-the-box uses, and the exceptional big-data-as-a-differentiator exceptions.

  9. Paul Johnson on March 31st, 2016 3:36 am

    Agree that BI is important, and will continue to be (durr); that machine learning is becoming more important; that abstract data type usefulness is over-stated; that UDF usefulness is over-stated.

    The difficulties around “First we build the perfect schema” data warehouse projects cannot be over-stated.

    This is the subject of a blog article that I’ve been writing in my head for a while. Maybe this is the reminder I needed!
