February 19, 2008

Mike Stonebraker may be oversimplifying data warehousing just a tad

Mike Stonebraker has now responded to the second post in my five-part database diversity series. Takeaways and rejoinders include:

I obviously wasn’t clear when I talked about two major competitive relational challenges to Oracle, et al. I simply was referring to

  1. Mid-range relational DBMS and
  2. High-end analytic DBMS

Earlier I thought Mike was forgetting about the distinction between high-end and mid-range RDBMS. Naturally, that didn’t last long. He’s actually calling the mid-range systems “open source”, but that’s a decent first approximation to a hard-to-define category.

My real reservations about Mike’s post lie in the area of analytic DBMS. Mike points out that there are two kinds — row-based (which he thinks are destined to be obsoleted) and column-based (which he thinks are destined over time to run “the vast majority of analytic workloads”). Now, his predictions may eventually come true. But row stores dominate the specialty data warehouse DBMS market today.

What’s more, some major use cases such as data mining or on-the-fly scoring look inherently row-centric to me. Also, consider website personalization. It calls for pinpoint data lookup, integrated with analytics. Will that eventually be done by beefed-up OLTP systems? Stream processors? Column stores? Analytic row stores? None of the possibilities can yet be ruled out. Indeed, I’m not sure we can even make a good start on predicting the ultimate answer unless we first figure out what will be done in RAM, and what will continue to be driven from disk.
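To make the row-versus-column tradeoff concrete, here is a toy Python sketch. It is only an illustration under simplifying assumptions: the table, the column names, and the helper functions are all hypothetical, and a real analytic DBMS adds compression, indexing, and disk layout concerns that this ignores. The point is just that an aggregate over one column touches less data in a columnar layout, while a pinpoint lookup of a whole record is a single read in a row layout.

```python
# Toy table: three hypothetical customer records (names are illustrative only).
rows = [
    {"customer_id": 1, "region": "east", "spend": 120.0},
    {"customer_id": 2, "region": "west", "spend": 75.5},
    {"customer_id": 3, "region": "east", "spend": 40.0},
]

# Row layout: all attributes of one record are stored together.
row_store = [tuple(r.values()) for r in rows]

# Column layout: all values of one attribute are stored together.
column_store = {name: [r[name] for r in rows] for name in rows[0]}


def total_spend_columnar(store):
    # An aggregate reads only the single column it needs.
    return sum(store["spend"])


def lookup_row(store, position):
    # A pinpoint lookup in the row layout reads one stored record.
    return store[position]


def lookup_columnar(store, position):
    # The same lookup in the column layout must visit every column.
    return tuple(column[position] for column in store.values())
```

Both lookups return the same record; what differs is how many separate structures each one has to visit.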

Speaking of assumptions, there’s a major sub-text coloring all these discussions. Stonebraker is on record claiming that a vast majority of data warehouses (he uses figures up to 99%) have or should have single-fact-table schemas. Indeed, Mike’s columnar product Vertica hasn’t yet been enhanced to handle anything but the single fact table scenario. While that certainly fits a lot of applications, it also leaves a lot out. Profitability calculations like those Kalido specializes in will have one fact table for revenue, but others for costs or margin deductions. Marketing warehouses might have one fact table each for fundamentally different kinds of customer contact (web, phone, etc.), plus one for actual transactions, plus one for external data.

This may ultimately be a distinction without a difference, in that a system well designed for 1 fact table will also do a good job on N fact tables, as long as the N tables have a shared key (e.g., customer ID) that can be used to simultaneously partition them. But it illustrates that columnar systems haven’t proved their eventual dominance quite yet. And if we’re looking at current and near-future use, row-based specialty data warehouse systems still have a huge role to play.
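The shared-key idea can be sketched in a few lines of Python. This is a minimal illustration, not a description of how Vertica or any other product works: the partition count, the modulo scheme, the table contents, and the function names are all assumptions made for the example. Two fact tables (web contacts and transactions) are partitioned on the same customer ID, so each partition can be joined on its own without looking at any other partition.

```python
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical number of nodes


def partition_of(customer_id):
    # Assumed scheme: simple modulo on the shared key.
    return customer_id % NUM_PARTITIONS


def partition_table(rows):
    # Group a fact table's rows by the partition of their customer ID.
    parts = defaultdict(list)
    for row in rows:
        parts[partition_of(row["customer_id"])].append(row)
    return parts


def contacts_and_revenue(contacts, sales):
    # Join two co-partitioned fact tables one partition at a time.
    contact_parts = partition_table(contacts)
    sales_parts = partition_table(sales)
    result = {}
    for p in range(NUM_PARTITIONS):
        counts = defaultdict(int)
        for c in contact_parts.get(p, []):
            counts[c["customer_id"]] += 1
        revenue = defaultdict(float)
        for s in sales_parts.get(p, []):
            revenue[s["customer_id"]] += s["amount"]
        # Every row for a given customer lives in partition p,
        # so no data from other partitions is needed here.
        for cid in counts.keys() | revenue.keys():
            result[cid] = (counts[cid], revenue[cid])
    return result


# Two illustrative fact tables sharing the customer_id key.
web_contacts = [
    {"customer_id": 1, "page": "home"},
    {"customer_id": 1, "page": "cart"},
    {"customer_id": 6, "page": "home"},
]
transactions = [
    {"customer_id": 1, "amount": 30.0},
    {"customer_id": 6, "amount": 12.5},
    {"customer_id": 7, "amount": 5.0},
]

summary = contacts_and_revenue(web_contacts, transactions)
```

If the fact tables had no common key, the per-partition loop would no longer be self-contained, and rows would have to move between partitions before they could be joined.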

Comments

2 Responses to “Mike Stonebraker may be oversimplifying data warehousing just a tad”

  1. Oliver Ratzesberger on February 20th, 2008 4:53 am

    There is also the analytical workload on tables with VERY few columns: e.g. 1 column text. Think 100 billion lines of text. Works very well on certain platforms. Does that make it a row or column store or both? ;-)
    On the analytical db architecture I would agree that there will be many shades of gray. The single fact table approach just doesn’t allow for holding all – and I mean all – of your atomic data in your dw environment. Why would you want to hold all of it? Because it’s the most intrinsic level of data you can get. All metrics, facts and KPIs derive from there – like the center of the universe. As long as you keep all atomic details you can always go back in time and come up with new KPIs and facts. If you keep doing that fact table design you always start from day 0 the moment you implement a new metric. Gets boring really fast.
    And one can’t assume common keys either unless you are talking highly specialized departmental data marts. As you point out you quickly get to far too many dimensions, granularities and hierarchies that require more normalized data structures to hold them. While I can see use cases for column stores, I agree that there might just be as many use cases for row based dbs.
    And there is really nothing preventing hybrid solutions…

  2. Curt Monash on February 20th, 2008 5:06 am

    Oliver,

    I posted a while back on DATAllegro’s rejoinder to the columnar guys — they just partition vertically, getting many of the same benefits. Of course, this isn’t quite as powerful or general in their case as it is in the case of columnar systems, but they have offsetting advantages.

    By the way, I think the ball’s in your court on a phone call.

    Best,

    CAM
