July 2, 2009

The TPC-H schema

Would anybody recommend in real life running the TPC-H schema for that data? (I.e., fully normalized, no materialized views.) If so — why????
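
To make the question concrete: every TPC-H query runs straight against the normalized base tables, with nothing pre-aggregated. For example, Query 3 (transcribed approximately from the spec, with the usual default substitution parameters; the benchmark returns only the top 10 rows) joins CUSTOMER, ORDERS, and LINEITEM at query time:

    select
        l_orderkey,
        sum(l_extendedprice * (1 - l_discount)) as revenue,
        o_orderdate,
        o_shippriority
    from
        customer, orders, lineitem
    where
        c_mktsegment = 'BUILDING'
        and c_custkey = o_custkey
        and l_orderkey = o_orderkey
        and o_orderdate < date '1995-03-15'
        and l_shipdate > date '1995-03-15'
    group by
        l_orderkey, o_orderdate, o_shippriority
    order by
        revenue desc, o_orderdate;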

Comments

13 Responses to “The TPC-H schema”

  1. Jerome Pineau on July 2nd, 2009 3:10 pm

    It appears the Star Schema Benchmark (SSB) [O’Neil, O’Neil, and Chen] might be more realistic. Even TPC-DS is better, although my intuition tells me that one will never see the light of day.

  2. Justin Swanhart on July 2nd, 2009 4:23 pm

    The SSB is unrealistic IMHO, as it requires that absolutely no caching be used. You pretty much have to restart the database (preferably the whole server) between queries.

    SSB is an academic benchmark designed to test star schema optimization, and star schema optimization only. They encourage materialization, btw. In fact, the PhD candidates who built the benchmark did so in order to test a new materialization strategy!

    Also, the SSB dbgen tool is buggy and crashes on both Windows and Linux if you try to generate more than one table at a time.

    The biggest problem with materialized views is that if you don’t choose the right ones early on, it is very difficult to build them later, as it takes forever to materialize the results in the first place. This is also a problem when reloading a large warehouse after a disaster, or when loading a very large database for the first time.

    Materialization can go a long way toward improving performance, but the costs to materialize data can be very large, and they have to be factored into the equation.

  3. Justin Swanhart on July 2nd, 2009 4:26 pm

    Also, as the SSB benchmark designers note, the TPC-H schema is whack anyway. There is a granularity mismatch between some of the tables.

    I’ve seen schemas similar to TPC-H(tm) used in real life, though. It sits somewhere between a pure OLTP schema and a pure DW schema, and I’m pretty sure there are lots of reporting databases that fit that description.
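
    For illustration, here is a minimal sketch of the grain issue (the SQL is mine, not the SSB authors’; the LINEORDER column names follow the SSB paper, but treat the details as approximate). TPC-H keeps order-level and line-level facts in separate tables that must be joined at query time, while SSB collapses them into a single line-grain fact table:

        -- TPC-H: facts at two grains, joined at query time
        select o_orderdate, sum(l_extendedprice)
        from orders, lineitem
        where l_orderkey = o_orderkey
        group by o_orderdate;

        -- SSB (sketch): one LINEORDER fact table at line-item grain,
        -- carrying order-level columns such as lo_orderdate on every row
        select lo_orderdate, sum(lo_extendedprice)
        from lineorder
        group by lo_orderdate;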

  4. Curt Monash on July 2nd, 2009 4:37 pm

    Actually, I tend to frown on materialized views, on the level that if you need more than a very few of them, you’d probably be better off w/ a faster DBMS that doesn’t need as many and hence has much less of an administrative burden.

    Similarly, in cases where Justin’s critique is applicable, that would seem to imply that establishing the MVs is VERY expensive. But an MV is really just a big query. So if running a big query is stupefyingly slow … again, maybe you’re on the wrong platform.

    Infobright-like systems that automagically create quasi-MVs on the fly may be excused from part or all of this criticism …

  5. Jerome Pineau on July 2nd, 2009 5:49 pm

    “Actually, I tend to frown on materialized views, on the level that if you need more than a very few of them, you’d probably be better off w/ a faster DBMS that doesn’t need as many and hence has much less of an administrative burden.”

    Ohh, I am going to quote this one 🙂 in light of my just-posted quip on http://jeromepineau.blogspot.com

  6. Justin Swanhart on July 2nd, 2009 9:11 pm

    It all depends on the system. If you are using materialized views, it usually isn’t for convenience; it is for performance. Building a materialized view might take a long time, but in the long run, if you can amortize the cost of maintaining the view over time using incremental materialization, then it is time well spent. I’d rather run a query that takes 24 hours once and then spend 15 minutes per day maintaining the result than re-run the full query every day.

    Very few databases and tools support incremental materialization though. It is not a trivial problem.

    Another problem is actually using the mviews. If you have a tool like Mondrian, which understands how to write queries that access the materialized data, then you are set. Oracle supports materialized view query rewrite, which does this automatically as long as you define hierarchies. Otherwise you have to rewrite your queries to access the materializations, which is inconvenient at best.
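
    To make that concrete, here is a minimal Oracle-flavored sketch of both ideas: incremental (“fast”) refresh driven by a materialized view log, and transparent query rewrite. The sales table and its columns are hypothetical examples, not TPC-H objects, and real fast-refresh eligibility rules have more fine print than shown:

        -- Change log so the view can be maintained incrementally
        create materialized view log on sales
          with sequence, rowid (sale_date, amount)
          including new values;

        -- Aggregate view: incrementally maintained, eligible for rewrite
        create materialized view sales_by_day
          build immediate
          refresh fast on commit
          enable query rewrite
        as
        select sale_date,
               count(*)      as cnt,          -- counts needed for fast refresh
               count(amount) as cnt_amount,
               sum(amount)   as total_amount
        from sales
        group by sale_date;

        -- This query never mentions the view; the optimizer may
        -- rewrite it to scan sales_by_day instead of sales
        select sale_date, sum(amount)
        from sales
        group by sale_date;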

  7. Justin Swanhart on July 2nd, 2009 9:15 pm

    “Actually, I tend to frown on materialized views, on the level that if you need more than a very few of them, you’d probably be better off w/ a faster DBMS that doesn’t need as many and hence has much less of an administrative burden.”

    I agree, but this is problematic if you want to keep your existing tools, scripts, etc. You are kind of stuck, because it is hard to change databases. This is why Kickfire is great: if you are already running MySQL, just about everything you are used to doing is going to work similarly or exactly the same on Kickfire.

  8. Curt Monash on July 3rd, 2009 4:15 am

    Justin,

    As per my various posts on database emulation/portability, vendors who boast such features tend to think they are much more important than customers do. 🙂

  9. Greg Rahn on July 6th, 2009 12:03 am

    @Curt

    In asking such a question are you suggesting there is something fundamentally wrong with the TPC-H schema?

    While one could argue there are better ways to model the TPC-H schema, isn’t the reality that most data warehouses probably “suffer” from a less than academically perfect data model? I guess this is quite contrary to the findings of Michael Stonebraker et al.

    “In interviewing about two dozen CIOs, the authors have never seen a warehouse that did not use a snowflake schema.”

  10. Curt Monash on July 6th, 2009 1:53 am

    Greg,

    I talked about that snowflake-only claim with Daniel Abadi just this week at SIGMOD. He says that the CIO conversations really happened that way, and now calls it a bad sample.

    As for the rest — could you please fill in a few more details of your straw man? I think you’re hypothesizing sites that are foolishly wedded to theory, have few challenges in update latency, have hugely demanding requirements in performance, and can’t be bothered to look at more than a single benchmark number in evaluating a multimillion dollar purchase. Did I understand correctly?

  11. Greg Rahn on July 6th, 2009 11:57 am

    @Curt

    I believe you missed the question from my previous comment: Do you feel there is something fundamentally wrong with the TPC-H schema design?

    To answer your question (“Did I [Curt] understand correctly?”): based on your response, it would appear not.

    I’m struggling to understand how you go from my comment:

    …isn’t the reality that most data warehouses probably “suffer” from a less than academically perfect data model…

    to your comment

    I think you’re hypothesizing sites that are foolishly wedded to theory…

    Those seem like completely orthogonal thoughts to me. No?

    In making my comment I was suggesting that if you indeed feel there is a better design for the TPC-H schema, that very well may be. However, I believe that many existing data warehouse data models could be improved (in an academic sense); the reality of the situation is that they exist, and thus TPC-H, as-is, is probably more representative of real-world data warehouses than an academically designed schema would be.

    Hopefully that clears up your misunderstanding.

  12. Curt Monash on July 6th, 2009 4:57 pm

    Hi Greg,

    I’m honestly not sure one way or the other about the mix of data warehouses in the real world. There are examples of just about anything. There’s clearly a lot of potential value to the proposition “Run with your old schema, but a lot faster, in addition to adding new tables and queries.” But I can’t think of a single case that has both the properties:

    A. Needs absolutely the most screaming tippy-top raw performance.
    B. Has a schema with no performance optimizations.

    I just can’t think of an actual real-world case that comes anywhere close to the tradeoffs and requirements of TPC-H.

  13. Historical significance of TPC benchmarks | Software Memories on March 31st, 2010 4:38 am

    […] couple of recent conversations about the TPC-H benchmark.  Some people suggest that, while almost untethered from real-world computing, TPC-Hs inspire real world product improvements.  Richard Gostanian even […]
