Comments on: The TPC-H schema

By: Historical significance of TPC benchmarks | Software Memories

Historical significance of TPC benchmarks | Software Memories — Wed, 31 Mar 2010 08:38:17 +0000

[…] couple of recent conversations about the TPC-H benchmark. Some people suggest that, while almost untethered from real-world computing, TPC-Hs inspire real world product improvements. Richard Gostanian even […]

By: Curt Monash

Curt Monash — Mon, 06 Jul 2009 20:57:52 +0000

Hi Greg,

I’m honestly not sure one way or the other about the mix of data warehouses in the real world. There are examples of just about anything. There’s clearly a lot of potential value to the proposition “Run with your old schema, but a lot faster, in addition to adding new tables and queries.” But I can’t think of a single case that has both of these properties:

A. Needs absolutely the most screaming tippy-top raw performance.
B. Has a schema with no performance optimizations.

I just can’t think of an actual real-world case that comes anywhere close to the tradeoffs and requirements of the TPC-H.

By: Greg Rahn

Greg Rahn — Mon, 06 Jul 2009 15:57:16 +0000

@Curt

I believe you missed the question from my previous comment: Do you feel there is something fundamentally wrong with the TPC-H schema design?

To answer your question: “Did I [Curt] understand correctly?”. Based on your response, it would appear not.

I’m struggling to understand how you go from my comment:

…isn’t the reality that most data warehouses probably “suffer” from a less than academically perfect data model…

to your comment

I think you’re hypothesizing sites that are foolishly wedded to theory…

Those seem like completely orthogonal thoughts to me. No?

In making my comment I was suggesting that if you indeed feel there is a better design for the TPC-H schema, that very well may be. However, I believe that many existing data warehouse data models could be improved (in an academic sense), but the reality of the situation is that they exist; thus TPC-H, as-is, is probably more representative of real-world data warehouses than an academically designed schema.

Hopefully that clears up your misunderstanding.

By: Curt Monash

Curt Monash — Mon, 06 Jul 2009 05:53:05 +0000

Greg,

I talked about that snowflake-only claim with Daniel Abadi just this week at SIGMOD. He says that the CIO conversations really happened that way, and now calls it a bad sample.

As for the rest — could you please fill in a few more details of your straw man? I think you’re hypothesizing sites that are foolishly wedded to theory, have few challenges in update latency, have hugely demanding requirements in performance, and can’t be bothered to look at more than a single benchmark number in evaluating a multimillion dollar purchase. Did I understand correctly?

By: Greg Rahn

Greg Rahn — Mon, 06 Jul 2009 04:03:38 +0000

@Curt In asking such a question are you suggesting there is something fundamentally wrong with the TPC-H schema? While one could argue there are better ways to model the TPC-H schema, isn't the reality that most data warehouses probably "suffer" from a less than academically perfect data model? I guess this is quite contrary to the findings of Michael Stonebraker et al.

In interviewing about two dozen CIOs, the authors have never seen a warehouse that did not use a snowflake schema.

By: Curt Monash

Curt Monash — Fri, 03 Jul 2009 08:15:51 +0000

Justin,

As per my various posts on database emulation/portability, vendors who boast such features tend to think they are much more important than customers do. 🙂

By: Justin Swanhart

Justin Swanhart — Fri, 03 Jul 2009 01:15:44 +0000

“Actually, I tend to frown on materialized views, on the level that if you need more than a very few of them, you’d probably be better off w/ a faster DBMS that doesn’t need as many and hence has much less of an administrative burden.”

I agree, but this is problematic if you want to keep your existing tools, scripts, etc. You are kind of stuck because it is hard to change databases. This is why Kickfire is great, because if you are already running MySQL, just about everything you are used to doing is going to work similarly or exactly the same on Kickfire.

By: Justin Swanhart

Justin Swanhart — Fri, 03 Jul 2009 01:11:35 +0000

It all depends on the system. If you are using materialized views, it usually isn’t for convenience; it is for performance. Building a materialized view might take a long time, but in the long run, if you can amortize the cost of maintaining the view over time using incremental materialization, then it is time well spent. I’d rather run a query which takes 24 hours once, then spend 15 minutes per day maintaining the result, than run the full query every day.

Very few databases and tools support incremental materialization though. It is not a trivial problem.
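To make the amortization argument concrete, here is a minimal sketch in Python of incremental maintenance for a simple SUM/COUNT aggregate. It is an illustration under stated assumptions, not how any particular DBMS does it: the function names `full_refresh` and `incremental_refresh`, the `(key, amount)` row shape, and the sample data are all hypothetical. The point is only that applying the day's delta rows gives the same answer as recomputing from scratch.

```python
from collections import defaultdict


def full_refresh(rows):
    """Recompute the aggregate from scratch: (total, row count) per key."""
    view = defaultdict(lambda: [0, 0])
    for key, amount in rows:
        view[key][0] += amount
        view[key][1] += 1
    return {k: tuple(v) for k, v in view.items()}


def incremental_refresh(view, inserted=(), deleted=()):
    """Apply only the delta rows to an existing view (hypothetical helper)."""
    updated = {k: list(v) for k, v in view.items()}
    for key, amount in inserted:
        entry = updated.setdefault(key, [0, 0])
        entry[0] += amount
        entry[1] += 1
    for key, amount in deleted:
        entry = updated[key]
        entry[0] -= amount
        entry[1] -= 1
        if entry[1] == 0:
            # No base rows left for this key, so drop it from the view.
            del updated[key]
    return {k: tuple(v) for k, v in updated.items()}


# Made-up base table and one day's worth of changes.
base = [("east", 10), ("west", 5), ("east", 7)]
view = full_refresh(base)

delta_in = [("west", 3), ("north", 4)]
delta_out = [("east", 10)]
view = incremental_refresh(view, delta_in, delta_out)

# The incrementally maintained view matches a full recomputation.
new_base = [r for r in base if r not in delta_out] + delta_in
assert view == full_refresh(new_base)
```

Keeping the row count next to the total is what lets a delete remove a key entirely. Aggregates such as MIN, MAX, or COUNT DISTINCT do not decompose this neatly, which is part of why the general problem is not trivial.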

Another problem is actually using the mviews. If you have a tool like Mondrian, which understands how to write queries to access the materialized data, then you are set. Oracle supports materialized view rewrite, which does it automatically as long as you define hierarchies. Otherwise you have to rewrite your queries to access the materializations, which is inconvenient at best.
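As a rough picture of what such a rewrite decides, here is a toy sketch in Python. It is not Oracle's or Mondrian's actual logic; `MV_DIMS`, `MV_ROWS`, and `answer_from_mview` are hypothetical names, and the data is made up. It assumes the view stores SUM(sales) by (year, region), and answers a query from the view only when the requested grouping columns are a subset of the view's dimensions.

```python
from collections import defaultdict

# Hypothetical materialized view: SUM(sales) grouped by (year, region).
MV_DIMS = ("year", "region")
MV_ROWS = {
    (2008, "east"): 100,
    (2008, "west"): 40,
    (2009, "east"): 70,
}


def answer_from_mview(group_by):
    """Return SUM(sales) grouped by `group_by` by rolling up the view,
    or None if the view cannot answer and the base table is needed."""
    if not set(group_by) <= set(MV_DIMS):
        return None
    positions = [MV_DIMS.index(col) for col in group_by]
    result = defaultdict(int)
    for key, total in MV_ROWS.items():
        # Sums can be re-aggregated, so coarser groupings roll up safely.
        result[tuple(key[p] for p in positions)] += total
    return dict(result)


by_year = answer_from_mview(("year",))    # answered from the view
by_month = answer_from_mview(("month",))  # view has no month column
```

Here `by_year` is computed from three view rows rather than the base table, while `by_month` comes back as `None` because the view is too coarse. Without a tool or optimizer making that decision, the person writing the query has to make it by hand.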

By: Jerome Pineau

Jerome Pineau — Thu, 02 Jul 2009 21:49:08 +0000

Ohh I am going to quote this one 🙂 In light of my just posted quip on http://jeromepineau.blogspot.com

By: Curt Monash

Curt Monash — Thu, 02 Jul 2009 20:37:29 +0000

Actually, I tend to frown on materialized views, on the level that if you need more than a very few of them, you’d probably be better off w/ a faster DBMS that doesn’t need as many and hence has much less of an administrative burden.

Similarly, in cases where Justin’s critique is applicable, that would seem to imply establishing the MVs is VERY expensive. But an MV is really just a big query. So if running a big query is stupefyingly slow … again, maybe you’re on the wrong platform.

Infobright-like systems that automagically create quasi-MVs on the fly may be excused from part or all of this criticism …