October 30, 2009

A question on MDX performance

An enterprise user wrote in with a question that boils down to:

What are reasonable MDX performance expectations?

MDX doesn’t come up in my life very much, and I don’t have much intuition about it. E.g., I don’t know whether one can slap an MDX-to-SQL converter on top of a fast analytic RDBMS and go to town. What’s more, I’m heading off on vacation and don’t feel like researching the matter myself in the immediate future. 🙂

So here’s the long form of the question. Any thoughts?

I have a general question on assessing the performance of an OLAP technology using a set of MDX queries. I would be interested to know if there are any benchmark MDX performance tests/results comparing different OLAP technologies (which may be based on different underlying DBMS’s if appropriate) on similar hardware setup, or even comparisons of complete appliance solutions. More generally, I want to determine what performance limits I could reasonably expect on what I think are fairly standard servers.

In my own work, I have set up a star schema model centered on a Fact table of 100 million rows (approx 60 columns), with dimensions ranging in cardinality from 5 to 10,000. In ad hoc analytics, is it expected that any query against such a dataset should return a result within a minute or two (i.e. before a user gets impatient), regardless of whether that query returns 100 cells or 50,000 cells (without relying on any aggregate table or caching mechanism)? Or is that level of performance only expected with a high end massively parallel software/hardware solution? The server specs I’m testing with are: 32-bit 4 core, 4GB RAM, 7.2k RPM SATA drive, running Windows Server 2003; 64-bit 8 core, 32GB RAM, 3 Gb/s SAS drive, running Windows Server 2003 (x64).

I realise that caching of query results and pre-aggregation mechanisms can significantly improve performance, but I’m coming from the viewpoint that in purely exploratory analytics, it is not possible to have all combinations of dimensions calculated in advance, in addition to being maintained.

September 30, 2009

Facts and rumors

July 27, 2009

XtremeData announces its DBx data warehouse appliance

XtremeData is announcing its DBx data warehouse appliance today. Highlights include: Read more

July 8, 2009

While I’m venting about benchmarks

Late last year, Vertica made hoo-hah about what it called a world-record data warehouse load speed benchmark.  I wrote at the time that this showed Vertica wasn’t painfully slow at loading, always a concern with column stores. But otherwise I mocked the idea that there was something useful to be learned from the whole exercise.

Well, guess what?  In a throwaway line in a comment on Daniel Abadi’s blog, Barry Zane of ParAccel pointed out

we posted a load rate of almost 9TB/hour, which is, of course record breaking on its own

Quite right.

I hope the nonsense stops there, but I’m not optimistic …

July 7, 2009

Daniel Abadi has a theory about ParAccel

When I was at SIGMOD last week, ParAccel and its SIGMOD talk were mentioned several times, always in puzzled and at least slightly unflattering terms.  (Typical comment: “Why did they present a paper about that? We were doing the same thing in our company years ago.”) That doesn’t prove much per se, since most of the mentions were by competitors and/or Vertica-affiliated academics, and since my own unflattering ParAccel-related comments were rather fresh at the time.

But now Daniel Abadi has done a brilliant, detailed, speculative analysis of ParAccel’s publications.  Here’s the meat, emphasis mine: Read more

July 2, 2009

The TPC-H schema

Would anybody recommend in real life running the TPC-H schema for that data? (I.e., fully normalized, no materialized views.) If so — why????

July 2, 2009

Notes on columnar/TPC-H compression

I was chatting with Omer Trajman of Vertica, and he said that a 70% compression figure for ParAccel’s recent TPC-H filing sounded about right.*  When I noted that seemed kind of low, Omer pointed out that TPC-H data is pseudo-random, while real-life data has much more correlation among the values in different columns. E.g., in retail, a customer is likely to consistently shop at the same stores and to put similar items into his shopping basket).

*Omer was involved in Vertica’s TPC-H-data-based load speed benchmark, and is Vertica’s representative to the TPC.

But why does this matter? After all, Vertica compresses one column at a time (unlike, say, Clearpace).  Well, the reason is that Vertica — like other column stores — wants to store different columns in the same row order, for obvious benefits in both reading and writing.  So, for example, if all the rows that include Gotham City are grouped sequentially, then all the rows mentioning Bruce Wayne are likely to be near each other as well, while none of the rows that mention Clark Kent will be mixed in.

And when a set of consecutive entries has low cardinality, it’s easier to get high levels of compression.

June 23, 2009

ParAccel pricing

As I noted in connection with ParAccel’s recent TPC-H filing, I think the whole exercise is basically an expensive joke. But one slightly useful spin-off is that ParAccel disclosed pricing.  Specifically, ParAccel’s stated price in the disclosure document is:

Last year ParAccel quoted prices of $100,000/TB or $50,000/server.  The latter figure would seem to have led to lower numbers on the benchmark configuration, so perhaps it’s no longer an option on ParAccel’s price list.

June 22, 2009

The TPC-H benchmark is a blight upon the industry

ParAccel has released a 30,000-gigabtye TPC-H benchmark, and no less a sage than Merv Adrian paid attention. Now, the TPCs may have had some use in the 1990s. Indeed, Merv was my analyst relations contact for a visit to my clients at Sybase around the time — 1996 or so — I was advising Sybase on how to market against its poor benchmark results. But TPCs are worthless today.

It’s not just that TPCs are highly tuned (ParAccel’s claim of “load-and-go” is laughable Edit: Looking at Appendix A of the full disclosure report, maybe it’s more justified than I thought.). It’s also not just that different analytic database management products perform very differently on different workloads, making the TPC-H not much of an indicator of anything real-life.  The biggest problem is: Most TPC benchmarks are run on absurdly unrealistic hardware configurations.

For example, if you look at some details, the ParAccel 30-terabyte benchmark ran on 43 nodes, each with 64 gigabytes of RAM and 24 terabytes of disk. That’s 961,124.9 gigabytes of disk, officially, for a 32:1 disk/data ratio. By way of contrast, real-life analytic DBMS with good compression often have disk/data ratios of well under 1:1.

Meanwhile, the RAM:data ratio is around 1:11  It’s clear that ParAccel’s early TPC-H benchmarks ran entirely in RAM; indeed, ParAccel even admits that.  And so I conjecture that ParAccel’s latest TPC-H benchmark ran (almost) entirely in RAM as well. Once again, this would illustrate that the TPC-H is irrelevant to judging an analytic DBMS’ real world performance.

More generally — I would not advise anybody to consider ParAccel’s product, for any use, except after a proof-of-concept in which ParAccel was not given the time and opportunity to perform extensive off-site tuning. I tend to feel that way about all analytic DBMS, but it’s a particular concern in the case of ParAccel.

March 2, 2009

Ideas for BI POCs

Kevin Spurway of Altosoft has a post up offering his suggestions on how to do business intelligence POCs (Proofs-of-Concept). Among the best ideas in his post are:

The post’s worst, or at least most self-serving, idea is:

Of course, he didn’t phrase it exactly that way, but that was the gist.

Actually, the more realistically your POC models:

the more reliable it will be.

← Previous PageNext Page →

Feed: DBMS (database management system), DW (data warehousing), BI (business intelligence), and analytics technology Subscribe to the Monash Research feed via RSS or email:

Login

Search our blogs and white papers


Warning: include(): php_network_getaddresses: getaddrinfo failed: Name or service not known in /home/dbms2cm/public_html/wp-content/themes/monash/static_sidebar.php on line 29

Warning: include(http://www.monash.com/blog-promo.php): failed to open stream: php_network_getaddresses: getaddrinfo failed: Name or service not known in /home/dbms2cm/public_html/wp-content/themes/monash/static_sidebar.php on line 29

Warning: include(): Failed opening 'http://www.monash.com/blog-promo.php' for inclusion (include_path='.:/usr/lib/php:/usr/local/lib/php') in /home/dbms2cm/public_html/wp-content/themes/monash/static_sidebar.php on line 29