July 7, 2009

Daniel Abadi has a theory about ParAccel

When I was at SIGMOD last week, ParAccel and its SIGMOD talk were mentioned several times, always in puzzled and at least slightly unflattering terms.  (Typical comment: “Why did they present a paper about that? We were doing the same thing in our company years ago.”) That doesn’t prove much per se, since most of the mentions were by competitors and/or Vertica-affiliated academics, and since my own unflattering ParAccel-related comments were rather fresh at the time.

But now Daniel Abadi has done a brilliant, detailed, speculative analysis of ParAccel’s publications.  Here’s the meat, emphasis mine: Read more

July 2, 2009

The TPC-H schema

Would anybody recommend, in real life, running the TPC-H schema for TPC-H-style data? (I.e., fully normalized, no materialized views.) If so — why????
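For concreteness, here's a tiny sketch (my own illustration, not anything from the TPC or a vendor) of the alternative the question implies: pre-joining TPC-H's lineitem and orders tables into one wide, denormalized table, the way a materialized view or an ETL step typically would in real life.

```python
# A minimal sketch of denormalizing two TPC-H tables into one wide table.
# Illustrative only; the tiny DataFrames below are made-up stand-ins.
import pandas as pd

orders = pd.DataFrame({
    "o_orderkey": [1, 2],
    "o_custkey": [101, 102],
    "o_orderdate": ["1995-01-10", "1995-02-03"],
})
lineitem = pd.DataFrame({
    "l_orderkey": [1, 1, 2],
    "l_partkey": [5001, 5002, 5001],
    "l_extendedprice": [100.0, 250.0, 75.0],
})

# One pre-joined table: queries against it skip the orders/lineitem join entirely.
wide = lineitem.merge(orders, left_on="l_orderkey", right_on="o_orderkey")
print(wide)
```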

July 2, 2009

Notes on columnar/TPC-H compression

I was chatting with Omer Trajman of Vertica, and he said that a 70% compression figure for ParAccel’s recent TPC-H filing sounded about right.*  When I noted that seemed kind of low, Omer pointed out that TPC-H data is pseudo-random, while real-life data has much more correlation among the values in different columns. E.g., in retail, a customer is likely to consistently shop at the same stores and to put similar items into his shopping basket.

*Omer was involved in Vertica’s TPC-H-data-based load speed benchmark, and is Vertica’s representative to the TPC.

But why does this matter? After all, Vertica compresses one column at a time (unlike, say, Clearpace).  Well, the reason is that Vertica — like other column stores — wants to store different columns in the same row order, for obvious benefits in both reading and writing.  So, for example, if all the rows that include Gotham City are grouped sequentially, then all the rows mentioning Bruce Wayne are likely to be near each other as well, while none of the rows that mention Clark Kent will be mixed in.

And when a set of consecutive entries has low cardinality, it’s easier to get high levels of compression.
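To make that concrete, here's a little sketch (my illustration, not Vertica's or ParAccel's actual code) of run-length encoding a column stored in sorted order versus the same values in pseudo-random, TPC-H-style order.

```python
# Run-length encoding: consecutive low-cardinality values collapse to almost nothing.
import random

def run_length_encode(column):
    """Collapse consecutive repeats into [value, count] runs."""
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs

cities = ["Gotham City"] * 500 + ["Metropolis"] * 500
sorted_col = cities                              # rows grouped by city, as in a sorted column store
random_col = random.sample(cities, len(cities))  # pseudo-random order, as in TPC-H data

print(len(run_length_encode(sorted_col)))   # 2 runs -- the column compresses to almost nothing
print(len(run_length_encode(random_col)))   # hundreds of runs -- far less compression
```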

June 23, 2009

ParAccel pricing

As I noted in connection with ParAccel’s recent TPC-H filing, I think the whole exercise is basically an expensive joke. But one slightly useful spin-off is that ParAccel disclosed pricing.  Specifically, ParAccel’s stated price in the disclosure document is:

Last year ParAccel quoted prices of $100,000/TB or $50,000/server.  The latter figure would seem to have led to lower numbers on the benchmark configuration, so perhaps it’s no longer an option on ParAccel’s price list.
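For what it's worth, here's the back-of-the-envelope arithmetic (mine, not ParAccel's) behind that guess, applying last year's two quoted prices to the 43-node, 30-terabyte configuration from the TPC-H post below.

```python
# Comparing the two previously quoted pricing schemes on the benchmark configuration.
data_tb = 30      # 30 terabytes of benchmark data
servers = 43      # nodes in the TPC-H configuration discussed below

per_tb_total = 100_000 * data_tb       # $100,000/TB    -> $3,000,000
per_server_total = 50_000 * servers    # $50,000/server -> $2,150,000

print(per_tb_total, per_server_total)  # the per-server scheme comes out lower
```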

June 22, 2009

The TPC-H benchmark is a blight upon the industry

ParAccel has released a 30,000-gigabyte TPC-H benchmark, and no less a sage than Merv Adrian paid attention. Now, the TPCs may have had some use in the 1990s. Indeed, Merv was my analyst relations contact for a visit to my clients at Sybase around the time — 1996 or so — I was advising Sybase on how to market against its poor benchmark results. But TPCs are worthless today.

It’s not just that TPCs are highly tuned (ParAccel’s claim of “load-and-go” is laughable. Edit: Looking at Appendix A of the full disclosure report, maybe it’s more justified than I thought.). It’s also not just that different analytic database management products perform very differently on different workloads, making the TPC-H not much of an indicator of anything real-life. The biggest problem is: Most TPC benchmarks are run on absurdly unrealistic hardware configurations.

For example, if you look at some details, the ParAccel 30-terabyte benchmark ran on 43 nodes, each with 64 gigabytes of RAM and 24 terabytes of disk. That’s 961,124.9 gigabytes of disk, officially, for a 32:1 disk/data ratio. By way of contrast, real-life analytic DBMS with good compression often have disk/data ratios of well under 1:1.

Meanwhile, the RAM:data ratio is around 1:11. It’s clear that ParAccel’s early TPC-H benchmarks ran entirely in RAM; indeed, ParAccel even admits that. And so I conjecture that ParAccel’s latest TPC-H benchmark ran (almost) entirely in RAM as well. Once again, this would illustrate that the TPC-H is irrelevant to judging an analytic DBMS’ real-world performance.
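For the record, here's the arithmetic behind those ratios, using the figures above (a quick sketch, nothing more):

```python
# Disk/data and RAM/data ratios for the 43-node, 30-terabyte configuration.
nodes = 43
ram_per_node_gb = 64
disk_total_gb = 961_124.9      # the official disk total from the filing
data_gb = 30_000               # 30 terabytes of benchmark data

print(f"disk:data ~ {disk_total_gb / data_gb:.0f}:1")              # ~32:1
print(f"RAM:data  ~ 1:{data_gb / (nodes * ram_per_node_gb):.0f}")  # 2,752 GB of RAM, ~1:11
```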

More generally — I would not advise anybody to consider ParAccel’s product, for any use, except after a proof-of-concept in which ParAccel was not given the time and opportunity to perform extensive off-site tuning. I tend to feel that way about all analytic DBMS, but it’s a particular concern in the case of ParAccel.

March 2, 2009

Ideas for BI POCs

Kevin Spurway of Altosoft has a post up offering his suggestions on how to do business intelligence POCs (Proofs-of-Concept). Among the best ideas in his post are:

The post’s worst, or at least most self-serving, idea is:

Of course, he didn’t phrase it exactly that way, but that was the gist.

Actually, the more realistically your POC models:

the more reliable it will be.

February 25, 2009

Even more final version of my TDWI slide deck

My TDWI talk on How to Select an Analytic DBMS starts in less than an hour.  So the latest version of my slide deck should prove truly final, unlike my prior two.

I won’t have printouts or other access to my notes, so those aren’t a good guide to the actual verbiage I’ll use.

February 25, 2009

Partial overview of Ab Initio Software

Ab Initio is an absurdly secretive company, as per a couple of prior posts and the comment threads on same. But yesterday at TDWI I actually found civil people staffing an Ab Initio trade show booth. Based on that conversation and other tidbits, I think it’s fairly safe to say: Read more

February 23, 2009

MapReduce user eHarmony chose Netezza over Aster or Greenplum

Depending on which IDG reporter you believe, eHarmony has either 4 TB of data or more than 12 TB, stored in Oracle but now analyzed on Netezza.  Interestingly, eHarmony is a Hadoop/MapReduce shop, but chose Netezza over Aster Data or Greenplum even so.  Price was apparently an important aspect of the purchase decision. Netezza also seems to have had a very smooth POC. Read more

February 18, 2009

My TDWI Night School course Wednesday night

I imagine everybody who’s actually going to TDWI knows how to read an agenda.  But in case you missed it, I’ll be holding forth Wednesday night on How to Select an Analytic DBMS.  I’ve already posted the slides.
