Columnar database management

Analysis of products and issues in column-oriented database management systems. Related subjects include:

July 2, 2009

Notes on columnar/TPC-H compression

I was chatting with Omer Trajman of Vertica, and he said that a 70% compression figure for ParAccel’s recent TPC-H filing sounded about right.*  When I noted that seemed kind of low, Omer pointed out that TPC-H data is pseudo-random, while real-life data has much more correlation among the values in different columns. E.g., in retail, a customer is likely to consistently shop at the same stores and to put similar items into his shopping basket).

*Omer was involved in Vertica’s TPC-H-data-based load speed benchmark, and is Vertica’s representative to the TPC.

But why does this matter? After all, Vertica compresses one column at a time (unlike, say, Clearpace).  Well, the reason is that Vertica — like other column stores — wants to store different columns in the same row order, for obvious benefits in both reading and writing.  So, for example, if all the rows that include Gotham City are grouped sequentially, then all the rows mentioning Bruce Wayne are likely to be near each other as well, while none of the rows that mention Clark Kent will be mixed in.

And when a set of consecutive entries has low cardinality, it’s easier to get high levels of compression.

June 22, 2009

The TPC-H benchmark is a blight upon the industry

ParAccel has released a 30,000-gigabtye TPC-H benchmark, and no less a sage than Merv Adrian paid attention. Now, the TPCs may have had some use in the 1990s. Indeed, Merv was my analyst relations contact for a visit to my clients at Sybase around the time — 1996 or so — I was advising Sybase on how to market against its poor benchmark results.  But TPCs are worthless today.

It’s not just that TPCs are highly tuned (ParAccel’s claim of “load-and-go” is laughable Edit: Looking at Appendix A of the full disclosure report, maybe it’s more justified than I thought.). It’s also not just that different analytic database management products perform very differently on different workloads, making the TPC-H not much of an indicator of anything real-life.  The biggest problem is: Most TPC benchmarks are run on absurdly unrealistic hardware configurations.

For example, if you look at some details, the ParAccel 30-terabyte benchmark ran on 43 nodes, each with 64 gigabytes of RAM and 24 terabytes of disk. That’s 961,124.9 gigabytes of disk, officially, for a 32:1 disk/data ratio. By way of contrast, real-life analytic DBMS with good compression often have disk/data ratios of well under 1:1.

Meanwhile, the RAM:data ratio is around 1:11  It’s clear that ParAccel’s early TPC-H benchmarks ran entirely in RAM; indeed, ParAccel even admits that.  And so I conjecture that ParAccel’s latest TPC-H benchmark ran (almost) entirely in RAM as well. Once again, this would illustrate that the TPC-H is irrelevant to judging an analytic DBMS’ real world performance.

More generally — I would not advise anybody to consider ParAccel’s product, for any use, except after a proof-of-concept in which ParAccel was not given the time and opportunity to perform extensive off-site tuning. I tend to feel that way about all analytic DBMS, but it’s a particular concern in the case of ParAccel.

June 8, 2009

Per-terabyte pricing

Software-only DBMS vendors sometimes price per terabyte of user data.  Vertica’s list price is $100K/TB. Greenplum’s list price is $70K/TB. In practice, both offer substantial discounts, especially at higher volumes.  In both cases, this means raw data, uncompressed, without counting indexes or temp space.

Client experience teaches me that this definition is easy to forget, so let me reemphasize the key point:

Per-terabyte pricing is based on a calculated figure.  Per-terabyte pricing is not based on the current disk space used by your database when managed by the DBMS you are replacing.

There’s at least one important difference in how Vertica and Greenplum calculate database size.  No matter how many times you copy the data, Vertica only charges you for it once.* But if you spin out data marts and recopy data into it — as Greenplum rightly encourages you to do — Greenplum wants to be paid for each copy.  Similarly, Vertica charges only for deployment, and not for test or development; I didn’t remember to ask what Greenplum’s policies are in those regards. (Edit: Greenplum says in a comment below that it doesn’t charge for test or development data either.)

*That policy is a great fit with Vertica’s performance recommendation that you should store columns in different sort orders, perhaps an average of two copies per column.

June 7, 2009

Merv Adrian on SAND Technology

Merv Adrian blogged about SAND Technology, casting significant doubt on SAND’s business prospects.  At this point, I can’t say I disagree. On the other hand, SAND does have public, audited financial statements showing it generating more revenue than a lot of other analytic DBMS or archiving vendors probably make. Columnar DBMS vendors doing better than SAND are Sybase, Vertica, maybe Infobright — and who else?

June 7, 2009

Daniel Abadi on Kickfire and related subjects

Daniel Abadi has a new blog, whose first post centers around Kickfire.  The money quote is (emphasis mine):

In order for me to get excited about Kickfire, I have to ignore Mike Stonebraker’s voice in my head telling me that DBMS hardware companies have been launched many times in the past are ALWAYS fail (the main reasoning is that Moore’s law allows for commodity hardware to catch up in performance, eventually making the proprietary hardware overpriced and irrelevant). But given that Moore’s law is transforming into increased parallelism rather than increased raw speed, maybe hardware DBMS companies can succeed now where they have failed in the past

Good point.

More generally, Abadi speculates about the market for MySQL-compatible data warehousing.  My responses include:

Anyhow, as previously noted, I’m a big Daniel Abadi fan. I look forward to seeing what else he posts in his blog, and am optimistic he’ll live up to or exceed its stated goals.

May 14, 2009

The secret sauce to Clearpace’s compression

In an introduction to archiving vendor Clearpace last December, I noted that Clearpace claimed huge compression successes for its NParchive product (Clearpace likes to use a figure of 40X), but didn’t give much reason that NParchive could compress a lot more effectively than other columnar DBMS. Let me now follow up on that.

To the extent there’s a Clearpace secret sauce, it seems to lie in NParchive’s unusual data access method.  NParchive doesn’t just tokenize the values in individual columns; it tokenizes multi-column fragments of rows.  Which particular columns to group together in that way seems to be decided automagically; the obvious guess is that this is based on estimates of the cardinality of their Cartesian products.

Of the top of my head, examples for which this strategy might be particularly successful include:

April 20, 2009

MySQL storage engine round-up, with Oracle-related thoughts

Here’s what I know about MySQL storage engines, more or less.

April 20, 2009

Calpont update — you read it here first!

Calpont has gone through a lot of strategy iterations since its founding. The super-short version is that Calpont originally planned an appliance built around a SQL chip, much like Kickfire. But after various changes in management and venture backing, Calpont turned itself into a software-only analytic DBMS vendor relying on a MySQL front end. Calpont is now at the stage of announcing an Early Adopter program at the MySQL conference on Wednesday, although details of Calpont’s product release timing, pricing, feature set, etc. are all To Be Determined.

Minor highlights of the Calpont technical story include:

Read more

April 20, 2009

Infobright update

For the past couple of quarters, Infobright has been MySQL’s partner of choice for larger data warehousing applications. Infobright’s stated business metrics, and I quote, include:

  • > 50 Customers in 7 Countries

  • > 25 Partners on 4 continents

  • A vibrant open source community

    • +1 million visitors

    • Approaching 10,000 downloads

    • 2,000 active community participants

These may be compared with analogous metrics Infobright offered in February.

Infobright has also made or promised a variety of technological enhancements. Ones that are either shipping now or promised soon include:

Read more

February 23, 2009

The questionable benefits of terabyte-scale data warehouse virtualization

Vertica is virtualizing via VMware, and has suggested a few operational benefits to doing so that might or might not offset VMware’s computational overhead. But on the whole,it seems virtualization’s major benefits don’t apply to the large-database MPP data warehousing.

Read more

February 4, 2009

Draft slides on how to select an analytic DBMS

I need to finalize an already-too-long slide deck on how to select an analytic DBMS by late Thursday night.  Anybody see something I’m overlooking, or just plain got wrong?

Edit: The slides have now been finalized.

December 20, 2008

More grist for the column vs. row mill

Daniel Abadi and Sam Madden are at it again, following up on their blog posts of six months arguing for the general superiority of column stores over row stores (for analytic query processing).  The gist is to recite a number of bases for superiority, beyond the two standard ones of less I/O and better compression, and seems to be based largely on Section 5 of a SIGMOD paper they wrote with Neil Hachem.

A big part of their argument is that if you carry the processing of columnar and/or compressed data all the way through in memory, you get lots of advantages, especially because everything’s smaller and hence fits better into Level 2 cache. There also is some kind of join algorithm enhancement, which seems to be based on noticing when the result wound up falling into a range according to some dimension, and perhaps using dictionary encoding in a way that will help induce such an outcome.

The main enemy here is row-store vendors who say, in effect, “Oh, it’s easy to shoehorn almost all the benefits of a column-store into a row-based system.”  They also take a swipe — for being insufficiently purely columnar — at unnamed columnar Vertica competitors, described in terms that seemingly apply directly to ParAccel.

December 16, 2008

Introduction to SAND Technology

SAND Technology has a confused history. For example:

SAND is publicly traded, so its numbers are on display. It turns out to be doing $7 million in annual revenue, and losing money.

OK. I just wanted to get all that out of the way. My main thoughts about the DBMS archiving market are in a separate post.

December 2, 2008

Data warehouse load speeds in the spotlight

Syncsort and Vertica combined to devise and run a benchmark in which a data warehouse got loaded at 5 ½ terabytes per hour, which is several times faster than the figures used in any other vendors’ similar press releases in the past. Takeaways include:

The latter is unsurprising. Back in February, I wrote at length about how Vertica makes rapid columnar updates. I don’t have a lot of subsequent new detail, but it made sense then and now.

Read more

October 22, 2008

Introduction to Kickfire

I’ve spent a few hours visiting or otherwise talking with my new clients at Kickfire recently, so I think I have a better feel for their story. A few details are still missing, however, either because I didn’t get around to asking about them, or because an unexplained accident corrupted my notes (and I wasn’t even using Office 2007). Highlights include:

Read more

August 16, 2008

Exasol technical briefing

It took 5 ½ months after my non-technical introduction, but I finally got a briefing from Exasol’s technical folks (specifically, the very helpful Mathias Golombek and Carsten Weidmann). Here are some highlights: Read more

August 14, 2008

Patent nonsense in the data warehouse DBMS market

There are two recent patent lawsuits in the data warehouse DBMS market. In one, Sybase is suing Vertica. In another, an individual named Cary Jardin (techie founder of XPrime, a sort of predecessor company to ParAccel) is suing DATAllegro. Naturally, there’s press coverage of the DATAllegro case, due in part to its surely non-coincidental timing right after the Microsoft acquisition was announced and in part to a vigorous PR campaign around it. And the Sybase case so excited a troll who calls himself Bill Walters that he posted identical references to it on about 12 different threads in this blog, as well as to a variety of Vertica-related articles in the online trade press. But I think it’s very unlikely that any of these cases turn out to much matter. Read more

August 12, 2008

Compare/constrast of Vertica, ParAccel, and Exasol

I talked with Exasol today – at 5:00 am! — and of course want to blog about it. For clarity, I’d like to start by comparing/contrasting the fundamental data structures at Vertica, ParAccel, and Exasol. And it feels like that should be a separate post. So here goes.

Beyond the above, I plan to discuss in a separate post how Exasol does MPP shared-nothing software-only columnar data warehouse database management differently than Vertica and ParAccel do shared-nothing software-only columnar data warehouse database management. :)

August 8, 2008

Database compression coming to the fore

I’ve posted extensively about data-warehouse-focused DBMS’ compression, which can be a major part of their value proposition. Most notable, perhaps, is a short paper Mike Stonebraker wrote for this blog — before he and his fellow researchers started their own blog — on column-stores’ advantages in compression over row stores. Compression has long been a big part of the DATAllegro story, while Netezza got into the compression game just recently. Part of Teradata’s pricing disadvantage may stem from weak compression results. And so on.

Well, the general-purpose DBMS vendors are working busily at compression too. Microsoft SQL Server 2008 exploits compression in several ways (basic data storage, replication/log shipping, backup). And Oracle offers compression too, as per this extensive writeup by Don Burleson.

If I had to sum up what we do and don’t know about database compression, I guess I’d start with this:

Compression is one of the most important features a database management system can have, since it creates large savings in storage and sometimes non-trivial gains in performance as well. Hence, it should be a key item in any DBMS purchase decision.

August 6, 2008

Column stores vs. vertically-partitioned row stores

Daniel Abadi and Sam Madden followed up their post on column stores vs. fully-indexed row stores with one about column stores vs. vertically-partitioned row stores. Once again, the apparently reasonable way to set up the row-store database backfired badly.* Read more

August 5, 2008

Daniel Abadi and Sam Madden on column stores vs. indexed row stores

Daniel Abadi and Sam Madden — for whom I have the highest regard after our discussions regarding H-Store — wrote a blog post on Vertica’s behalf, arguing that column stores are far superior to fully-indexed row stores for not-very-selective queries. They link to a SIGMOD paper backing their argument up, provide some diagrams, and generally make a detailed case. As best I understand, here are some highlights: Read more

August 4, 2008

QlikTech/QlikView update

I talked with Anthony Deighton of memory-centric BI vendor QlikTech for an hour and a half this afternoon. QlikTech is quite the success story, with disclosed 2007 revenue of $80 million, up 80% year over year, and confidential year-to-date 2008 figures that do not disappoint as a follow-on. And a look at the QlikTech’s QlikView product makes it easy to understand how this success might have come about.

Let me start by reviewing QlikTech’s technology, as best I understand it.

Read more

May 29, 2008

Yahoo scales its web analytics database to petabyte range

Information Week has an article with details on what sounds like Yahoo’s core web analytics database. Highlights include:

May 22, 2008

Netezza on compression

Phil Francisco put up a nice post on Netezza’s company blog about a month ago, explaining the Netezza compression story. Highlights include:

Read more

May 8, 2008

Vertica update

Another TDWI conference approaches. Not coincidentally, I had another Vertica briefing. Primary subjects included some embargoed stuff, plus (at my instigation) outsourced data marts. But I also had the opportunity to follow up on a couple of points from February’s briefing, namely:

Vertica has about 35 paying customers. That doesn’t sound like a lot more than they had a quarter ago, but first quarters can be slow.

Vertica’s list price is $150K/terabyte of user data. That sounds very high versus the competition. On the other hand, if you do the math versus what they told me a few months ago — average initial selling price $250K or less, multi-terabyte sites — it’s obvious that discounting is rampant, so I wouldn’t actually assume that Vertica is a high-priced alternative.

Vertica does stress several reasons for thinking its TCO is competitive. First, with all that compression and performance, they think their hardware costs are very modest. Second, with the self-tuning, they think their DBA costs are modest too. Finally, they charge only for deployed data; the software that stores copies of data for development and test is free.

Next Page →

Feed including blog about database management, data warehousing, and business intelligence Subscribe to the Monash Research feed via RSS or email:

Login

Search our blogs and white papers

Monash Research blogs

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.