Notes on columnar/TPC-H compression
I was chatting with Omer Trajman of Vertica, and he said that a 70% compression figure for ParAccel’s recent TPC-H filing sounded about right.* When I noted that seemed kind of low, Omer pointed out that TPC-H data is pseudo-random, while real-life data has much more correlation among the values in different columns. E.g., in retail, a customer is likely to consistently shop at the same stores and to put similar items into his shopping basket.
*Omer was involved in Vertica’s TPC-H-data-based load speed benchmark, and is Vertica’s representative to the TPC.
But why does this matter? After all, Vertica compresses one column at a time (unlike, say, Clearpace). Well, the reason is that Vertica — like other column stores — wants to store different columns in the same row order, for obvious benefits in both reading and writing. So, for example, if all the rows that include Gotham City are grouped sequentially, then all the rows mentioning Bruce Wayne are likely to be near each other as well, while none of the rows that mention Clark Kent will be mixed in.
And when a set of consecutive entries has low cardinality, it’s easier to get high levels of compression.
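To make that concrete, here is a minimal Python sketch of run-length encoding, one of several encodings a column store might apply. The city data is purely illustrative and nothing here reflects Vertica’s actual implementation; the point is simply that long runs of identical values compress far better than interleaved ones:

```python
from itertools import groupby

def run_length_encode(column):
    """Collapse consecutive duplicate values into (value, count) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]

# Correlated, row-order-preserved data: all the Gotham City rows sit together.
sorted_city_column = ["Gotham City"] * 4 + ["Metropolis"] * 4
print(run_length_encode(sorted_city_column))
# [('Gotham City', 4), ('Metropolis', 4)]  -- 2 entries instead of 8

# Pseudo-random (TPC-H-like) data: values alternate, so runs stay short.
shuffled_city_column = ["Gotham City", "Metropolis"] * 4
print(run_length_encode(shuffled_city_column))
# [('Gotham City', 1), ('Metropolis', 1), ...]  -- 8 entries, no savings
```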
Categories: Benchmarks and POCs, Columnar database management, Data warehousing, Database compression, Vertica Systems
Storage humor
A Microsoft Answers message board got the question:
I’ve noticed that as I copy data/install programs on my Laptop, the weight of the Laptop increases. I have a bad back and am medically limited on the amount of weight I can carry so I need to be very carefull not to inflict injury upon myself.
I have also noticed my XBox feels heavier as well (the more games I save or purchase from arcade). I generally don’t travel with my XBox so that is not an issue for me, but note the I am having the same results.
My ask, what is the weight/file ratio? So for example, how many GB’s = 6oz? I dread the day I need a dolly to commute to work with my Laptop.
Hilarity ensued.
Categories: Fun stuff, Humor, Storage
NoSQL?
Eric Lai emailed today to ask what I thought about the NoSQL folks, and especially whether I thought their ideas were useful for enterprises in general, as opposed to just Web 2.0 companies. That was the first I heard of NoSQL, which seems to be a community discussing SQL alternatives popular among the cloud/big-web-company set, such as BigTable, Hadoop, Cassandra and so on. My short answers are:
- In most cases, no.
- Most of these technologies are designed for simple, high-volume OLTP (OnLine Transaction Processing.) Most large enterprises have an established way of doing OLTP, probably via relational database management systems. Why change?
- MapReduce is an exception, in that it’s designed for analytics (see the sketch after this list). MapReduce may be useful for enterprises. But where it is, it probably should be integrated into an analytic DBMS.
- There’s one big countervailing factor to all these generalities — schema flexibility.
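Regarding the MapReduce bullet, here is a minimal single-process sketch in Python of the map/shuffle/reduce pattern applied to a small analytic task, counting views per page. The data and function names are made up for illustration; this is not any particular vendor’s API:

```python
from itertools import groupby
from operator import itemgetter

def mapper(log_line):
    """Emit a (page, 1) pair for each page-view record."""
    page, _user = log_line.split()
    return (page, 1)

def reducer(page, counts):
    """Aggregate all the counts emitted for one key."""
    return (page, sum(counts))

log = ["/home alice", "/pricing bob", "/home carol", "/home dave"]

# Map phase, then a "shuffle" that groups pairs by key, then reduce.
mapped = sorted(map(mapper, log), key=itemgetter(0))
results = [reducer(page, (count for _, count in group))
           for page, group in groupby(mapped, key=itemgetter(0))]
print(results)  # [('/home', 3), ('/pricing', 1)]
```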
As for the longer form, let me start by noting that there are two main kinds of reason for not liking SQL. Read more
Correction to a recent quote
I’m quoted in a recent article around Aster’s appliance announcement as saying data warehouse appliances are more suitable for small workgroups of analysts crunching small amounts of data than they are for other uses.
But that’s not what I think at all.
I do think the ease-of-administration pitch for appliances makes them particularly well suited for users who want to scrape by without doing much database administration. This is especially appealing to departments or smaller enterprises. And the first/best scenario that comes to mind is indeed a small team of analysts, with good SQL skills but lightweight DBA experience, although Netezza has proved that many other kinds of users can find appliances appealing as well.
But that small team of analysts may maintain the largest database in the firm.
And by the way — notwithstanding the MySpace counterexample, most of Aster’s initial customers had <10 terabyte databases, and I think indeed <5 terabytes. The “frontline” pitch succeeded for Aster before (MySpace again aside) any better-big-data-crunching story did.
Categories: Analytic technologies, Aster Data, Data warehouse appliances, Data warehousing, Theory and architecture
Is Expressor Software accomplishing anything?
Expressor Software is putting out a ton of press releases to the effect that it has signed up another reseller/systems integration partner or, in some cases, sponsored a webinar. Less clear is whether Expressor is selling much of anything, delivering product people care about, and so on. The one time I visited, Expressor told me that user interface was its strength, then showed me something very primitive and explained — as the famed joke* would have it — how good it was going to be.
*That would be the Thrice-Married Virgin, although I’ve recently seen versions in which the poor unfortunate was married 12 times. The last husband on the list is always a computer or software salesman, who keeps telling her how good it is going to be. I first heard the joke from Flip Filipowski. I decided it must not be too terribly sexist after hearing Sandy Kurtzig tell it to a group of stock analysts.
Am I missing anything major?
Edit: I emailed the company on May 8, asking what Expressor had in the way of customers. There has been no response.
Categories: EAI, EII, ETL, ELT, ETLT, Expressor, Humor
Xtreme Data readies a different kind of FPGA-based data warehouse appliance
Xtreme Data called me to talk about its plans in the data warehouse appliance business, almost all details of which are currently embargoed. Still, a few points may be worth noting ahead of more precise information, namely:
- Xtreme Data’s basic idea is to take a custom board and build a data warehouse appliance around it.
- An Xtreme Data board looks a lot like a conventional two-socket board, but has only one four-core CPU. In addition, it sports some FPGAs (Field-Programmable Gate Arrays).
- In the Xtreme Data appliance, the FPGAs will be used for core SQL processing, after the data is ingested via conventional I/O. This is different from Netezza’s approach to FPGA-based data warehouse appliances, in which the FPGA sits in the place of a disk controller and touches the data first, before passing it off to a more or less conventional CPU.
- While preparing entry into the data warehouse appliance business, Xtreme Data has sold its board to 150 other outfits, many quite impressive. Buyers seem to be FPGA users who previously had to craft their own custom boards. According to Xtreme Data, major uses by these customers include:
- Military/intelligence/digital signal processing.
- Military/intelligence/cybersecurity (a newish area for Xtreme Data)
- Bioinformatics/high-throughput gene sequencing (a “handful” of customers)
- Medical imaging
- More or less pure university research of various sorts (around 50 customers)
- … but not database management.
- Xtreme Data’s website has a non-obvious URL. 🙂
So far as I can tell, Xtreme Data’s 1.0 product will — like most other 1.0 analytic database management products — be focused on price/performance, with little or no positive differentiation in the way of features.
Categories: Data warehouse appliances, Data warehousing, Netezza, Theory and architecture, XtremeData
Aster Data enters the appliance game
Aster Data is rolling out a line of nCluster appliances today. Highlights include:
- Configurations ranging from 6.25 terabytes to 1 petabyte of user data. (Edit: Here’s the up-to-date data sheet.)
- A $50K “Express Edition” price for <1 terabyte of user data. Unfortunately, that’s the only stated price.
- The option of bundled MicroStrategy.
- “MapReduce” in the name, which suggests something about the positioning — i.e., enterprise decision support, rather than Aster’s usual web/”frontline” emphasis. (Edit: That also fits with Aster’s recent MapReduce-for-.NET announcement.) (Edit: Actual name is Aster MapReduce Data Warehouse Appliance.)
- Claims that because Aster runs effectively on cheaper, more truly “commodity” hardware than competitors, you get more hardware bang for the buck if you buy from Aster.
I don’t have a lot more to add right now, mainly because I wrote at some length about Aster’s non-appliance-specific, non-MapReduce technology and positioning a couple of weeks ago.
Categories: Analytic technologies, Aster Data, Business intelligence, Data warehouse appliances, Data warehousing, Database compression, MapReduce, Pricing
My current customer list among the analytic DBMS specialists
(This is an updated version of an August, 2008 post.)
One of my favorite pages on the Monash Research website is the list of many current and a few notable past customers. (Another favorite page is the one for testimonials.) For a variety of reasons, I won’t undertake to be more precise about my current customer list than that. But I don’t think it would hurt anything to list the analytic/data warehouse DBMS/appliance specialists in the group. They are:
- Aster Data
- Greenplum
- Infobright
- Kickfire
- Kognitio
- Microsoft
- Netezza (my biggest client this year, probably, because of all the Enzee Universe appearances)
- Sybase
- Teradata
- Vertica
- Attivio, which may or may not be construed as being in the analytic DBMS business
- Clearpace, ditto
All of those are Monash Advantage members.
If you care about all this, you may also be interested in the rest of my standards and disclosures.
Categories: About this blog, Aster Data, Data warehousing, Greenplum, Infobright, Kickfire, Microsoft and SQL*Server, Netezza, Sybase, Teradata, Vertica Systems
ParAccel pricing
As I noted in connection with ParAccel’s recent TPC-H filing, I think the whole exercise is basically an expensive joke. But one slightly useful spin-off is that ParAccel disclosed pricing. Specifically, ParAccel’s stated price in the disclosure document is:
- $100,000/TB license fee (user data). That’s like Vertica, although I don’t know whether ParAccel emulates Vertica’s policy of making test and development licenses free.
- 57% quantity discount at 30 terabytes. That’s not surprising.
- 1% annual maintenance fee (applied to the discounted price). That’s astounding.
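For what it’s worth, here is the arithmetic those three terms imply at the benchmark’s 30-terabyte size. This is my own back-of-the-envelope sketch from the stated figures, not a total ParAccel has quoted:

```python
list_price_per_tb = 100_000   # stated license fee per terabyte of user data
terabytes = 30                # benchmark-sized configuration
quantity_discount = 0.57      # stated discount at 30 terabytes
maintenance_rate = 0.01       # stated annual maintenance, on the discounted price

discounted_license = list_price_per_tb * terabytes * (1 - quantity_discount)
annual_maintenance = discounted_license * maintenance_rate

print(f"Discounted license: ${discounted_license:,.0f}")   # $1,290,000
print(f"Annual maintenance: ${annual_maintenance:,.0f}")   # $12,900
```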
Last year ParAccel quoted prices of $100,000/TB or $50,000/server. The latter figure would seem to have led to lower numbers on the benchmark configuration, so perhaps it’s no longer an option on ParAccel’s price list.
Categories: Benchmarks and POCs, Data warehousing, ParAccel, Pricing
The TPC-H benchmark is a blight upon the industry
ParAccel has released a 30,000-gigabyte TPC-H benchmark, and no less a sage than Merv Adrian paid attention. Now, the TPCs may have had some use in the 1990s. Indeed, Merv was my analyst relations contact for a visit to my clients at Sybase around the time — 1996 or so — I was advising Sybase on how to market against its poor benchmark results. But TPCs are worthless today.
It’s not just that TPCs are highly tuned (ParAccel’s claim of “load-and-go” is laughable. Edit: Looking at Appendix A of the full disclosure report, maybe it’s more justified than I thought.). It’s also not just that different analytic database management products perform very differently on different workloads, making the TPC-H not much of an indicator of anything real-life. The biggest problem is: Most TPC benchmarks are run on absurdly unrealistic hardware configurations.
For example, if you look at some details, the ParAccel 30-terabyte benchmark ran on 43 nodes, each with 64 gigabytes of RAM and 24 terabytes of disk. That’s 961,124.9 gigabytes of disk, officially, for a 32:1 disk/data ratio. By way of contrast, real-life analytic DBMS with good compression often have disk/data ratios of well under 1:1.
Meanwhile, the RAM:data ratio is around 1:11. It’s clear that ParAccel’s early TPC-H benchmarks ran entirely in RAM; indeed, ParAccel even admits that. And so I conjecture that ParAccel’s latest TPC-H benchmark ran (almost) entirely in RAM as well. Once again, this would illustrate that the TPC-H is irrelevant to judging an analytic DBMS’ real-world performance.
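For the record, here is the arithmetic behind those two ratios, as a quick sketch from the configuration figures above (the 961,124.9 gigabyte disk total is the official figure from the disclosure, a bit less than the raw 43 × 24 TB product):

```python
nodes = 43
ram_per_node_gb = 64
user_data_gb = 30_000            # 30 terabytes of TPC-H data
official_disk_gb = 961_124.9     # total disk per the full disclosure report

disk_to_data = official_disk_gb / user_data_gb
total_ram_gb = nodes * ram_per_node_gb
data_to_ram = user_data_gb / total_ram_gb

print(f"Disk:data ratio ~ {disk_to_data:.0f}:1")   # ~32:1
print(f"RAM:data ratio  ~ 1:{data_to_ram:.0f}")    # ~1:11
```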
More generally — I would not advise anybody to consider ParAccel’s product, for any use, except after a proof-of-concept in which ParAccel was not given the time and opportunity to perform extensive off-site tuning. I tend to feel that way about all analytic DBMS, but it’s a particular concern in the case of ParAccel.