In my recent post on Exadata pricing, I highlighted the importance of Oracle’s compression figures to the discussion, and the uncertainty about same. This led to a Twitter discussion featuring Greg Rahn* of Oracle and Dave Menninger and Omer Trajman of Vertica. I also followed up with Omer on the phone.
*Guys like Greg Rahn and Kevin Closson are huge assets to Oracle, which is absurdly and self-defeatingly unhelpful through conventional public/analyst relations channels.
Highlights of Greg’s tweets included:
- “I think the HCC 10x compression is a slideware (common) number. Personally I’ve seen it in the 12-17x range on customer data…”
- “This was on a dimensional model. Can’t speak to the specific industry. I do believe Oracle is working on getting industry #s.”
- “As far as I know, Exadata HCC uses a superset of compression algorithms that the commonly known column stores use…”
- “…and it doesn’t require the compression type be in the DDL like Vertica or ParAccel. It figures out the best algo to apply.”
- “The compression number I quoted is sizeof(uncompressed)/sizeof(hcc compressed). No indexes were used in this case.”
- “Exadata HCC is applicable for bulk loaded (fact table) data, so a significant portion (size wise) of most DWs.”
Summing up, that seems to say:
- Oracle claims 12-17X compression on a kind of data similar to that on which Vertica claims 20X. (Vertica, for its part, also uses 10X as a single-point overall compression marketing estimate where one is needed.)
- Oracle selects compression algorithms automagically.
- Oracle’s compression doesn’t quite apply to all the data. Actually, this may be more of an issue for the caching benefits of compression than for the I/O or disk storage gains. (If you join a retail transaction fact table to a customer dimension table, and you have a lot of customers, fitting the uncompressed customer table into RAM could be problematic.)
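To put rough numbers on the compression claims and on that last caveat, here is a back-of-envelope sketch in Python. Every size in it is invented purely for illustration; the only inputs taken from the discussion above are the ratio definition in Greg’s tweet (uncompressed size divided by compressed size) and the 12-17X and 20X figures.

```python
# Back-of-envelope illustration; all sizes below are made up.
# HCC is described above as applying to bulk-loaded fact-table data,
# so the dimension table is assumed to stay uncompressed.
fact_raw_tb = 100    # hypothetical uncompressed fact-table size, in TB
dim_raw_gb = 500     # hypothetical uncompressed customer-dimension size, in GB

for label, ratio in [("HCC low end", 12), ("HCC high end", 17), ("Vertica claim", 20)]:
    print(f"{label}: {fact_raw_tb} TB raw -> {fact_raw_tb / ratio:.1f} TB on disk ({ratio}x)")

# Whatever the fact-table ratio, the uncompressed dimension still has to
# compete for cache/RAM at query time.
print(f"uncompressed dimension: {dim_raw_gb} GB")
```

The disk and I/O arithmetic is trivial either way; the interesting question is the caching one, namely how much of the frequently-joined data is left uncompressed.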
Omer and I happened to have a call scheduled to discuss MapReduce yesterday evening, but wound up using most of the time to talk about Vertica’s compression and physical layout features instead. Highlights included:
- Greg, like many Vertica competitors, was wrong about Vertica requiring manual, low-level DDL (Data Definition Language) for, well, much of anything. Vertica does all that automatically, at least in theory, and suggests that in real life you can indeed often get by without manual intervention.
- Vertica can do trickle feeds into its compressed columnar storage. Greg seemed to suggest Oracle Exadata cannot. (However, I won’t be surprised if, when his comments are expanded to more than 140 characters, he winds up saying the opposite.)
- Omer characterized the lowest latency with which you can get data into Vertica and have it be available for query as “seconds”, vs. “minutes” for other columnar vendors.
- Vertica often recommends keeping multiple copies of a column, for high availability and/or performance. This is not directly reflected in compression estimates. In particular, if you’re going to keep redundant copies of data for data-safety reasons anyway, Vertica recommends that you do the following (there’s a small sketch of the idea after this list):
- Run queries against more than one copy of the data, for performance/throughput.
- Store different copies of the columns in different sort orders — e.g., according to different likely join keys — so that the copies are optimized for performance on different classes of queries.
- Vertica doesn’t have indexes.
- Vertica sorts columns on ingest. This sorting is, of course, commonly based on attributes from columns other than the one being sorted. Even so, Omer maintains that sorting helps compression, because of the correlation between columns (there’s a second sketch of this after the list). Examples (and I didn’t get these all from him) might include:
- City/postal code
- Customer_ID/store location
- Vertica, based on the recent introduction of FlexStore, has an ILM (Information Lifecycle Management) story much like Sybase IQ’s. E.g., you can keep different data ranges for different columns on fast storage, while the rest of the data is relegated to slower/cheaper equipment.
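Going back to the point about redundant copies kept in different sort orders: below is a minimal Python sketch of the idea, not of Vertica’s actual machinery. The table, column names, and row values are all invented; the point is simply that each copy turns a predicate on its own sort key into a scan of one contiguous slice.

```python
import bisect
from collections import namedtuple

# Conceptual sketch with a hypothetical fact table; not Vertica's implementation.
Sale = namedtuple("Sale", ["customer_id", "store_id", "amount"])

rows = [Sale(3, 20, 9.50), Sale(1, 10, 4.00), Sale(2, 10, 7.25),
        Sale(1, 30, 2.00), Sale(2, 20, 1.50)]

# Two redundant copies of the same rows, each sorted on a different key.
by_customer = sorted(rows, key=lambda r: r.customer_id)
by_store = sorted(rows, key=lambda r: r.store_id)

def range_scan(copy, key, lo, hi):
    """Read only the contiguous slice where lo <= key(row) <= hi."""
    keys = [key(r) for r in copy]
    return copy[bisect.bisect_left(keys, lo):bisect.bisect_right(keys, hi)]

# A query restricted on customer_id scans the copy sorted that way;
# a query restricted on store_id scans the other copy.
print(range_scan(by_customer, lambda r: r.customer_id, 1, 2))
print(range_scan(by_store, lambda r: r.store_id, 10, 10))
```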
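As for the claim that sorting on one column helps compress other, correlated columns: the toy below, with made-up city/postal-code data, counts the runs a run-length encoder would have to store before and after sorting on the city column. It illustrates the effect Omer described, nothing more.

```python
from itertools import groupby
import random

def rle_runs(values):
    """Number of runs a run-length encoder would have to store."""
    return sum(1 for _ in groupby(values))

# Made-up data in which postal_code is determined by city, i.e., the two
# columns are perfectly correlated.
random.seed(0)
cities = [f"city_{i:02d}" for i in range(50)]
rows = [(city, 1000 + cities.index(city)) for city in random.choices(cities, k=10_000)]

unsorted_codes = [postal for _, postal in rows]
sorted_codes = [postal for _, postal in sorted(rows)]  # sorts on city, the first field

print("postal-code runs before sorting:", rle_runs(unsorted_codes))  # close to 10,000
print("postal-code runs after sorting: ", rle_runs(sorted_codes))    # exactly 50
```

With perfectly correlated columns the effect is extreme, but the same direction holds for merely strong correlations such as customer ID and store location.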