One of the least understood aspects of data warehouse technology is what may be called the
Expansion Ratio = (Total disk space used, except for mirroring) / (Size of the base database).
This is similar to the explosion ratio discussed in the OLAP Report’s justly famous discussion of database explosion, but I’m going with my own terminology because I don’t want to be tied to their precise terminology, nor to their technical focus. Expansion Ratios are hotly debated, with some figures being:
- Teradata claims an Expansion Ratio of 8-9X for Oracle, 6X for DB2 (open system version), and 2.5X for Teradata. The underlying source is data warehouses they’ve replaced, so there may be a bias toward out-of-control warehouses on the part of their competitors.
- An anonymous appliance vendor exec said to me off the top of his head that Oracle has 6-8X Expansion Ratios.
- Oracle’s TPC-H submissions in the largest size range (10 terabytes) have 9.7-10.5X Expansion Ratios, if I’m reading the TPCs correctly.
- Oracle cites a survey of 8 customers with 10-60 Tb database size in which the Expansion Ratio works out to 1.6X. (More on this anomalous result below.)
I don’t have actual figures from Netezza and DATallegro, but I imagine they’d come out lower than 2X, possibly well below.
Expansion Ratios are a huge factor in data warehouse cost. It’s not just the disks and their controllers; it’s also the processing done of those extra indices or aggregates or whatever, and the human cost of administering them. (E.g., Teradata claims that when moving from Oracle to Teradata, there’s commonly a full order-of-magnitude reduction in both index number and aggregate index size; clearly, the DBA cost in such a scenario would be drastically reduced as well.)
The question naturally arises: What is all this space being used for? Well, some is necessary as a pool for temporary/intermediate tables that will arise in query processing. Some arises if you reserve fixed amounts of space for something before filling them. Some just happens because disks aren’t 100% utilized. But all that shouldn’t be enough to get you up even to 1.5X. All the rest has to come from indices, aggregations, and their equivalents.
Single-column indices, I would think, are far from a complete explanation. Yes, if you indexed every column, you’d clearly have a greater than 1X increment to the Expansion Ratio, certainly before compression and probably after as well. But the increment wouldn’t be much over 1X either. So the big explosion would almost have to be in the multi-column stuff – prejoins, aggregated rollups (with are inherently indexed by multiple dimensions at once), and so on.
But by that logic, if you don’t have a major star/snowflake/multidimensional ROLAP flavor to your database, and you also don’t need any terribly explosive prejoins, you’re not going to have a large Expansion Ratio. Thus, Oracle’s claim to have very low Expansion Ratios in some large warehouses is completely credible. And so I’ll end by quoting Oracle’s claim, verbatim, by permission, from an email I received:
An analysis of eight Oracle data warehouses ranging from 10 TB to 60TB in size revealed that on average 63% of the database was occupied by data within tables, 18% by indexes, and 19% by unused space and temporary space. For the four customers who indicated which tables were summaries (either as materialized views of custom-built summaries), none of these customers dedicated more than 15% of the database to summary data.