September 28, 2006

Relational data warehouse Expansion (or Explosion) Ratios

One of the least understood aspects of data warehouse technology is what may be called the

Expansion Ratio = (Total disk space used, except for mirroring) / (Size of the base database).

This is similar to the explosion ratio discussed in the OLAP Report’s justly famous discussion of database explosion, but I’m going with my own terminology because I don’t want to be tied to their precise terminology, nor to their technical focus. Expansion Ratios are hotly debated, with some figures being:

I don’t have actual figures from Netezza and DATallegro, but I imagine they’d come out lower than 2X, possibly well below.

Expansion Ratios are a huge factor in data warehouse cost. It’s not just the disks and their controllers; it’s also the processing done of those extra indices or aggregates or whatever, and the human cost of administering them. (E.g., Teradata claims that when moving from Oracle to Teradata, there’s commonly a full order-of-magnitude reduction in both index number and aggregate index size; clearly, the DBA cost in such a scenario would be drastically reduced as well.)

The question naturally arises: What is all this space being used for? Well, some is necessary as a pool for temporary/intermediate tables that will arise in query processing. Some arises if you reserve fixed amounts of space for something before filling them. Some just happens because disks aren’t 100% utilized. But all that shouldn’t be enough to get you up even to 1.5X. All the rest has to come from indices, aggregations, and their equivalents.

Single-column indices, I would think, are far from a complete explanation. Yes, if you indexed every column, you’d clearly have a greater than 1X increment to the Expansion Ratio, certainly before compression and probably after as well. But the increment wouldn’t be much over 1X either. So the big explosion would almost have to be in the multi-column stuff – prejoins, aggregated rollups (with are inherently indexed by multiple dimensions at once), and so on.

But by that logic, if you don’t have a major star/snowflake/multidimensional ROLAP flavor to your database, and you also don’t need any terribly explosive prejoins, you’re not going to have a large Expansion Ratio. Thus, Oracle’s claim to have very low Expansion Ratios in some large warehouses is completely credible. And so I’ll end by quoting Oracle’s claim, verbatim, by permission, from an email I received:

An analysis of eight Oracle data warehouses ranging from 10 TB to 60TB in size revealed that on average 63% of the database was occupied by data within tables, 18% by indexes, and 19% by unused space and temporary space. For the four customers who indicated which tables were summaries (either as materialized views of custom-built summaries), none of these customers dedicated more than 15% of the database to summary data.

Comments

9 Responses to “Relational data warehouse Expansion (or Explosion) Ratios”

  1. William McKnight on September 29th, 2006 4:26 pm

    Curt, I believe that end users of data warehouse technologies have good control over the expansion, or uptake, ratio. It really depends on the level of performance that is needed and the amount of the data that they actually access in online queries. Some sites index everything and others only do it when absolutely necessary. Same goes for creating summary tables and cubes. TPC numbers represent the high end of anything that would be implemented at a real site because the vendors use TPC to brag most about the performance figures, not the price/performance. The Teradata numbers seem about right in practice.

    By the way, Teradata has an interesting new feature called “multi temperature” which allows sites to pay less for the data they don’t access as much. Many have been implementing concepts like this for some time to find a good price/performance balance.

  2. DBMS2 — DataBase Management System Services»Blog Archive » IBM and Teradata too on October 3rd, 2006 1:52 am

    […] When I posted a few days ago about Expansion Ratios, I passed along Teradata’s estimate that there was a significant gap between Teradata and IBM. IBM’s own estimate, however, is that a typical Expansion Ratio for DB2 is a Teradata-like 2-3X in DB2 Version 8, and that it’s falling further in Version 9 due to compression. (DB2’s compression sounds like it’s the most aggressive in the business, specialty columnar or MOLAP products perhaps excepted.) Incidentally, this suggests that the indexing features DB2 has that Teradata doesn’t – e.g., alternate datatypes like geospatial – aren’t heavily used by a large fraction of the customer base. […]

  3. Curt Monash on October 3rd, 2006 3:49 am

    William,

    Yes, but the question is how much you need in the way of indexing and summaries to get the same performance from different systems.

    Also, the grand hash partition that starts off a typical MPP installation saves some index space. And if table scans are fast enough a lot of the time, you need fewer indicies than if they almost never are.

    Anyhow, if you look up to my IBM/Teradata post, IBM claims that indeed their installations wind up very comparable to Teradata’s. But then, IBM and Teradata also have pretty similar overall technical approaches now.

  4. DBMS2 — DataBase Management System Services»Blog Archive » Progress Software progress report on April 28th, 2007 10:20 pm

    […] size vs. Oracle for the same data, due to, for example, variable record lengths. Such factors are not unheard of in data warehousing, of course; but for OLTP his comment was quite the jaw-dropper. On the other hand, OLTP databases […]

  5. DBMS2 — DataBase Management System Services » Blog Archive » How Hyperion will change Oracle on May 24th, 2008 10:03 pm

    […] weak recently, and deservedly so. MOLAP’s problem with database explosion produces ridiculous expansion ratios, recent work to lessen the problem notwithstanding. Nonetheless, the Essbase installed base is […]

  6. Infology.Ru » Blog Archive » Оценивая КПД системы хранения: какую долю объема системы хранения занимают данные пользователя on October 21st, 2008 5:17 pm

    […] Индексы и временное рабочее пространство. Это то, о чем я уже писал пару лет назад в моем посте о коэффициентах расширения. […]

  7. Infobright BrightHouse — columnar, VERY compressed, simple, and related to MySQL | DBMS2 -- DataBase Management System Services on June 5th, 2010 4:10 am

    […] The company says that experience shows >10:1 compression of user data is realistic – i.e., an expansion ratio that’s fractional, and indeed better than 1/10:1. Accordingly, despite the lack of […]

  8. Notes on indexes and index-like structures | DBMS 2 : DataBase Management System Services on April 16th, 2015 6:43 pm

    […] also induce costs in database size and administrative efforts. (Manual index management is often the biggest hurdle for […]

  9. TECNOLOGÍA » DBMS 2 : DataBase Management System Services on November 30th, 2015 9:38 am

    […] also induce costs in database size and administrative efforts. (Manual index management is often the biggest hurdle for […]

Leave a Reply




Feed: DBMS (database management system), DW (data warehousing), BI (business intelligence), and analytics technology Subscribe to the Monash Research feed via RSS or email:

Login

Search our blogs and white papers

Monash Research blogs

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.