September 1, 2008

Estimating user data vs. spinning disk

There’s a lot of confusion about how to measure data warehouse database size, thanks to a number of complicating factors.

Greenplum’s CTO Luke Lonergan recently walked me through the general disk usage arithmetic for Greenplum’s most common configuration (Sun Thors*, configured as RAID 10). I found it pretty interesting, and a good guide to the factors that also affect other systems, from other vendors.

*I presume that “Thor” is the successor to “Thumper” because the names alliterate, and Thor thumped things.

Here goes.

Thus, before factoring in compression, the amount of user data one can put on a 48-disk box is 8.5X the rated capacity of one disk (which may be ¼, ½, or 1 terabyte). However, just to be on the safe side, what Greenplum actually quotes is a factor of 6X, for what one might call a “loss factor” of 8:1 (48 disks divided by 6).
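The arithmetic above can be sketched in a few lines. This is just a restatement of the figures in the post (48 disks, 8.5X usable before compression, 6X quoted); the 1 TB rated disk size is an assumption picked for illustration, and any of the three rated sizes would give the same ratios.

```python
# Restating the post's capacity arithmetic. The 8.5X and 6X factors come
# from the post; the 1 TB rated disk size is an illustrative assumption.
DISKS = 48
RATED_TB_PER_DISK = 1.0            # could equally be 0.25 or 0.5

raw_tb = DISKS * RATED_TB_PER_DISK          # total spinning disk: 48 TB
usable_tb = 8.5 * RATED_TB_PER_DISK         # pre-compression user data capacity
quoted_tb = 6.0 * RATED_TB_PER_DISK         # the conservative figure Greenplum quotes

loss_factor = raw_tb / quoted_tb            # 48 / 6 = 8, i.e. the 8:1 "loss factor"
print(loss_factor)                          # prints 8.0
```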

And just to confuse things — compression can get most or all of that back. For example, at a multi-petabyte customer that is loading up its Greenplum/Thor machines now, early indications suggest a compression factor of 7.5X. (I didn’t actually ask, but I assume that’s including indexes, as is common when discussing overall compression figures. As I understand the application(s), there probably aren’t a lot of indexes anyway.)

And so, after all those calculations, the amount of user data winds up being almost exactly equal to the amount of spinning disk. (But for a vendor or database with a different compression ratio, that rough equality would of course not hold.)
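The near-equality can be checked directly: the quoted 6X pre-compression factor times the observed 7.5X compression almost exactly cancels the 8:1 loss factor. All the figures here are from the post; only the arithmetic is new.

```python
# Checking the post's bottom line: quoted capacity factor times observed
# compression versus raw disk count. All input figures come from the post.
DISKS = 48
quoted_factor = 6.0        # disks' worth of user data, before compression
compression = 7.5          # compression factor seen at the multi-petabyte customer

user_data_disks = quoted_factor * compression   # 45 "disks" worth of user data
ratio = user_data_disks / DISKS                 # 45/48 = 0.9375, nearly 1:1
```

At a 7.5X compression factor the equality is close; at, say, 3X it would be nowhere near holding, which is the post’s caveat.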

That assumes I got it all right, of course. There was quite a lot of telephone discussion and back-and-forth e-mail to get even this far, so one or more errors could have easily slipped through.


5 Responses to “Estimating user data vs. spinning disk”

  1. Luke Lonergan on September 2nd, 2008 2:52 am

    Yep – it’s right enough, but confusing!

    One key thing to note – only the RAID10 consumes spindles, the rest of the overhead consumes space on the devices. Spindles are expensive, space is dirt cheap.

    In other words, with RAID10, you need physically distinct devices to enable fault tolerance.

    The inter-host mirroring we provide consumes disk *space* on mirror hosts, allocated from the same group of RAID’ed spindles that the primary space is on.

  2. Introduction to Aster Data and nCluster | DBMS2 -- DataBase Management System Services on September 2nd, 2008 4:56 am

    […] has 360 terabytes of spinning disk, which suggests the implementation there might have one level less of redundancy than some other systems […]

  3. Head to head blog debate between EMC, NetApp, and HP | DBMS2 -- DataBase Management System Services on September 3rd, 2008 10:00 pm

    […] recent foray into measuring disk storage pales by comparison. Share: These icons link to social bookmarking sites where readers can share […]

  4. Oracle Exadata list pricing | DBMS2 -- DataBase Management System Services on September 28th, 2008 1:46 am

    […] x 300 GB and 12 x 1 TB of spinning disk respectively. That’s a 1:3.6 ratio, vs. the 1:8 ratio Greenplum quotes. Differences include 4% of Greenplum’s disks being used for hot spares (Oracle’s […]

  5. Infology.Ru » Blog Archive » Evaluating storage efficiency: what share of a storage system’s capacity is occupied by user data on October 21st, 2008 5:14 pm

    […] Author: Curt Monash. Original publication date: 2008-09-01. Translation: Oleg Kuzmenko. Source: Curt Monash’s blog […]
