There’s a lot of confusion about how to measure data warehouse database size. Major complicating factors include:
- Indexes and temporary working space. That’s what I emphasized a couple of years ago in my post about Expansion Ratios.
- Compression. I write about database compression a lot.
- Disk redundancy. I usually gloss over that one, but I’ll try to make amends in this post.
- Replication other than that which is primarily designed for redundancy. I usually gloss over that one too, and I think it’s safe to continue doing so. That’s because data warehouse replication – at least in most of the system architectures I know of – generally divides into three categories:
- a lot like redundancy
- a lot like an index
- only a minor issue (e.g., when small fact tables are replicated across each node of an MPP cluster)
Greenplum’s CTO Luke Lonergan recently walked me through the general disk usage arithmetic for Greenplum’s most common configuration (Sun Thors*, configured to RAID 10). I found it pretty interesting, and a good guide to factors that also affect systems from other vendors.
*I presume that “Thor” is the successor to “Thumper” because the names alliterate, and Thor thumped things.
- A Thor box has 48 disks.
- 2 disks are used for the operating system (i.e., software) rather than data.
- 2 more disks are reserved as hot spares.
- Thus, there are 44 disks that can store data.
- RAID 10 knocks off a factor of 2.
- General wastage from formatting knocks off a factor of 10% or so.
- Temp space amounts to 0.33X of user data, and internode mirroring adds another 1X. Counting the 1X of user data itself, those two considerations knock off another factor of 2.33 (1 + 1 + 0.33).
Thus, before factoring in compression, the amount of user data one can put on a 48-disk box is 8.5X the rated capacity of one disk (which may be ¼, ½, or 1 terabyte). However, just to be on the safe side, what Greenplum actually quotes is a factor of 6X, for what one might call a “loss factor” of 8:1 (48 disks of raw capacity per 6 disks’ worth of user data).
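The arithmetic above can be sketched in a few lines of Python. This is just my restatement of the steps Luke described; the variable names are my own.

```python
# Disk-usage arithmetic for a 48-disk Thor box configured to RAID 10,
# per the Greenplum figures described above.

TOTAL_DISKS = 48
OS_DISKS = 2           # operating system (i.e., software), not data
HOT_SPARES = 2         # reserved as hot spares

data_disks = TOTAL_DISKS - OS_DISKS - HOT_SPARES   # 44 disks can store data

RAID10_FACTOR = 2.0    # RAID 10 mirroring halves usable capacity
FORMAT_WASTAGE = 0.10  # ~10% general wastage from formatting
TEMP_AND_MIRROR = 2.33 # 1X user data + 1X internode mirror + 0.33X temp space

user_data_disks = (data_disks / RAID10_FACTOR) * (1 - FORMAT_WASTAGE) / TEMP_AND_MIRROR

print(round(user_data_disks, 1))  # → 8.5 (times one disk's rated capacity)
```

Greenplum’s quoted 6X is simply this 8.5X figure with a safety margin taken off the top.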
And just to confuse things — compression can get most or all of that back. For example, at a multi-petabyte customer that is loading up its Greenplum/Thor machines now, early indications suggest a compression factor of 7.5X. (I didn’t actually ask, but I assume that’s including indexes, as is common when discussing overall compression figures. As I understand the application(s), there probably aren’t a lot of indexes anyway.)
And so, after all those calculations, the amount of user data winds up to be almost exactly equal to the amount of spinning disk. (But for a vendor or database with a different compression ratio, that rough equality would of course not hold.)
That assumes I got it all right, of course. There was quite a lot of telephone discussion and back-and-forth e-mail to get even this far, so one or more errors could have easily slipped through.