There’s a lot of confusion about how to measure data warehouse database size. Major complicating factors include:
- Indexes and temporary working space. That’s what I emphasized a couple of years ago in my post about Expansion Ratios.
- Compression. I write about database compression a lot.
- Disk redundancy. I usually gloss over that one, but I’ll try to make amends in this post.
- Replication other than that which is primarily designed for redundancy. I usually gloss over that one too, and I think it’s safe to continue doing so. That’s because data warehouse replication – at least in most of the system architectures I know of – generally divides into three categories:
- a lot like redundancy
- a lot like an index
- only a minor issue (e.g., when small fact tables are replicated across each node of an MPP cluster)
Greenplum’s CTO Luke Lonergan recently walked me through the general disk usage arithmetic for Greenplum’s most common configuration (Sun Thors*, configured to RAID 10). I found it pretty interesting, and a good guide to factors that also affect systems from other vendors.
*I presume that “Thor” is the successor to “Thumper” because the names alliterate, and Thor thumped things.
- A Thor box has 48 disks.
- 2 disks are used for the operating system (i.e., software) rather than data.
- 2 more disks are reserved as hot spares.
- Thus, there are 44 disks that can store data.
- RAID 10 knocks off a factor of 2.
- General wastage from formatting knocks off a factor of 10% or so.
- Temp space amounts to 0.33X of user data, and internode mirroring adds another 1X. Counting the 1X of user data itself, those two considerations knock off another factor of 2.33 (1 + 1 + 0.33).
Thus, before factoring in compression, the amount of user data one can put on a 48-disk box is 8.5X the rated capacity of one disk (which may be ¼, ½, or 1 terabyte). However, just to be on the safe side, what Greenplum actually quotes is a factor of 6X, for what one might call a “loss factor” of 8:1 (48 disks of raw capacity per 6 disks’ worth of user data).
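The arithmetic above can be sketched in a few lines of Python. This is just my restatement of the steps Luke described; the variable names are my own.

```python
# Disk-usage arithmetic for a 48-disk Thor box configured to RAID 10,
# per the Greenplum figures described above.

TOTAL_DISKS = 48
OS_DISKS = 2           # operating system (i.e., software), not data
HOT_SPARES = 2         # reserved as hot spares

data_disks = TOTAL_DISKS - OS_DISKS - HOT_SPARES   # 44 disks can store data

RAID10_FACTOR = 2.0    # RAID 10 mirroring halves usable capacity
FORMAT_WASTAGE = 0.10  # ~10% general wastage from formatting
TEMP_AND_MIRROR = 2.33 # 1X user data + 1X internode mirror + 0.33X temp space

user_data_disks = (data_disks / RAID10_FACTOR) * (1 - FORMAT_WASTAGE) / TEMP_AND_MIRROR

print(round(user_data_disks, 1))  # → 8.5 (times one disk's rated capacity)
```

Greenplum’s quoted 6X is simply this 8.5X figure with a safety margin taken off the top.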
And just to confuse things — compression can get most or all of that back. For example, at a multi-petabyte customer that is loading up its Greenplum/Thor machines now, early indications suggest a compression factor of 7.5X. (I didn’t actually ask, but I assume that’s including indexes, as is common when discussing overall compression figures. As I understand the application(s), there probably aren’t a lot of indexes anyway.)
And so, after all those calculations, the amount of user data winds up to be almost exactly equal to the amount of spinning disk. (But for a vendor or database with a different compression ratio, that rough equality would of course not hold.)
That assumes I got it all right, of course. There was quite a lot of telephone discussion and back-and-forth e-mail to get even this far, so one or more errors could have easily slipped through.