Comments on: A framework for thinking about data warehouse growth

By: Examples and definition of machine-generated data | DBMS 2 : DataBase Management System Services

Thu, 30 Dec 2010 08:17:45 +0000

[…] posts made last December, January, and April, I […]

By: The retention of everything | DBMS2 -- DataBase Management System Services

The retention of everything | DBMS2 -- DataBase Management System Services — Sun, 04 Apr 2010 07:25:41 +0000

[…] I’d like to reemphasize a point I’ve been making for a while about data retention: […]

By: Paul Johnson

Paul Johnson — Fri, 11 Dec 2009 11:52:29 +0000

1. True, but I was referring to the Moore’s Law growth in CPU speeds that we’ve historically enjoyed.

2. I made no comment on SSD I/O rates: ‘SSD is miles away from mass adoption’. I have no doubt about the impressive I/O performance…the laptop on which I’m typing this runs SSD for good reason. I’d hate to buy more than 120GB worth though 😉

By: Curt Monash

Curt Monash — Fri, 11 Dec 2009 04:34:59 +0000

Two things:

1. CPU clock speeds aren’t really increasing. 🙂
2. SSD I/O rates are poised to explode, a lot sooner than you evidently think.

By: Paul Johnson

Paul Johnson — Thu, 10 Dec 2009 18:48:49 +0000

CPU clock speeds and disk storage density may both be on predictable, upward paths, but these increases don’t directly translate into increased DW capability.

The typical DW was, is, and is likely to remain IO-bound (SSD is miles away from mass adoption). It doesn’t matter how many CPU clock cycles we have, nor how much data we can store, if we can’t feed data to the CPU at a greater rate.

The increases in IO sub-system performance have not kept pace with CPU clock speed increases nor disk density increases for many, many years.

In fact, relative to our ability to both store and process data, our ability to access/retrieve data is often going in reverse. Calculate your bandwidth/storage ratios over time to see what I mean.
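As a rough illustration of the bandwidth/storage ratio mentioned above, here is a minimal Python sketch. The `DriveGeneration` class, the helper functions, and all the capacity and bandwidth figures are hypothetical placeholders chosen only to show the arithmetic; substitute the numbers from your own systems.

```python
from dataclasses import dataclass


@dataclass
class DriveGeneration:
    """One snapshot of a storage subsystem (all values are hypothetical)."""
    label: str
    capacity_gb: float      # usable capacity in GB
    bandwidth_mb_s: float   # sustained read bandwidth in MB/s


def bandwidth_per_gb(d: DriveGeneration) -> float:
    """MB/s of read bandwidth available per GB stored."""
    return d.bandwidth_mb_s / d.capacity_gb


def full_scan_seconds(d: DriveGeneration) -> float:
    """Time to read the entire capacity once, using 1 GB = 1000 MB."""
    return d.capacity_gb * 1000 / d.bandwidth_mb_s


# Hypothetical figures, not measurements: capacity grows 10x, bandwidth only 2x.
generations = [
    DriveGeneration("older", capacity_gb=100.0, bandwidth_mb_s=50.0),
    DriveGeneration("newer", capacity_gb=1000.0, bandwidth_mb_s=100.0),
]

for g in generations:
    print(g.label, bandwidth_per_gb(g), full_scan_seconds(g))
```

With these placeholder inputs the newer generation has more raw bandwidth but a lower bandwidth/storage ratio, so a full scan of everything it holds takes longer.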

When did any of us last enjoy a doubling of IO bandwidth in an 18-month period at little or no extra cost, or even at lower cost? This is almost taken for granted with regard to CPU power.

Machine-generated data is the key driver for the massive data volume increases that are expected in certain sectors in the short/medium term.

On the one hand, this causes headaches due to the volumes in play; on the other hand, it tends to be higher-quality data than that generated via ‘key-to-disk’ OLTP systems, where pesky humans input all types of junk!

By: Curt Monash

Curt Monash — Wed, 09 Dec 2009 15:44:58 +0000

Jerome,

One can put a filter into any query that pretends one had kept less data in the first place. The problem you’re talking about is, at least in principle, trivial.
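A minimal sketch of that idea in Python, using the standard-library `sqlite3` module. The `events` table, its columns, the sample rows, and the `summarize` helper are all hypothetical; the point is only that a date predicate lets a query behave as if older data had never been retained.

```python
import sqlite3

# In-memory database with a hypothetical fact table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [
        ("2007-06-01", 10.0),
        ("2008-06-01", 20.0),
        ("2009-06-01", 30.0),
    ],
)


def summarize(conn, keep_since=None):
    """Return (row_count, total_amount).

    If keep_since is given (ISO date string), rows older than that date
    are filtered out, as if they had never been kept.
    """
    if keep_since is None:
        sql = "SELECT COUNT(*), SUM(amount) FROM events"
        params = ()
    else:
        sql = "SELECT COUNT(*), SUM(amount) FROM events WHERE event_date >= ?"
        params = (keep_since,)
    return conn.execute(sql, params).fetchone()


print(summarize(conn))                # everything that was retained
print(summarize(conn, "2008-01-01"))  # pretend only 2008 onward was kept
```

ISO-formatted date strings compare correctly as text, which is why the plain `>=` comparison works here.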

By: Jerome Pineau

Jerome Pineau — Mon, 07 Dec 2009 21:45:23 +0000

I think one question I almost never see discussed is at which point is enough data just that – enough? Is the competitive assumption that the guy with the most data wins the battle? Or is it the guy with the most *important* data? Or is it the guy who can tell you best where _not_ to look in a mass of information? It seems to me the industry is more concerned about hoarding than strategically selecting what to keep (and why). I might be completely off on this, but too much information can be as lethal as insufficient data.

By: Curt Monash

Curt Monash — Mon, 07 Dec 2009 14:13:55 +0000

Jean-Luc,

Well, I was a little sloppy: you don’t really lower your own costs so much as an enterprise identical to yours, starting a new warehouse like yours, would have to pay less for it.

However, adding new records to an old schema does not increase the DBA cost. Instead, it should stay essentially flat. So I’m going to stand by my opinion that costs for doing “more of the same” are not something to worry much about (except insofar as you want to try to slash them).

Something I didn’t cover in this post was what happens if you want to do a lot more aggressive analytics against what’s basically the same database. That could drive your costs up and lead you to change technology just as higher data volumes can.

By: Jean-Luc Chatelain

Jean-Luc Chatelain — Mon, 07 Dec 2009 14:05:43 +0000

Curt

While I agree with the overall article, I challenge the statement that “So the cost of managing your same-as-before data will go down every year, even as the volume of that data grows”.

While this is true from a fixed-asset “magnetic real estate cost” point of view, it is not accurate from a TCO point of view. Data which is kept is meant to be searched/retrieved/analyzed at some point, which requires indexing (of all kinds). The indexing process has a high human management cost factor which has yet to be automated, and so it stays linear with data growth. This cost is a large % of the TCO. The TCO is completely driven by the # of information objects being kept and not by the TB capacity.
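The argument above can be sketched as a toy cost model in Python. The `yearly_tco` function and every figure fed into it are hypothetical, picked only to show how a falling price per TB can be outweighed by a management cost that scales with object count; it is an illustration of the reasoning, not a real TCO calculation.

```python
def yearly_tco(capacity_tb, cost_per_tb, object_count, mgmt_cost_per_object):
    """Return (storage_cost, management_cost, total) for one year.

    Storage cost scales with capacity; management (indexing) cost is
    assumed to scale linearly with the number of information objects.
    """
    storage = capacity_tb * cost_per_tb
    management = object_count * mgmt_cost_per_object
    return storage, management, storage + management


# Hypothetical scenario: data doubles while the price per TB halves.
year0 = yearly_tco(capacity_tb=10, cost_per_tb=1000,
                   object_count=1_000_000, mgmt_cost_per_object=0.01)
year1 = yearly_tco(capacity_tb=20, cost_per_tb=500,
                   object_count=2_000_000, mgmt_cost_per_object=0.01)

print(year0)
print(year1)
```

Under these assumptions the storage line stays flat while the management line doubles, so the total rises even though each TB is cheaper.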

Again, good article, but customers have to be educated on the TCO aspect of keeping information in order to make the right business support IT/infrastructure decisions.

Regards,

Jean-Luc
@informationCTO