December 7, 2009

A framework for thinking about data warehouse growth

There are only three ways that the amount of data stored in data warehouses can grow:

The same kinds of data are stored as before, with more being added over time.
The same kinds of data are stored as before, but in more detail.
New kinds of data are stored.

The first of those three ways doesn’t lead to dramatic growth. If a data warehouse goes up from 5 years of data to 6, then its overall size will grow a little over 20%. (How little depends on what the underlying business growth is – i.e., on how many more business events you have next year than you had 3 years ago.) That’s almost certainly going to be well-handled, by whatever technology manages your data warehouse today, given that:

Chips are still subject to something resembling Moore’s Law.
Disk capacity is still subject to Kryder’s Law, which is like Moore’s Law but with yet faster growth rates.
DBMS software gets more performant over time.

So the cost of managing your same-as-before data will go down every year, even as the volume of that data grows.

True, disk rotation speeds have only increased 12.5 times since the Eisenhower Administration. But solid-state drives (SSDs) are getting practical for data warehousing fast, so even that bottleneck eventually will get swept away. And since what we’re discussing is, basically, the first and hence presumably highest-value data to be warehoused, it’s apt to wind up on SSDs before some other kinds of data warrant that treatment. So it’s the two other factors that drive the greatest data warehouse growth.

As costs go down, the wisdom of keeping detailed data goes up. I’d go so far as to say that every piece of data generated by a human being should be preserved and kept online, legal and privacy considerations permitting.* Most forms of capital-, labor-, and/or location-based competitive advantage being commoditized and/or globalized away. But information remains a unique corporate asset. Don’t discard it lightly.

*Unless there’s an explicit law mandating data destruction, legal considerations should permit. The idea “Let’s destroy something of irreplaceable value today, against the possibility we might be brought to judgment tomorrow” is both morally and pragmatically weird. Privacy, however, may be a different matter.

What that means in practice is that “disk is the new tape.” No-apologies performance can be had on data warehouse systems for $20,000/terabyte or less – perhaps even a lot less. Tolerable performance may cost 3-4X less than that. I think a lot of the growth in data warehouse volumes is of exactly this kind.

Ultimately, however, the greatest growth in data warehouse volumes will come from new kinds of data, especially data that is partly or wholly machine-generated. Moore’s Law applied to sensor chips tells us that data creation will grow just as fast as the data storage capacity. And thus we will be throwing away most machine-generated data forever. But what we keep will grow – well, it probably will grow at Moore’s/Kryder’s Law speeds.

That’s not to say new kinds of data are all high-volume/machine-generated. Back in 2005, I wrote two pieces for Computerworld advocating aggressive pursuit of new data sources, and the examples I mentioned were:

Loyalty cards, especially in gaming
Location-based analytics
Extra customer feedback (e.g., opinion surveys)
Price/offer testing
Text mining in general
Medical records

Today I’d add (among others):

RFID
The raw output from medical test devices
Sensors up and down the energy supply chain

But some of those older, low-data-volume ideas still head my list of low-hanging analytic fruit.

One more complication – these buckets I’m outlining are less than precise. For example:

Telecom CDRs (Call Detail Records) are machine-generated from a seed of human activity. They have long been stored, but now are being kept in much more detail. This is why telecommunications is one of the top markets for data warehouse technology.
Stock trade data used to be based on human decisions. Now most of it is just machines buying and selling from each other. Either way, increasingly many investment institutions want to keep 100-terabyte-scale databases of complete historical trade detail. And that is why financial services is another huge market for data warehouse technology.
Not long ago, web and network event logs. didn’t even exist, or were tiny where they did. Now they fill the largest known commercial databases, at firms such as Yahoo, eBay, and Facebook. Even so, more is thrown away than kept, especially on the network event side, which is a multiple of the size of the pure clickstream data.
We don’t know exactly what all data intelligence agencies collect from telemetry, from monitoring commercial telecommunication traffic, and so on. But they’re surely throwing the vast majority away, even as the small part they keep is petabyte-scale.

But none of that interferes with my main points, which are:

Databases will continue to grow very quickly.
One big driver is the increasing detail in which data is kept online.
An even bigger driver will be the unending ability of machines to generate ever greater streams of at-least-somewhat interesting data.

Categories: Analytic technologies, Application areas, Data warehousing, Investment research and trading, Log analysis, Solid-state memory, Storage, Telecommunications, Text, Web analytics

Subscribe to our complete feed!

Comments

9 Responses to “A framework for thinking about data warehouse growth”

Jean-Luc Chatelain on December 7th, 2009 10:05 am

Curt

While I agree with the overall article, I challenge the statement that “So the cost of managing your same-as-before data will go down every year, even as the volume of that data grows”.

Albeit being true from a fixed asset “magnetic real estate cost”, this is not accurate from a TCO point of view. Data which is kept is meant to be be searched/retrieved/analyzed at one point which requires indexing (of all kinds). The indexing process has a high human management cost factor which as yet to be automated, kept linear with data growth. This cost is a large % of the TCO. The TCO of completely driven by the # of information objects being kept and not by the Tb capavity.

Again good article but customers have to be educated on the TCO aspect of keeping information to make the right business support IT/Infrastructure decisions.

Regards,

Jean-Luc
@informationCTO
Curt Monash on December 7th, 2009 10:13 am

Jean-Luc,

Well, I was a litle sloppy in that you don’t really lower your costs so much as an identical enterprise to you starting a new warehouse like yours has to pay less for it.

However, adding new records to an old schema does not increase the DBA cost. Instead, it should stay essentially flat. So I’m going to stand by my opinion that costs for doing “more of the same” are not something to worry much about (except insofar as you want to try to slash them).

Something I didn’t cover in this post was what happens if you want to do a lot more aggressive analytics against what’s basically the same database. That could drive your costs up and lead you to change technology just as higher data volumes can.
Jerome Pineau on December 7th, 2009 5:45 pm

I think one question I almost never see discussed is at which point is enough data just that – enough? Is the competitive assumption that the guy with the most data wins the battle? Or is it the guy with the most *important* data? Or is it the guy who can tell you best where _not_ to look in a mass of information? It seems to me the industry is more concerned about hoarding than strategically selecting what to keep (and why). I might be completely off on this, but too much information can be as lethal as insufficient data.
Curt Monash on December 9th, 2009 11:44 am

Jerome,

One can put a filter into any query that pretends one had kept less data in the first place. The problem you’re talking about is, at least in principle, trivial.
Paul Johnson on December 10th, 2009 2:48 pm

CPU clock speeds and disk storage density may both be on predictable, upward paths, but these increases don’t directly translate into increased DW capability.

The typical DW was, is, and is likely to remain IO-bound (SSD is miles away from mass adoption). It doesn’t matter how many CPU clock cycles we have, nor how much data we can store, if we can’t feed data to the CPU at a greater rate.

The increases in IO sub-system performance have not kept pace with CPU clock speed increases nor disk density increases for many, many years.

In fact, relative to both our ability to store and process data, our ability to access/retrieve data is often in reverse. Calculate your bandwidth/storage ratios over time to see what I mean.

When did any of us last enjoy a doubling of IO bandwidth in an 18 month period at little/no extra, or even less, cost? This is something that’s almost taken for granted with regards CPU power.

Machine generated data is the key driver for the massive data volume increase that are expected in certain sectors in the short/medium term.

On one hand this causes headaches due to the volumes in play, on the other hand it tends to be higher quality data than that generated via ‘key-to-disk’ OLTP systems where pesky humans input all types of junk!
Curt Monash on December 11th, 2009 12:34 am

Two things:

1. CPU clock speeds aren’t really increasing. 🙂
2. SSD I/O rates are poised to explode, a lot sooner than you evidently think.
Paul Johnson on December 11th, 2009 7:52 am

1. True, but I was referring to the Moore’s Law growth in CPU speeds that we’ve historically enjoyed.

2. I made no comment on SSD I/O rates: ‘SSD is miles away from mass adoption’. I have no doubt about the impressive I/O performance…the laptop on which I’m typing this runs SSD for good reason. I’d hate to buy more than 120GB worth though 😉
The retention of everything | DBMS2 -- DataBase Management System Services on April 4th, 2010 3:25 am

[…] I’d like to reemphasize a point I’ve been making for a while about data retention: […]
Examples and definition of machine-generated data | DBMS 2 : DataBase Management System Services on December 30th, 2010 4:17 am

[…] posts made last December, January, and April, I […]

Leave a Reply

Search our blogs and white papers

Monash Research blogs

DBMS 2 covers database management, analytics, and related technologies.
Text Technologies covers text mining, search, and social software.
Strategic Messaging analyzes marketing and messaging strategy.
The Monash Report examines technology and public policy issues.
Software Memories recounts the history of the software industry.

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.

Links
- Monash Research
- White Papers
Admin
- Log in

A framework for thinking about data warehouse growth

Comments

Search our blogs and white papers

Monash Research blogs

User consulting

Vendor advisory

Monash Research highlights

Recent posts

Categories

Date archives

Admin