There are only three ways that the amount of data stored in data warehouses can grow:
- The same kinds of data are stored as before, with more being added over time.
- The same kinds of data are stored as before, but in more detail.
- New kinds of data are stored.
The first of those three ways doesn’t lead to dramatic growth. If a data warehouse grows from 5 years of data to 6, then its overall size will grow a little over 20%. (How much over 20% depends on the underlying business growth rate – i.e., on how many more business events you’ll have next year than you had 3 years ago; a quick back-of-the-envelope calculation follows the list below.) That’s almost certainly going to be well-handled by whatever technology manages your data warehouse today, given that:
- Chips are still subject to something resembling Moore’s Law.
- Disk capacity is still subject to Kryder’s Law, which is like Moore’s Law but with yet faster growth rates.
- DBMS software gets more performant over time.
So the cost of managing your same-as-before data will go down every year, even as the volume of that data grows.
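Here’s that back-of-the-envelope calculation as a minimal Python sketch; the 5% annual growth in business events is an assumed figure, purely for illustration:

```python
# How much does a warehouse grow when a 6th year of data is added
# to 5 existing years, given steadily growing business volumes?
# The 5% annual growth rate is an assumption for illustration only.

growth_rate = 0.05  # assumed annual growth in business events

# Events per year, normalized so that year 1 = 1.0
yearly_events = [(1 + growth_rate) ** n for n in range(6)]

first_five = sum(yearly_events[:5])  # data already in the warehouse
sixth_year = yearly_events[5]        # the newly added year

pct_growth = 100 * sixth_year / first_five
print(f"Adding year 6 grows the warehouse by {pct_growth:.1f}%")
# Prints about 23.1% -- "a little over 20%", as claimed above.
```

With flat business volumes the answer would be exactly 20%; the faster the business grows, the further above 20% it climbs.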
True, disk rotation speeds have only increased 12.5-fold since the Eisenhower Administration. But solid-state drives (SSDs) are fast becoming practical for data warehousing, so even that bottleneck will eventually be swept away. And since what we’re discussing is, basically, the first and hence presumably highest-value data to be warehoused, it’s apt to wind up on SSDs before other kinds of data warrant that treatment. So it’s the other two factors that drive the greatest data warehouse growth.
As costs go down, the wisdom of keeping detailed data goes up. I’d go so far as to say that every piece of data generated by a human being should be preserved and kept online, legal and privacy considerations permitting.* Most forms of capital-, labor-, and/or location-based competitive advantage are being commoditized and/or globalized away. But information remains a unique corporate asset. Don’t discard it lightly.
*Unless there’s an explicit law mandating data destruction, legal considerations should permit. The idea “Let’s destroy something of irreplaceable value today, against the possibility we might be brought to judgment tomorrow” is both morally and pragmatically weird. Privacy, however, may be a different matter.
What that means in practice is that “disk is the new tape.” No-apologies performance can be had on data warehouse systems for $20,000/terabyte or less – perhaps even a lot less. Tolerable performance may be had for a third to a quarter of that. I think a lot of the growth in data warehouse volumes is of exactly this kind.
Ultimately, however, the greatest growth in data warehouse volumes will come from new kinds of data, especially data that is partly or wholly machine-generated. Moore’s Law applied to sensor chips tells us that data creation will grow about as fast as data storage capacity. Since creation already outstrips what we can affordably store, we will be throwing away most machine-generated data forever. But what we do keep will grow – probably at Moore’s/Kryder’s Law speeds.
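To see why both halves of that claim can hold at once, here’s a minimal sketch; the 40% annual growth rate and the 10% retained share are assumed illustrative numbers, not measurements:

```python
# If data creation and storage capacity grow at the same exponential
# rate, the share of data we can afford to keep stays constant --
# so most machine-generated data is discarded forever, even though
# the retained volume itself grows exponentially.
# Both figures below are assumptions for illustration only.

rate = 0.40        # assumed annual growth of creation AND capacity
kept_share = 0.10  # assumed share of created data we can afford to keep

created = 1.0                # normalized data created in year 0
kept = created * kept_share  # data retained in year 0

for year in range(5):
    print(f"year {year}: created {created:6.2f}, "
          f"kept {kept:5.2f} ({kept / created:.0%} retained)")
    created *= 1 + rate
    kept *= 1 + rate
```

The retained fraction never budges from 10%, yet the kept volume grows 40% a year – discarding “most” and growing “at Moore’s/Kryder’s Law speeds” are perfectly compatible.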
That’s not to say new kinds of data are all high-volume/machine-generated. Back in 2005, I wrote two pieces for Computerworld advocating aggressive pursuit of new data sources, and the examples I mentioned were:
- Loyalty cards, especially in gaming
- Location-based analytics
- Extra customer feedback (e.g., opinion surveys)
- Price/offer testing
- Text mining in general
- Medical records
Today I’d add (among others):
- The raw output from medical test devices
- Sensors up and down the energy supply chain
But some of those older, low-data-volume ideas still head my list of low-hanging analytic fruit.
One more complication – these buckets I’m outlining are less than precise. For example:
- Telecom CDRs (Call Detail Records) are machine-generated from a seed of human activity. They have long been stored, but now are being kept in much more detail. This is why telecommunications is one of the top markets for data warehouse technology.
- Stock trade data used to be based on human decisions. Now most of it is just machines buying from and selling to each other. Either way, increasingly many investment institutions want to keep 100-terabyte-scale databases of complete historical trade detail. And that is why financial services is another huge market for data warehouse technology.
- Not long ago, web and network event logs didn’t even exist, or were tiny where they did. Now they fill the largest known commercial databases, at firms such as Yahoo, eBay, and Facebook. Even so, more is thrown away than kept, especially on the network event side, which is a multiple of the size of the pure clickstream data.
- We don’t know exactly what data intelligence agencies collect from telemetry, from monitoring commercial telecommunications traffic, and so on. But they’re surely throwing the vast majority away, even as the small part they keep is petabyte-scale.
But none of that interferes with my main points, which are:
- Databases will continue to grow very quickly.
- One big driver is the increasing detail in which data is kept online.
- An even bigger driver will be the unending ability of machines to generate ever greater streams of at-least-somewhat interesting data.