On this food-oriented weekend, I could easily go on long metaphorical flights about the distinction between “raw” and “cooked” data. I’ll spare you that part — reluctantly, given my fondness for fresh fruit, sushi, and steak tartare — but there’s no escaping the importance of derived/augmented/enhanced/cooked/adjusted data for analytic data processing. The five areas I have in mind are, loosely speaking:
- Aggregates, when they are maintained, generally for reasons of performance or response time.
- Calculated scores, commonly based on data mining/predictive analytics.
- Text analytics.
- The kinds of ETL (Extract/Transform/Load) Hadoop and other forms of MapReduce are commonly used for.
- Adjusted data, especially in scientific contexts.
Probably there are yet more examples that I am at the moment overlooking. But even these should suffice to establish my point, as might even just the broad list of synonyms for the concept of “derived data” I’ve used above. Namely, one of the first questions one should ask in considering an analytic data management strategy is:
Do we have to plan for data other than what we will be storing in raw form?
Any derived data could, in principle, be re-derived each time it is needed, except in those cases where issues of security, data ownership, or whatever prevent access to the underlying raw data entirely. Thus, the reason to store derived data is usually just a matter of physical processing, as reflected in performance, price/performance, response time, and the like. This might suggest that the decision whether or not to explicitly store derived data depends on the performance characteristics of your analytic DBMS and the related technology stack. In practice, however, that often turns out not to be the case.
Choice of technology stack does indeed have a major effect on the first category I mentioned: Aggregates. Whether or not you want to maintain a physical representation of a sum, average, roll-up or whatever has a lot to do with which particular DBMS or in-memory analytic tool you are using. In Oracle, especially pre-Exadata, you’re apt to have a lot of materialized views. In Netezza, not so much. If you’re using a MOLAP (Multidimensional OnLine Analytic Processing) tool such as Essbase, you’re probably going crazy with pre-calculated roll-ups. And if you’re using Skytide, you may not be keeping the unaggregated raw data at all.
Something similar could be said about the simpler forms of data mining scoring; if you’re just doing a weighted sum, precalculation is a nice-to-have, not a must-have, depending on the speed and power of your DBMS. But that’s about as far as it goes. For more complex kinds of predictive analytic models, real-time scoring could be prohibitively slow. Ditto for social graph analysis, and the same goes for the other examples as well.
Text analytics requires a lot of processing per document. You need to tokenize (i.e., identify the boundaries of) the words, sentences, and paragraphs; identify the words’ meaning; map out the grammar; resolve references such as pronouns; and often do more besides (e.g. sentiment analysis). There are a double-digit number of steps to all that, many of them expensive. No way are you going to redo the whole process each time you do a query. Not coincidentally, MarkLogic — a huge fraction of whose business to date is for text-oriented uses — thinks heavily in terms of the enhancement and augmentation of data.
If you look through a list of actual Hadoop or other MapReduce use cases, you’ll see that a lot of them boil down to “crunch data in a big batch job to get it ready for further processing.” Most famously this gets done to weblogs, documents, images, or other nontabular data, but it can also happen to time series or traditional relational tables as well. See, for example, the use cases in the last two Aster Data slide decks I posted. Generally, those are not processes that you want to try to run real time.
Scientists have a massive need to adjust or “cook” data, a point that emerged into the public consciousness in connection with Climategate. The LSST project expects to store 4.5 petabytes of derived data per year, for a decade. Types of scientific data cooking include:
- Log processing not unlike that done in various commercial sectors.
- Assigning data to different kinds or densities of coordinate grids — “regridding” — often through a process of interpolation/approximation/estimation.
- Adjusting/normalizing data for all kinds of effects (e.g. weather cycles).
Examples where data adjustment would be needed can be found all over physical and social science and engineering. In some cases you might be able to get by with recalculating all that on the fly, but in many instances storing derived data is the only realistic option.
Similar issues arise in marketing applications, even beyond the kind of straightforward predictive analytics-based scoring and psychographic/clustering results one might expect. For example, suppose you enter bogus information into some kind of online registration form, claiming to be a 90-year-old woman when in fact you’re a 32-year-old male. If you have 400 Facebook friends, the vast majority of whom are in your age range, look at web sites about cars, poker, and video games, and have a propensity to click on ads featuring scantily-clad females, increasingly many analytic systems are smart enough to treat you as somebody other than your grandmother. But those too are complex analyses, run in advance, with the results stored in the database to fuel sub-second ad serving response times.
- The #1 or #2 kind of data mart in the Teradata user base is derived data based on a central data store, especially at eBay. Actually, those figures probably cover both derived data and pure extracts — but based on how the conversation went reported in that link, I’d guess that in most cases true derived data is involved.
- There’s a whole other area of derived OLTP data, such as bank balances or inventory levels. It’s important too.