November 29, 2010

Data that is derived, augmented, enhanced, adjusted, or cooked

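On this food-oriented weekend, I could easily go on long metaphorical flights about the distinction between “raw” and “cooked” data. I’ll spare you that part (reluctantly, given my fondness for fresh fruit, sushi, and steak tartare), but there’s no escaping the importance of derived/augmented/enhanced/cooked/adjusted data for analytic data processing. The five areas I have in mind are, loosely speaking: aggregates; calculated scores, commonly based on data mining or predictive analytics; text analytics; the kind of batch preprocessing that Hadoop and other forms of MapReduce are commonly used for; and adjusted data, especially in scientific contexts.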

Probably there are yet more examples that I am at the moment overlooking. But even these should suffice to establish my point, as might even just the broad list of synonyms for the concept of “derived data” I’ve used above. Namely, one of the first questions one should ask in considering an analytic data management strategy is:

Do we have to plan for data other than what we will be storing in raw form?

Any derived data could, in principle, be re-derived each time it is needed, except in those cases where issues of security, data ownership, or whatever prevent access to the underlying raw data entirely. Thus, the reason to store derived data is usually just a matter of physical processing, as reflected in performance, price/performance, response time, and the like. This might suggest that the decision whether or not to explicitly store derived data depends on the performance characteristics of your analytic DBMS and the related technology stack. In practice, however, that often turns out not to be the case.

Choice of technology stack does indeed have a major effect on the first category I mentioned: Aggregates. Whether or not you want to maintain a physical representation of a sum, average, roll-up or whatever has a lot to do with which particular DBMS or in-memory analytic tool you are using. In Oracle, especially pre-Exadata, you’re apt to have a lot of materialized views. In Netezza, not so much. If you’re using a MOLAP (Multidimensional OnLine Analytic Processing) tool such as Essbase, you’re probably going crazy with pre-calculated roll-ups. And if you’re using Skytide, you may not be keeping the unaggregated raw data at all.
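To make the aggregates trade-off concrete, here is a minimal sketch in Python using the standard-library `sqlite3` module. The `sales` table, the `sales_by_region` roll-up, and both function names are hypothetical, invented purely for illustration; the stored roll-up only stands in for what a materialized view or MOLAP pre-calculation would do in a real product.

```python
import sqlite3

def build_demo(conn):
    """Load a tiny raw table and derive a stored roll-up from it."""
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("east", 10.0), ("east", 5.0), ("west", 7.5)],
    )
    # Derived data: a physically stored aggregate, playing the role
    # of a materialized view (hypothetical table name).
    conn.execute(
        "CREATE TABLE sales_by_region AS "
        "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
    )

def total_on_the_fly(conn, region):
    """Re-derive the aggregate from raw rows at query time."""
    row = conn.execute(
        "SELECT SUM(amount) FROM sales WHERE region = ?", (region,)
    ).fetchone()
    return row[0]

def total_precomputed(conn, region):
    """Read the stored aggregate; no scan of the raw rows."""
    row = conn.execute(
        "SELECT total FROM sales_by_region WHERE region = ?", (region,)
    ).fetchone()
    return row[0]

conn = sqlite3.connect(":memory:")
build_demo(conn)
```

Both functions return the same answer for a region that exists in the data. Which one is preferable depends on how fast the engine can scan the raw rows, which is exactly the stack-dependent part of the decision.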

Something similar could be said about the simpler forms of data mining scoring; if you’re just doing a weighted sum, precalculation is a nice-to-have, not a must-have, depending on the speed and power of your DBMS. But that’s about as far as it goes. For more complex kinds of predictive analytic models, real-time scoring could be prohibitively slow. Ditto for social graph analysis, and the same goes for the other examples as well.
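For the simple end of that spectrum, a weighted-sum score might look like the sketch below. The feature names, weights, and customer records are all made up for illustration. The point is only that this kind of score is cheap enough to compute either way, whereas a complex model would push you toward the precomputed dictionary.

```python
def weighted_score(features, weights):
    """Weighted sum over the features that have a weight."""
    return sum(
        weights[name] * value
        for name, value in features.items()
        if name in weights
    )

# Hypothetical model weights and raw customer features.
WEIGHTS = {"recency": 0.5, "frequency": 0.25, "spend": 0.25}
customers = {
    "c1": {"recency": 4.0, "frequency": 8.0, "spend": 12.0},
    "c2": {"recency": 2.0, "frequency": 0.0, "spend": 4.0},
}

# Derived data: scores calculated once in a batch and stored,
# so that serving a score is a lookup rather than a computation.
precomputed = {cid: weighted_score(f, WEIGHTS) for cid, f in customers.items()}

def score_at_query_time(customer_id):
    """Alternative: re-derive the score from raw features on demand."""
    return weighted_score(customers[customer_id], WEIGHTS)
```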

Text analytics requires a lot of processing per document. You need to tokenize (i.e., identify the boundaries of) the words, sentences, and paragraphs; identify the words’ meaning; map out the grammar; resolve references such as pronouns; and often do more besides (e.g. sentiment analysis). There are a double-digit number of steps to all that, many of them expensive. No way are you going to redo the whole process each time you do a query. Not coincidentally, MarkLogic — a huge fraction of whose business to date is for text-oriented uses — thinks heavily in terms of the enhancement and augmentation of data.
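As a toy illustration of enriching once and querying many times, consider the sketch below. It is not how MarkLogic or any real text analytics product works: the regex-based sentence splitting and tokenization are crude stand-ins for the expensive steps listed above, and `enrich` and `EnrichedStore` are hypothetical names.

```python
import re

def enrich(text):
    """Run the (here, trivially cheap) analysis steps once per document."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Lowercased word tokens, kept per sentence.
    tokens = [re.findall(r"[A-Za-z']+", s.lower()) for s in sentences]
    return {"sentences": sentences, "tokens": tokens}

class EnrichedStore:
    """Keeps the derived annotations so queries never re-analyze the text."""

    def __init__(self):
        self._docs = {}

    def add(self, doc_id, text):
        self._docs[doc_id] = enrich(text)

    def docs_containing(self, word):
        word = word.lower()
        return sorted(
            doc_id
            for doc_id, derived in self._docs.items()
            if any(word in sentence for sentence in derived["tokens"])
        )

store = EnrichedStore()
store.add("a", "Raw data is cheap. Cooked data is fast!")
store.add("b", "We keep the raw stuff too.")
```

Queries such as `store.docs_containing("raw")` touch only the stored tokens. With a real pipeline in place of `enrich`, that is the difference between a lookup and redoing every analysis step per query.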

If you look through a list of actual Hadoop or other MapReduce use cases, you’ll see that a lot of them boil down to “crunch data in a big batch job to get it ready for further processing.” Most famously this gets done to weblogs, documents, images, or other nontabular data, but it can also happen to time series or traditional relational tables. See, for example, the use cases in the last two Aster Data slide decks I posted. Generally, those are not processes that you want to try to run in real time.
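The shape of such a batch job can be sketched in a few lines of plain Python. This only mimics the map and reduce phases in a single process and does not use Hadoop or any MapReduce framework. The three-field log format and the names `map_line`, `reduce_counts`, and `batch_job` are assumptions made for the example.

```python
from collections import defaultdict

def map_line(line):
    """Map phase: emit (path, 1) for each successful request.

    Assumes a hypothetical log format of "<ip> <path> <status>";
    malformed lines are silently skipped.
    """
    parts = line.split()
    if len(parts) == 3 and parts[2] == "200":
        yield parts[1], 1

def reduce_counts(pairs):
    """Reduce phase: sum the values emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

def batch_job(lines):
    """Crunch raw log lines into a table ready for further processing."""
    pairs = (pair for line in lines for pair in map_line(line))
    return reduce_counts(pairs)

LOG = [
    "1.1.1.1 /home 200",
    "2.2.2.2 /home 200",
    "3.3.3.3 /pricing 404",
    "4.4.4.4 /pricing 200",
    "garbage",
]
page_hits = batch_job(LOG)
```

The output, `page_hits`, is the derived data: a small tabular result that downstream queries can read without ever touching the raw log lines again.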

Scientists have a massive need to adjust or “cook” data, a point that emerged into the public consciousness in connection with Climategate. The LSST project expects to store 4.5 petabytes of derived data per year, for a decade. Types of scientific data cooking include:

Examples where data adjustment would be needed can be found all over physical and social science and engineering. In some cases you might be able to get by with recalculating all that on the fly, but in many instances storing derived data is the only realistic option.

Similar issues arise in marketing applications, even beyond the kind of straightforward predictive analytics-based scoring and psychographic/clustering results one might expect. For example, suppose you enter bogus information into some kind of online registration form, claiming to be a 90-year-old woman when in fact you’re a 32-year-old male. If you have 400 Facebook friends, the vast majority of whom are in your age range, look at web sites about cars, poker, and video games, and have a propensity to click on ads featuring scantily-clad females, increasingly many analytic systems are smart enough to treat you as somebody other than your grandmother. But those too are complex analyses, run in advance, with the results stored in the database to fuel sub-second ad serving response times.

Comments

7 Responses to “Data that is derived, augmented, enhanced, adjusted, or cooked”

  1. Dan Weinreb on November 30th, 2010 10:02 am

    One small point: Mike Stonebraker recently gave a talk about the DBMS being developed to store the data from the LSST (amongst other things) called SciDB. One requirement is that the “cooking” can be done again later. Of course, one would rather not pay for this again, but there are various reasons you might want to, and the capability must be retained. E.g., suppose you find out that the new-comet-finding algorithm that you had run over the data had a bug or has been improved? The intention is for the cooked data to be tagged with “provenance” metadata that would say “we got this by taking dataset X and running software Y of version Z over it.” None of this contradicts anything you said; I’m just adding it.

  2. Curt Monash on November 30th, 2010 5:26 pm

    Dan,

    Quite right. As previously noted in, e.g., my posts about SciDB :) , scientists sometimes frame the issue not as “We’ve always kept raw data; let’s keep derived data too” but more as “We’ve always kept cooked data, but we really need to keep the original raw stuff as well.”

    This is a cousin of the trend in commercial data warehousing to keep all detailed data, when in the past all that was kept might have been summaries.

  3. Gareth Horton on December 2nd, 2010 12:23 pm

    We’ve been dealing with reports in this way for many years.

    Although much of this cleansed, aggregated data is accessible somewhere else in raw format and can, with some work, be recreated, there is a surprising amount that is exceedingly complex to reproduce after the fact.

    This is due to highly complex, point-in-time business rules in LOB systems. It mostly occurs in financial analysis and auditing.

    In addition, there are often regulatory requirements for old aggregated data.

    Here’s a link to a couple of real-world examples – don’t mean it to be a pitch, but that’s the real-world experience I have of these issues:

    http://www.datawatch.com/case_studies/Monarch/banking/Columbia_Bank_data.pdf

    http://www.datawatch.com/case_studies/Monarch/banking/MissionBank_User_Story.pdf

    Gareth

  4. Planning for derived data | Analytics Team on January 23rd, 2011 2:22 pm

    [...] by definition, derived data is based on lower level raw data, it could be derived again as needed. DBMS2 takes us through some thinking about how better to handle derived [...]

  5. Another category of derived data | DBMS 2 : DataBase Management System Services on May 30th, 2011 10:53 pm

    [...] months ago, I argued the importance of derived analytic data, saying … there’s no escaping the importance of derived/augmented/enhanced/cooked/adjusted [...]

  7. Layering of database technology & DBMS with multiple DMLs | DBMS 2 : DataBase Management System Services on January 31st, 2014 9:04 am

    [...] scenarios where you incrementally derive and enhance data, it’s natural to want to keep everything in the same place. (That also helps with [...]
