I had one of “those” trips last week:
- 20 meetings, a number of them running multiple hours.
- A broken laptop.
- Flights that arrived 10:30ish Sunday night and left 7:00 Saturday morning.
So please pardon me if things are a bit disjointed …
I’ve argued for a while that:
- All human-generated data should be retained.
- The more important kinds of machine-generated data should be retained as well.
- Raw data isn’t enough; it’s really important to store derived data as well.
Here are a few notes on the derived data trend.
Hortonworks founder/CTO Eric Baldeschwieler, aka Eric 14, doesn’t generally use the term, but he is a big proponent these days of the derived data story. Eric likes to position Hadoop as a “data refinery”, where, among other things, you transform data and do “iterative analytics” on it. And he’s getting buy-in; for example, that formulation was prominent in the joint Teradata/Hortonworks vision announcement.
The KXEN guys don’t use the term “derived data” much either, but they tend to see the idea as central to predictive modeling even so. The argument in essence is that traditional predictive modeling consists of three steps:
1. Think hard about exactly which variables you want to model on.
2. Do transformations on those variables so that they fit into your favored statistical algorithm (commonly linear regression, although KXEN favors nonlinear choices).
3. Press a button to run the algorithm.
#3 is the most automated part, and #1 is what KXEN thinks its technology makes unnecessary. Hence #2, they suggest, is often the bulk of the modeling effort — except now they want to automate that away too.
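To make the three steps concrete, here is a minimal sketch in Python with NumPy. It is my own illustration, not anything KXEN ships: the variable names (`spend`, `sales`) and the synthetic data are assumptions, and the "algorithm" is plain one-variable least squares. The point is only that the same button-press in step 3 does much better once step 2 has reshaped the variable to suit a linear fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: pick the variable to model on (hypothetical: ad spend).
spend = rng.uniform(1.0, 1000.0, size=500)

# Synthetic response (an assumption for illustration): logarithmic in spend, plus noise.
sales = 3.0 * np.log(spend) + rng.normal(0.0, 0.5, size=500)


def fit_r2(x, y):
    """Step 3: one-variable ordinary least squares; returns R^2 of the fit."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1.0 - resid.var() / y.var()


# Running the algorithm on the raw variable ...
r2_raw = fit_r2(spend, sales)

# ... versus Step 2: transform first, then run the same algorithm.
r2_log = fit_r2(np.log(spend), sales)

print(f"R^2 on raw spend: {r2_raw:.3f}")
print(f"R^2 on log spend: {r2_log:.3f}")
```

On this made-up data the log-transformed fit should come out clearly ahead, which is the sense in which the transformation step carries the modeling effort.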
And then there are my new clients at MarketShare, a predictive modeling consulting company focused on marketing use cases, which also has a tech layer (accelerated via the acquisition of JovianDATA). It turns out that a typical MarketShare model is fed by a low double-digit number of other models, each of which is doing some kind of data transformation. The final step is typically a linear regression, yielding coefficients of the sort that marketers recognize and (think they) understand. Earlier steps are typically transformations on individual variables. I didn’t see many examples, but the transformations clearly go beyond the traditional rescaling — log, log (x/(1-x)), binning, whatever — to involve multiplication by what could be construed as other variables. I.e., there seemed to be a polynomial flavor to the whole thing.
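Here is a rough sketch of what that kind of pipeline could look like, again in Python with NumPy. To be clear, this is not MarketShare's actual method; I didn't see enough examples to reproduce one. The variables (`spend`, `share`, `season`), the bin edges, and the synthetic response are all assumptions for illustration. It just shows derived variables (log, log(x/(1-x)), binning, and a product of two variables) feeding a final linear regression whose coefficients can be read off.

```python
import numpy as np


def logit(p):
    """log(x / (1 - x)), for proportions strictly between 0 and 1."""
    return np.log(p / (1.0 - p))


rng = np.random.default_rng(1)
n = 1000

# Hypothetical raw inputs.
spend = rng.uniform(1.0, 100.0, n)   # marketing spend
share = rng.uniform(0.05, 0.95, n)   # a proportion in (0, 1)
season = rng.uniform(0.5, 1.5, n)    # a seasonal multiplier

# Derived data: traditional rescalings ...
log_spend = np.log(spend)
logit_share = logit(share)
spend_bin = np.digitize(spend, bins=[25.0, 50.0, 75.0])  # bin index 0..3

# ... plus multiplication by another variable (the "polynomial flavor").
interaction = log_spend * season

# Synthetic response built from the derived variables (an assumption).
y = 2.0 * interaction + 0.5 * logit_share + rng.normal(0.0, 0.3, n)

# Final step: linear regression on the derived variables.
X = np.column_stack([np.ones(n), interaction, logit_share, spend_bin])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

for name, c in zip(["intercept", "interaction", "logit_share", "spend_bin"], coef):
    print(f"{name:>12}: {c: .3f}")
```

In a real deployment each derived column would presumably be the stored output of its own upstream model rather than a one-liner, which is exactly why retaining derived data matters.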