The emphasis I’m putting on derived data is leading to a variety of questions, especially about how to tease apart several related concepts:
- Derived data.
- Many-step processes to produce derived data.
- Schema evolution.
- Temporary data constructs.
So let’s dive in.
When I started my discussion of derived data, I focused on five kinds:
- Aggregates, when they are maintained, generally for reasons of performance or response time.
- Calculated scores, commonly based on data mining/predictive analytics.
- Text analytics.
- The kinds of ETL (Extract/Transform/Load) that Hadoop and other MapReduce implementations are commonly used for.
- Adjusted data, especially in scientific contexts.
Later I added a sixth kind:
- Derived metadata, commonly for polystructured data sets (logs, text, images, video, whatever).
More kinds may yet follow.
In all cases, I was (and am) talking about data that is actually persisted into the database. Temporary tables — for example the kind frequently created by MicroStrategy — are also important in data processing, as is temp space managed solely for the convenience of the DBMS. But neither is what I mean when I talk about “derived data.”
As I noted back in June, derived data naturally leads to schema evolution. You load data into an analytic database. You do some analysis and get some interesting results — interesting enough for you to want to keep them persistently. So you extend the schema to include them. You do more research; you discover something else interesting; you extend the schema again. Repeat as needed.
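That load-analyze-extend loop can be sketched concretely. Below is a minimal illustration using Python's built-in sqlite3 as a stand-in for an analytic database; the tables, columns, and the "lifetime spend" metric are all hypothetical, chosen only to show a derived result being persisted via a schema extension.

```python
import sqlite3

# Hypothetical mini "analytic database" of customers and orders.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Acme"), (2, "Globex")])
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 100.0), (1, 250.0), (2, 75.0)])

# Analysis turns up something worth keeping (here, lifetime spend),
# so the schema is extended to hold the derived value persistently.
conn.execute("ALTER TABLE customers ADD COLUMN lifetime_spend REAL")
conn.execute("""
    UPDATE customers SET lifetime_spend =
        (SELECT COALESCE(SUM(amount), 0) FROM orders
         WHERE orders.customer_id = customers.id)
""")

print(conn.execute(
    "SELECT name, lifetime_spend FROM customers").fetchall())
# [('Acme', 350.0), ('Globex', 75.0)]
```

Each further discovery would mean another `ALTER TABLE` — which is exactly the "repeat as needed" part of the loop.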
However, in no way is derived data the only source of analytic schema evolution. Duh. Sometimes you just have new kinds of information coming in. Of course, once it’s there, you may want to derive something from it. In marketing contexts, both parts of that might be true in spades.
When I mentioned all this to my clients at MarkLogic — which was my inspiration for the polystructured/metadata example — they perked up and said “Oh! Progressive enhancement.” Indeed, it’s long been the case that a simple text processing pipeline could have >15 steps of extraction; in fact, I learned about the “tokenization chain” in 1997. If all the “progression” in the data enhancement occurs in a single processing run, that wouldn’t necessarily spawn much schema evolution. On the other hand, if you think of additional steps to add every now and then, your schema might indeed evolve over time.
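A progressive-enhancement pipeline of that sort can be sketched as an ordered chain of steps, each adding derived metadata to a document. This is a toy illustration, not any vendor's actual API; the step names and fields are made up.

```python
# Sketch of a "progressive enhancement" chain: each step enriches
# the document with derived metadata. Steps and fields are hypothetical.

def tokenize(doc):
    doc["tokens"] = doc["text"].split()
    return doc

def count_tokens(doc):
    doc["token_count"] = len(doc["tokens"])
    return doc

def flag_long(doc):
    doc["is_long"] = doc["token_count"] > 5
    return doc

PIPELINE = [tokenize, count_tokens, flag_long]

def enhance(doc, steps=PIPELINE):
    # All enhancement steps run in a single processing pass.
    for step in steps:
        doc = step(doc)
    return doc

doc = enhance({"text": "derived data drives schema evolution"})
# Appending a new step later (say, language detection) would add a new
# field -- and the persistent schema would have to evolve to hold it.
```

Running the whole chain in one pass leaves the schema stable; it's the occasional appended step that forces the schema to evolve.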
Somewhat similarly, Hadoop can be used to run “aggregation pipelines” of many tens of steps. The output of the whole thing might be a relatively small number of fields. But again, if the number or nature of the fields changes over time, schemas will need to evolve accordingly.
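The shape of such a pipeline — many stages funneling down to a few output fields — can be shown in miniature. This is plain Python standing in for a Hadoop job chain; the stage names and data are illustrative, not Hadoop APIs.

```python
from functools import reduce

# Toy stand-in for a multi-stage aggregation pipeline: each stage
# transforms the whole dataset, and the final output is a handful
# of summary fields. Records and stages are hypothetical.
records = [{"region": "east", "sales": 100},
           {"region": "west", "sales": 40},
           {"region": "east", "sales": 60}]

def filter_valid(rs):
    return [r for r in rs if r["sales"] > 0]

def group_by_region(rs):
    totals = {}
    for r in rs:
        totals[r["region"]] = totals.get(r["region"], 0) + r["sales"]
    return totals

def summarize(totals):
    # The whole pipeline reduces to a small number of output fields.
    return {"regions": len(totals), "total_sales": sum(totals.values())}

stages = [filter_valid, group_by_region, summarize]
result = reduce(lambda data, stage: stage(data), stages, records)
# result == {"regions": 2, "total_sales": 200}
```

If a later revision of the pipeline emits an extra summary field, the table receiving this output has to evolve its schema in step.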
So to sum up:
- Derived data — of multiple kinds — is very important.
- If you want to increase the value you get from derived data, you might need to evolve your schema accordingly.
- Data derivation happens to sometimes involve long processing pipelines; those pipelines might happen to offer clues as to how to do yet better at data derivation in the future; and those improvements might happen to lead to schema evolution over time.
“Just the raw facts” analytic databases are, for the most part, obsolete.