September 6, 2011

Derived data, progressive enhancement, and schema evolution

The emphasis I’m putting on derived data is leading to a variety of questions, especially about how to tease apart several related concepts: derived data itself, progressive enhancement, and schema evolution.

So let’s dive in. 

When I started my discussion of derived data, I focused on five kinds:

Later I added a sixth kind:

More kinds may yet follow.

In all cases, I was (and am) talking about data that is actually persisted into the database. Temporary tables — for example, the kind frequently created by MicroStrategy — are also important in data processing, as is temp space managed solely for the convenience of the DBMS. But neither is what I mean when I talk about “derived data.”

As I noted back in June, derived data naturally leads to schema evolution. You load data into an analytic database. You do some analysis and get some interesting results — interesting enough for you to want to keep them persistently. So you extend the schema to include them. You do more research; you discover something else interesting; you extend the schema again. Repeat as needed.
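
To make that loop concrete, here is a minimal sketch in Python with SQLite; the purchases/customer_scores tables and their columns are invented for illustration, and the point is only the pattern of “persist an interesting result, extend the schema, repeat,” not any particular analytic DBMS.

    import sqlite3

    # Hypothetical raw fact table: purchase events as they were loaded.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE purchases (customer_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO purchases VALUES (?, ?)",
        [(1, 20.0), (1, 35.0), (2, 5.0)],
    )

    # Analysis turns up something worth keeping (a per-customer spend total),
    # so the schema grows a derived table to hold it.
    conn.execute("""
        CREATE TABLE customer_scores AS
        SELECT customer_id, SUM(amount) AS total_spend
        FROM purchases
        GROUP BY customer_id
    """)

    # Later research suggests another derived attribute; the schema evolves again.
    conn.execute("ALTER TABLE customer_scores ADD COLUMN high_value INTEGER")
    conn.execute("UPDATE customer_scores SET high_value = (total_spend > 30)")

    print(conn.execute("SELECT * FROM customer_scores").fetchall())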

However, in no way is derived data the only source of analytic schema evolution. Duh. Sometimes you just have new kinds of information coming in. Of course, once it’s there, you may want to derive something from it. 🙂 In marketing contexts, both parts of that might be true in spades.

When I mentioned all this to my clients at MarkLogic — which was my inspiration for the polystructured/metadata example — they perked up and said “Oh! Progressive enhancement.” Indeed, it has long been the case that a simple text processing pipeline could have more than 15 extraction steps; I first learned about the “tokenization chain” back in 1997. If all the “progression” in the data enhancement occurs in a single processing run, that wouldn’t necessarily spawn much schema evolution. On the other hand, if you keep thinking of additional steps to add every now and then, your schema might indeed evolve over time.
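
As a toy illustration of how such a chain interacts with schemas, here is a sketch in Python; the step names are made up, and a real enhancement pipeline would of course be far longer.

    # Each step reads a document and adds derived fields to it.
    def tokenize(doc):
        doc["tokens"] = doc["text"].split()
        return doc

    def count_tokens(doc):
        doc["token_count"] = len(doc["tokens"])
        return doc

    def flag_long(doc):
        doc["is_long"] = doc["token_count"] > 100
        return doc

    # Today's chain has three steps. If somebody appends a fourth next month
    # (say, entity extraction), the new fields it emits have to show up in
    # whatever schema the enhanced documents are persisted into.
    PIPELINE = [tokenize, count_tokens, flag_long]

    def enhance(doc):
        for step in PIPELINE:
            doc = step(doc)
        return doc

    print(enhance({"text": "derived data leads to schema evolution"}))

If the whole chain runs in a single pass, the schema is fixed up front; it is the habit of bolting on new steps that makes it a moving target.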

Somewhat similarly, Hadoop can be used to run “aggregation pipelines” of many tens of steps. The output of the whole thing might be a relatively small number of fields. But again, if the number or nature of those fields changes over time, schemas will need to evolve accordingly.
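
Here is the same idea in miniature, as plain Python rather than actual Hadoop code; the stages and field names are invented, and a real pipeline would have tens of stages rather than two.

    # Stage 1: roll raw events up by user.
    def by_user(events):
        totals = {}
        for e in events:
            totals[e["user"]] = totals.get(e["user"], 0) + e["bytes"]
        return totals

    # Stage 2: boil the rollup down to a handful of output fields.
    def summarize(totals):
        return {
            "distinct_users": len(totals),
            "total_bytes": sum(totals.values()),
            "max_user_bytes": max(totals.values()),
        }

    stages = [by_user, summarize]  # imagine tens of these

    result = [
        {"user": "a", "bytes": 120},
        {"user": "b", "bytes": 300},
        {"user": "a", "bytes": 80},
    ]
    for stage in stages:
        result = stage(result)

    print(result)  # a small, fixed set of output fields

Change or add a stage and that small set of fields changes with it, which is exactly what pushes the downstream schema to evolve.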

So to sum up:

“Just the raw facts” analytic databases are, for the most part, obsolete.
