The emphasis I’m putting on derived data is leading to a variety of questions, especially about how to tease apart several related concepts:
- Derived data.
- Many-step processes to produce derived data.
- Schema evolution.
- Temporary data constructs.
So let’s dive in.
When I started my discussion of derived data, I focused on five kinds:
- Aggregates, when they are maintained, generally for reasons of performance or response time.
- Calculated scores, commonly based on data mining/predictive analytics.
- Text analytics.
- The kinds of ETL (Extract/Transform/Load) that Hadoop and other MapReduce implementations are commonly used for.
- Adjusted data, especially in scientific contexts.
Later I added a sixth kind:
- Derived metadata, commonly for polystructured data sets (logs, text, images, video, whatever).
More kinds may yet follow.
In all cases, I was (and am) talking about data that is actually persisted into the database. Temporary tables — for example the kind frequently created by MicroStrategy — are also important in data processing, as is temp space managed solely for the convenience of the DBMS. But neither is what I mean when I talk about “derived data.”
As I noted back in June, derived data naturally leads to schema evolution. You load data into an analytic database. You do some analysis and get some interesting results — interesting enough for you to want to keep them persistently. So you extend the schema to include them. You do more research; you discover something else interesting; you extend the schema again. Repeat as needed.
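That load-analyze-extend loop can be sketched concretely. Below is a minimal illustration using Python's built-in sqlite3 as a stand-in for an analytic database; the tables, columns, and the "lifetime spend" metric are all hypothetical, chosen only to show a derived result being persisted via a schema extension.

```python
import sqlite3

# Hypothetical mini "analytic database" of customers and orders.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Acme"), (2, "Globex")])
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 100.0), (1, 250.0), (2, 75.0)])

# Analysis turns up something worth keeping (here, lifetime spend),
# so the schema is extended to hold the derived value persistently.
conn.execute("ALTER TABLE customers ADD COLUMN lifetime_spend REAL")
conn.execute("""
    UPDATE customers SET lifetime_spend =
        (SELECT COALESCE(SUM(amount), 0) FROM orders
         WHERE orders.customer_id = customers.id)
""")

print(conn.execute(
    "SELECT name, lifetime_spend FROM customers").fetchall())
# [('Acme', 350.0), ('Globex', 75.0)]
```

Each further discovery would mean another `ALTER TABLE` — which is exactly the "repeat as needed" part of the loop.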
However, in no way is derived data the only source of analytic schema evolution. Duh. Sometimes you just have new kinds of information coming in. Of course, once it’s there, you may want to derive something from it. In marketing contexts, both parts of that might be true in spades.
When I mentioned all this to my clients at MarkLogic — which was my inspiration for the polystructured/metadata example — they perked up and said “Oh! Progressive enhancement.” Indeed, it’s long been the case that a simple text processing pipeline could have >15 steps of extraction; in fact, I learned about the “tokenization chain” in 1997. If all the “progression” in the data enhancement occurs in a single processing run, that wouldn’t necessarily spawn much schema evolution. On the other hand, if you think of additional steps to add every now and then, your schema might indeed evolve over time.
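A progressive-enhancement pipeline of that sort can be sketched as an ordered chain of steps, each adding derived metadata to a document. This is a toy illustration, not any vendor's actual API; the step names and fields are made up.

```python
# Sketch of a "progressive enhancement" chain: each step enriches
# the document with derived metadata. Steps and fields are hypothetical.

def tokenize(doc):
    doc["tokens"] = doc["text"].split()
    return doc

def count_tokens(doc):
    doc["token_count"] = len(doc["tokens"])
    return doc

def flag_long(doc):
    doc["is_long"] = doc["token_count"] > 5
    return doc

PIPELINE = [tokenize, count_tokens, flag_long]

def enhance(doc, steps=PIPELINE):
    # All enhancement steps run in a single processing pass.
    for step in steps:
        doc = step(doc)
    return doc

doc = enhance({"text": "derived data drives schema evolution"})
# Appending a new step later (say, language detection) would add a new
# field -- and the persistent schema would have to evolve to hold it.
```

Running the whole chain in one pass leaves the schema stable; it's the occasional appended step that forces the schema to evolve.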
Somewhat similarly, Hadoop can be used to run “aggregation pipelines” of many tens of steps. The output of the whole thing might be a relatively small number of fields. But again, if the number or nature of the fields changes over time, schemas will need to evolve accordingly.
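The shape of such a pipeline — many stages funneling down to a few output fields — can be shown in miniature. This is plain Python standing in for a Hadoop job chain; the stage names and data are illustrative, not Hadoop APIs.

```python
from functools import reduce

# Toy stand-in for a multi-stage aggregation pipeline: each stage
# transforms the whole dataset, and the final output is a handful
# of summary fields. Records and stages are hypothetical.
records = [{"region": "east", "sales": 100},
           {"region": "west", "sales": 40},
           {"region": "east", "sales": 60}]

def filter_valid(rs):
    return [r for r in rs if r["sales"] > 0]

def group_by_region(rs):
    totals = {}
    for r in rs:
        totals[r["region"]] = totals.get(r["region"], 0) + r["sales"]
    return totals

def summarize(totals):
    # The whole pipeline reduces to a small number of output fields.
    return {"regions": len(totals), "total_sales": sum(totals.values())}

stages = [filter_valid, group_by_region, summarize]
result = reduce(lambda data, stage: stage(data), stages, records)
# result == {"regions": 2, "total_sales": 200}
```

If a later revision of the pipeline emits an extra summary field, the table receiving this output has to evolve its schema in step.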
So to sum up:
- Derived data — of multiple kinds — is very important.
- If you want to increase the value you get from derived data, you might need to evolve your schema accordingly.
- Data derivation happens to sometimes involve long processing pipelines; those pipelines might happen to offer clues as to how to do yet better at data derivation in the future; and those improvements might happen to lead to schema evolution over time.
“Just the raw facts” analytic databases are, for the most part, obsolete.