Six months ago, I argued the importance of derived analytic data, saying
… there’s no escaping the importance of derived/augmented/enhanced/cooked/adjusted data for analytic data processing. The five areas I have in mind are, loosely speaking:
- Aggregates, when they are maintained, generally for reasons of performance or response time.
- Calculated scores, commonly based on data mining/predictive analytics.
- Text analytics.
- The kinds of ETL (Extract/Transform/Load) Hadoop and other forms of MapReduce are commonly used for.
- Adjusted data, especially in scientific contexts.
Probably there are yet more examples that I am at the moment overlooking.
Well, I did overlook at least one category.
A surprisingly important kind of derived data is metadata, especially for large, poly-structured data sets. For example, CERN has vast quantities of experiment sensor data, stored as files; just the metadata alone fills over 10 terabytes in an Oracle database. MarkLogic is big on storing derived metadata, both on the publishing/media and intelligence sides of the business.
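To make "metadata as derived data" concrete, here's a minimal sketch of deriving a metadata record from a raw data file. The fields chosen (size, checksum, line count) are illustrative assumptions, not anyone's actual metadata schema:

```python
import hashlib
import json
import os
import tempfile

def derive_metadata(path):
    """Derive a small metadata record from a raw data file.
    The fields here are hypothetical, purely for illustration."""
    with open(path, "rb") as f:
        data = f.read()
    return {
        "path": path,
        "size_bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
        "line_count": data.count(b"\n"),
    }

# Demo on a throwaway file standing in for one sensor-data file.
with tempfile.NamedTemporaryFile("w", suffix=".dat", delete=False) as f:
    f.write("run=42 sensor=7 value=3.14\n")
    tmp = f.name
record = derive_metadata(tmp)
print(json.dumps(record, indent=2))
os.remove(tmp)
```

Multiply a record like that by hundreds of millions of files and the metadata store becomes a sizable database in its own right.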
Actually, what made me think of writing this post was a few conversations at MarkLogic’s April user conference. For example, MarkLogic likes to break lunch up into subject-specific tables, hosted either by a partner company, or by one of the analysts who is attending anyway. So they asked me to host a table about having Hadoop and MarkLogic work together. When I showed up, I discovered that most of the users at the table worked for a single organization; what’s more, they were skeptical about the table’s discussion subject, and wanted to see whether I could persuade them otherwise. I gently pointed out that I hadn’t actually picked the subject, and asked them what their use cases might be like. Those turned out to be classified …
… but have no fear! Your hero thought quickly, and soon was holding forth about various ways one might combine the two technologies for various intelligence tasks. The one that finally struck a chord was — you guessed it! — metadata management. It seems they had colleagues with a lot of machine-generated data maintained in Hadoop and, upon reflection, they thought MarkLogic might be a good way to manage the metadata for same.
So should metadata management be handled relationally? Looking at my first three tests for when going relational is a slam-dunk choice:
- I don’t think the application suites exploiting derived metadata are complex enough to support a strong pro-relational bias.
- I don’t think the benefits of normalization are intense enough to mandate relational storage. (Also, since provenance matters, some of the traditional benefits of normalization are obviated — you may actually want out-of-date information in some cases.)
- There certainly are some cases where you can set up a fixed schema, have one row of metadata per object, and be happy. In those cases, a relational database likely suffices, and is probably the right choice, but …
… I’m not sure how numerous the cases are where a simple, fixed database design isn’t a good fit. Thoughts?
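For the cases where relational does fit, the "fixed schema, one row of metadata per object" setup is about as simple as database design gets. A minimal sketch, with table and column names invented for illustration:

```python
import sqlite3

# Hypothetical fixed-schema metadata table: one row per stored object.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE object_metadata (
        object_id   TEXT PRIMARY KEY,
        size_bytes  INTEGER,
        created_utc TEXT,
        checksum    TEXT
    )
""")
rows = [
    ("files/run042.dat", 1048576, "2012-04-01T12:00:00Z", "ab12"),
    ("files/run043.dat", 2097152, "2012-04-01T13:30:00Z", "cd34"),
]
conn.executemany("INSERT INTO object_metadata VALUES (?, ?, ?, ?)", rows)
conn.commit()

# Queries over a fixed schema are then plain SQL.
total = conn.execute(
    "SELECT SUM(size_bytes) FROM object_metadata"
).fetchone()[0]
print(total)
```

The moment the metadata varies by object type, though, you're either adding nullable columns, adding tables, or stuffing a blob column with semi-structured data, which is where the non-relational alternatives start to look attractive.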