May 30, 2011

Another category of derived data

Six months ago, I argued the importance of derived analytic data, saying

… there’s no escaping the importance of derived/augmented/enhanced/cooked/adjusted data for analytic data processing. The five areas I have in mind are, loosely speaking:

  • Aggregates, when they are maintained, generally for reasons of performance or response time.
  • Calculated scores, commonly based on data mining/predictive analytics.
  • Text analytics.
  • The kinds of ETL (Extract/Transform/Load) Hadoop and other forms of MapReduce are commonly used for.
  • Adjusted data, especially in scientific contexts.

Probably there are yet more examples that I am at the moment overlooking.

Well, I did overlook at least one category. 🙂

A surprisingly important kind of derived data is metadata, especially for large, poly-structured data sets. For example, CERN has vast quantities of experiment sensor data, stored as files; the metadata alone fills over 10 terabytes in an Oracle database. MarkLogic is big on storing derived metadata, both on the publishing/media and intelligence sides of the business.

Actually, what made me think of writing this post was a few conversations at MarkLogic’s April user conference. For example, MarkLogic likes to break lunch up into subject-specific tables, each hosted either by a partner company or by one of the analysts who is attending anyway. So they asked me to hold a table about having Hadoop and MarkLogic work together. When I showed up, I discovered that most of the users at the table worked for a single organization; what’s more, they were skeptical about the table’s discussion subject, and wanted to see if I could persuade them otherwise. I gently pointed out that I hadn’t actually picked the subject, and asked them what their use cases might be like. Those turned out to be classified …

… but have no fear! Your hero thought quickly, and soon was holding forth about various ways one might combine the two technologies for various intelligence tasks. The one that finally struck a chord was — you guessed it! — metadata management. It seems they had colleagues with a lot of machine-generated data maintained in Hadoop and, upon reflection, they thought MarkLogic might be a good way to manage the metadata for same.

So should metadata management be handled relationally? Looking at my first three tests for when going relational is a slam-dunk choice:

… I’m not sure how numerous the cases are where a simple, fixed database design isn’t a good fit. Thoughts?


2 Responses to “Another category of derived data”

  1. Dan Weinreb on May 31st, 2011 8:38 am

    I’m a bit confused about the whole premise here. Some data can be derived and some is entirely additional. For example, if you store a digital photograph in your photo library folder, metadata like “how large is the photo” can be computed from the photo. But when I annotate it to say “This is Alice and Bob standing in front of the hotel we stayed at”, that can’t be derived. Usually when I hear the word “metadata”, the latter is what comes to my mind, although maybe that’s just me.

  2. Hadapt update | DBMS 2 : DataBase Management System Services on January 31st, 2014 9:09 am

    […] That all fits well with my thoughts about the importance of derived data. […]
