August 8, 2012

What kinds of metadata are important anyway?

In today’s post about HCatalog, I noted that the Hadoop/HCatalog community didn’t necessarily understand all the kinds of metadata that enterprises need and want, especially in the context of data integration and ETL or ELT (Extract/Transform/Load or Extract/Load/Transform). That raises a natural question — what kinds of metadata do users need or want? In the hope of spurring discussion, from vendors and users alike, I’m splitting this question out into a separate post.

Please comment with your thoughts about ETL-related metadata needs. The conversation needs to advance.

In the relational world, there are at least three kinds of metadata:

As I see it:

I surely don’t know all the issues that people care about in data lineage, provenance, and history, but here are some of the ones that come to mind.

For each of those questions there are two parts to an answer:

Keeping the raw information is tantamount to storing


as well.

That actually isn’t crazy; it just boils down to keeping all your logs. Or, in cases where your logs are so big that it really is crazy, you can do some initial data reduction and then expect accurate history only from its conclusion onward. So the bigger problems lie, not in keeping the metadata, but rather in making it accessible. The two greatest theoretical challenges may be:

The first-biggest practical problem may be:

I don’t know what the state of theory (or practice) is about all that — which is exactly why I’ve written this post.
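To make the “keep all your logs” idea concrete, here is a toy sketch of a lineage log: every transformation step appends one record, and lineage questions become walks over that log. All of the names here (`LineageRecord`, `record`, `upstream`, the data set names) are hypothetical illustration, not any actual product’s API.

```python
from dataclasses import dataclass

# One record per transformation step -- the "raw information" the post
# talks about keeping. In practice this would live in your job logs.
@dataclass
class LineageRecord:
    output: str        # data set produced
    inputs: list       # data sets consumed
    operation: str     # e.g. "filter", "join", "aggregate"
    timestamp: int     # when the step ran

log = []

def record(output, inputs, operation, timestamp):
    log.append(LineageRecord(output, inputs, operation, timestamp))

def upstream(dataset):
    """Walk the log backwards to find every ancestor of a data set."""
    ancestors = set()
    frontier = [dataset]
    while frontier:
        d = frontier.pop()
        for rec in log:
            if rec.output == d:
                for src in rec.inputs:
                    if src not in ancestors:
                        ancestors.add(src)
                        frontier.append(src)
    return ancestors

record("clean_orders", ["raw_orders"], "filter", 1)
record("orders_by_region", ["clean_orders", "regions"], "join", 2)
print(upstream("orders_by_region"))  # traces back to the raw feeds
```

The “initial data reduction” the post mentions would amount to summarizing or truncating old records in `log`, after which `upstream` is only accurate from that point onward.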

One last point — there’s a whole other level of metadata I’m ignoring in this post, because:


9 Responses to “What kinds of metadata are important anyway?”

  1. Mahesh Paolini-Subramanya on August 8th, 2012 8:46 am

    A *different* kind of meta-data – “what was the context around the creation of this particular database/solution/whatever?”
    This may not seem like much, but if there is any kind of organizational churn, you – the inheritor – inevitably end up staring at some aspect of the system with a puzzled look on your face, wondering why, exactly, those two fields are linked (or not linked; you get the point).
    Even worse, you end up with the “there must be a reason, leave it alone” approach – i.e., the database rot spreads…

  2. Curt Monash on August 8th, 2012 9:41 am

    Good point, Mahesh! But it fits with the stuff in my last paragraph, in that it doesn’t need to be technically well-integrated with the other stuff.

  3. Rico on August 9th, 2012 4:11 am

    Regarding statistics: I think that stats are important for optimization, meaning they’re more than just “nice to have”. And I also think that stats shouldn’t be limited to base-table and column stats. It’s getting more and more important to have stats on expressions that are reliable (the independence assumption is often not good) and accurate.

  4. Curt Monash on August 9th, 2012 4:29 am

    Well, the design philosophy of Hadoop — and especially of Hive — seems to be that MPP performance is a nice-to-have, not a must-have. ;)

  5. Erich Hochmuth on August 9th, 2012 9:16 pm

    As Hadoop progresses through the hype cycle, at some point there will be backlash, and this lack of metadata could be the cause. Without proper metadata management or governance, enterprises that aren’t careful could end up making decisions based on secondary data sets with no idea how they were generated or what the original source was. Despite that, Hadoop is a very powerful tool when used appropriately.

  6. Andrew Purtell on August 10th, 2012 8:09 pm

    Speaking of HCatalog (which provides InputFormat- and OutputFormat-level plugins) and provenance: something like Stanford’s RAMP, which is also implemented with Hadoop’s interfaces, would be a natural extension, at least for any workflow rendered as MapReduce jobs (Pig, Hive, plain MR, etc.).

    See also an alternative approach to the problem, one that is less “Hadoopish”, although the authors indicate that follow-up work will investigate using HBase for storage and query.

    Both results show significant runtime overheads (as high as 76% for RAMP, if I recall correctly). That and the additional storage requirements must be reckoned with. Aside from community priority considerations, I think this explains why Hadoop does not already incorporate such a facility. Never mind the challenges of data reduction for reasonable query side performance (some organizations will surely want the ability to backtrace to any input tuple from any of its downstream effects), simply capturing provenance at scale is hard.
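As a toy illustration of the kind of tuple-level backtracing described above (very much in the spirit of, but far simpler than, RAMP-style provenance capture — none of this is RAMP’s actual API), here each output of an aggregation carries the ids of the input tuples it was derived from:

```python
# Hypothetical input tuples: id -> (region, amount)
inputs = {1: ("us", 10), 2: ("us", 5), 3: ("eu", 7)}

# A tiny "MapReduce job": sum amounts per region, tagging provenance
# alongside each output as we go.
totals = {}      # region -> total amount
provenance = {}  # region -> set of contributing input tuple ids
for tid, (region, amount) in inputs.items():
    totals[region] = totals.get(region, 0) + amount
    provenance.setdefault(region, set()).add(tid)

def backtrace(region):
    """Return the input tuple ids that contributed to one output."""
    return provenance[region]

print(totals)           # per-region totals
print(backtrace("us"))  # input ids behind the "us" output
```

Even in this toy, the overhead problem is visible: the `provenance` side structure can grow as large as the input itself, which is exactly the storage and runtime cost the comment above is pointing at.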

  7. Curt Monash on August 11th, 2012 12:16 pm

    Thanks! It’s clear that a logically complete provenance solution would impose huge overhead. If your further point is that the best known clever compromises still impose brutal overhead, I have no information to the contrary.

  8. What kinds of metadata are important anyway? « Another Word For It on August 11th, 2012 7:17 pm

    [...] What kinds of metadata are important anyway? by Curt Monash. [...]

  9. Shaun Connolly on August 15th, 2012 8:28 pm

    Hi Curt,

    Just catching up on this thread. Good post!

    If Hadoop is to become a proper “data platform”, it needs to be able to participate as a first-class citizen in broader metadata management and data lineage scenarios. Metadata management solutions need to span a wide range of data processing systems, including Hadoop.

    This is one reason why HCatalog is important. It provides a way to capture how the structure of data within Hadoop evolves as it works its way through the data pipeline (from initial raw structure to increasingly refined structures resulting from MapReduce, Pig, or Hive steps in the refining process).

    Before HCatalog, the only way to describe and store that metadata was to use Hive. But what if you want to use Pig or MapReduce (and not Hive)? Then what? Enter HCatalog, the general metadata service for Hadoop.

    HCatalog begins to open up Hadoop in a way where it can integrate with broader metadata management and data lineage solutions from such folks as Informatica and others. There’s definitely more work to do, but the foundation is being laid down, which is an important step.

