What kinds of metadata are important anyway?
In today’s post about HCatalog, I noted that the Hadoop/HCatalog community didn’t necessarily understand all the kinds of metadata that enterprises need and want, especially in the context of data integration and ETL and ELT (Extract/Transform/Load/Transform). That raises a natural question — what kinds of metadata do users need or want? In the hope of spurring discussion, from vendors and users alike, I’m splitting this question out into a separate post.
Please comment with your thoughts about ETL-related metadata needs. The conversation needs to advance.
In the relational world, there are at least three kinds of metadata:
- Definitional information about data structures, without which you can’t have a relational database at all. That area seems binary; either you have enough to make sense of your data or you don’t.
- Statistics about columns and tables, such as the most frequent values and how often they occur, which are kept for the purpose of optimization. Those seem to be nice-to-haves more than must-haves. The more information of this kind you have, the more chances you have to save resources.
- Historical and security information about data. This is where things get really complicated. It’s also where Hadoop is still in the “So what exactly should we build?” stage of design.
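To make the statistics bullet concrete, here is a minimal sketch of the kind of column statistics an optimizer might keep, and how they could feed a selectivity estimate. The function names, the `top_n` parameter, and the uniform-spread fallback are illustrative assumptions, not a description of how any particular DBMS or Hive does it.

```python
from collections import Counter


def column_stats(values, top_n=3):
    """Collect simple optimizer-style statistics for one column (hypothetical format)."""
    counts = Counter(values)
    return {
        "row_count": len(values),
        "distinct_count": len(counts),
        # Most frequent values and how often they occur.
        "most_common": counts.most_common(top_n),
    }


def estimate_selectivity(stats, value):
    """Estimate the fraction of rows where the column equals `value`."""
    if stats["row_count"] == 0:
        return 0.0
    for tracked_value, count in stats["most_common"]:
        if tracked_value == value:
            return count / stats["row_count"]
    # Assumption: rows not covered by the tracked values are spread
    # evenly over the remaining distinct values.
    tracked_rows = sum(count for _, count in stats["most_common"])
    remaining_distinct = stats["distinct_count"] - len(stats["most_common"])
    if remaining_distinct <= 0:
        return 0.0
    return (stats["row_count"] - tracked_rows) / remaining_distinct / stats["row_count"]
```

The more values are tracked, the less the estimate leans on the fallback assumption, which is one way to read "the more information of this kind you have, the more chances you have to save resources".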
As I see it:
- Historical information about data answers questions in the realm of “Who did what to which data when?”
- Security information about data answers questions around “Who may do what to which data in the future?”
- They overlap because:
  - They rely on closely related schemes for assessing roles and identity.
  - Audit trails, a key aspect of security and compliance, could logically be viewed as falling in the realm of “history”.
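The overlap can be sketched in a few lines: a permission check answers the forward-looking question, an audit entry answers the backward-looking one, and both hang off the same notion of identity and role. Every name here (`ROLES`, `PERMISSIONS`, `may`, `perform`, the sample users and table) is a hypothetical illustration, not an existing Hadoop or HCatalog interface.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Permission:
    """Security metadata: who MAY do what to which data."""
    role: str
    action: str
    target: str


@dataclass(frozen=True)
class AuditEvent:
    """Historical metadata: who DID what to which data, and when."""
    who: str
    action: str
    target: str
    when: datetime


# Both kinds of metadata rely on the same identity-to-role mapping (made-up data).
ROLES = {"alice": {"etl_admin"}, "bob": {"analyst"}}
PERMISSIONS = [
    Permission("etl_admin", "write", "sales.orders"),
    Permission("analyst", "read", "sales.orders"),
]


def may(who, action, target):
    """Return True if any of the user's roles grants the action on the target."""
    roles = ROLES.get(who, set())
    return any(
        p.role in roles and p.action == action and p.target == target
        for p in PERMISSIONS
    )


def perform(who, action, target, audit_log):
    """Check permission, then append an audit entry for the permitted action."""
    if not may(who, action, target):
        raise PermissionError(f"{who} may not {action} {target}")
    audit_log.append(AuditEvent(who, action, target, datetime.now(timezone.utc)))
```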
I surely don’t know all the issues that people care about in data lineage, provenance, and history, but here are some of the ones that come to mind.
- Who created this data structure? When?
- Who and what modified it? How? When?
- Where is the raw log that created this record, or associated transaction log file? What’s the metadata for it?
- What were all the inputs that were presupposed when this record was created? And by the way, could we please have all the same metadata about each of them?
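One way to picture those questions is as fields on a lineage record whose inputs are themselves lineage records, so that "the same metadata about each of them" falls out of the recursion. This is only a sketch under that assumption; the class and field names are invented for illustration and do not come from any existing catalog.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Modification:
    who: str   # person or account responsible
    what: str  # program or job that made the change
    how: str   # short description of the change
    when: str  # ISO-8601 timestamp


@dataclass
class Provenance:
    name: str
    created_by: str
    created_at: str
    raw_source: Optional[str] = None  # e.g. location of the raw or transaction log
    modifications: List[Modification] = field(default_factory=list)
    # Each input carries the same metadata, recursively.
    inputs: List["Provenance"] = field(default_factory=list)


def all_upstream(record: Provenance) -> Iterator[Provenance]:
    """Yield every direct and indirect input of `record`, depth-first."""
    for parent in record.inputs:
        yield parent
        yield from all_upstream(parent)
```

A three-step pipeline (raw log, cleaned table, report) would then be three nested records, and `all_upstream(report)` would walk back from the report to the raw log.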
For each of those questions there are two parts to an answer:
- Keeping the raw information.
- Making it available.
Keeping the raw information is tantamount to storing:
- Every previous state of the database (or information sufficient to recreate it)
- Every action that was taken to change it
for each of:
- The end database
- Any intermediate databases along the way
- Any associated administrative databases as well (e.g. those covering rules and permissions)
That actually isn’t crazy; it just boils down to keeping all your logs. Or, in cases where your logs are so big that it really is crazy, you can do some initial data reduction and then expect accurate history only from its conclusion onward. So the bigger problems lie, not in keeping the metadata, but rather in making it accessible. The two greatest theoretical challenges may be:
- Summarizing all that information so that it can be accessed quickly, without sacrificing (too much) accuracy.
- Figuring out a good data model for it. (The most natural model feels like a very multi-valued graph, or perhaps a hybrid between a graph and a NoSQL column store.)
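Both challenges can be illustrated with a toy lineage graph: edges carry arbitrary attributes (the "multi-valued" part), and a summary collapses record-level edges into dataset-level counts, trading detail for speed. This is a rough sketch under my own assumptions; the `LineageGraph` class, the `dataset:id` naming scheme, and the summarization rule are hypothetical.

```python
from collections import defaultdict


class LineageGraph:
    """Directed graph from each item to the items it was derived from."""

    def __init__(self):
        # downstream node -> list of (upstream node, attribute dict)
        self.edges = defaultdict(list)

    def add_edge(self, downstream, upstream, **attrs):
        self.edges[downstream].append((upstream, attrs))

    def backtrace(self, node):
        """Return the set of all direct and indirect upstream nodes."""
        seen = set()
        stack = [node]
        while stack:
            current = stack.pop()
            for upstream, _ in self.edges.get(current, []):
                if upstream not in seen:
                    seen.add(upstream)
                    stack.append(upstream)
        return seen


def summarize(graph, dataset_of):
    """Collapse record-level edges into dataset-level edge counts.

    The summary is smaller and faster to query, but it can no longer
    say which individual record fed which.
    """
    summary = defaultdict(int)
    for downstream, edge_list in graph.edges.items():
        for upstream, _ in edge_list:
            summary[(dataset_of(downstream), dataset_of(upstream))] += 1
    return dict(summary)
```

With nodes named like `report:1` or `raw:2`, passing `lambda n: n.split(":")[0]` as `dataset_of` yields one summarized edge per pair of datasets.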
The biggest practical problem may be:
- Figuring out a suitable format …
- … without being delayed by a conventional standards-setting process, …
- … or getting locked into some vendor-proprietary format …
- … or settling for something so simplistic it
I don’t know what the state of theory (or practice) is about all that — which is exactly why I’ve written this post.
One last point — there’s a whole other level of metadata I’m ignoring in this post, because:
- I think it’s too new for its importance, or lack thereof, to be judged accurately.
- It seems straightforward to add to the rest, without imposing new burdens on design or performance.
Comments
9 Responses to “What kinds of metadata are important anyway?”
A *different* kind of meta-data – “what was the context around the creation of this particular database/solution/whatever??”
This may not seem like much, but if there is any kind of organizational churn, you – the inheritor – inevitably end up staring at some aspect of the system with a puzzled look on your face, wondering why, exactly, those two fields are linked (or not linked; you get the point).
Even worse, you end up with the “there must be a reason, leave it alone” approach – i.e., the database rot spreads…
Good point, Mahesh! But it fits with the stuff in my last paragraph, in that it doesn’t need to be technically well-integrated with the other stuff.
Regarding statistics: I think that stats are important for optimization, meaning that they are more than just “nice to have”. And I also think that stats shouldn’t be limited to base table and column stats. It’s getting more and more important to have stats on expressions that are reliable (the independence assumption is often not good) and accurate.
Well, the design philosophy of Hadoop — and especially of Hive — seems to be that MPP performance is a nice-to-have, not a must-have. 😉
As Hadoop progresses through the hype cycle, at some point there will be a backlash. This lack of metadata could be the cause. Without proper metadata management or governance, enterprises that aren’t careful could end up making decisions based on secondary data sets without any idea how they were generated or what the original source was. Despite that, Hadoop is a very powerful tool when used appropriately.
Speaking of HCatalog (which provides InputFormat and OutputFormat level plugins) and provenance, something like Stanford’s RAMP, which is also implemented with Hadoop’s interfaces, would be a natural extension: http://ilpubs.stanford.edu:8090/995/, at least for any workflow rendered as MapReduce jobs (Pig, Hive, plain MR, etc.).
See also http://dl.acm.org/citation.cfm?id=2110501, an alternative approach to the problem but less “Hadoopish”, although the authors indicate follow up work will investigate using HBase for storage and query.
Both results show significant runtime overheads (as high as 76% for RAMP, if I recall correctly). That and the additional storage requirements must be reckoned with. Aside from community priority considerations, I think this explains why Hadoop does not already incorporate such a facility. Never mind the challenges of data reduction for reasonable query side performance (some organizations will surely want the ability to backtrace to any input tuple from any of its downstream effects), simply capturing provenance at scale is hard.
Andrew,
Thanks! It’s clear that a logically complete provenance solution would impose huge overhead. If your further point is that the best known clever compromises still impose brutal overhead, I have no information to the contrary.
[…] What kinds of metadata are important anyway? by Curt Monash. […]
Hi Curt,
Just catching up on this thread. Good post!
If Hadoop is to become a proper “data platform” it needs to be able to participate as a first class citizen in broader metadata management and data lineage scenarios. Metadata management solutions need to span a wide range of data processing systems…including Hadoop.
This is one reason why HCatalog is important. It provides a way to capture how the structure of data within Hadoop evolves as it works its way through the data pipeline (from initial raw structure to increasingly refined structures resulting from MapReduce, Pig, or Hive steps in the refining process).
Before HCatalog, the only way to describe and store that metadata was to use Hive….but what if you want to use Pig or MapReduce (and not Hive)…then what? Enter HCatalog…the general Metadata Service for Hadoop.
HCatalog begins to open up Hadoop in a way where it can integrate with broader metadata management and data lineage solutions from such folks as Informatica and others. There’s definitely more work to do, but the foundation is being laid down, which is an important step.