In today’s post about HCatalog, I noted that the Hadoop/HCatalog community didn’t necessarily understand all the kinds of metadata that enterprises need and want, especially in the context of data integration and ETL and ELT (Extract/Transform/Load and Extract/Load/Transform). That raises a natural question: what kinds of metadata do users need or want? In the hope of spurring discussion, from vendors and users alike, I’m splitting this question out into a separate post.
Please comment with your thoughts about ETL-related metadata needs. The conversation needs to advance.
In the relational world, there are at least three kinds of metadata:
- Definitional information about data structures, without which you can’t have a relational database at all. That area seems binary; either you have enough to make sense of your data or you don’t.
- Statistics about columns and tables, such as the most frequent values and how often they occur, which are kept for the purpose of optimization. Those seem to be nice-to-haves more than must-haves. The more information of this kind you have, the more chances you have to save resources.
- Historical and security information about data. This is where things get really complicated. It’s also where Hadoop is still in the “So what exactly should we build?” stage of design.
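To make the distinction concrete, here’s a minimal Python sketch of the three kinds of metadata for a single table. All the names are hypothetical; this is an illustration, not any particular catalog’s schema.

```python
from dataclasses import dataclass, field

# 1. Definitional metadata -- without this you can't interpret the data at all.
@dataclass
class ColumnDef:
    name: str
    data_type: str         # e.g. "INTEGER", "VARCHAR(50)"
    nullable: bool = True

# 2. Statistical metadata -- nice-to-have; the more you keep, the more an
#    optimizer can save resources.
@dataclass
class ColumnStats:
    distinct_count: int
    most_frequent: list    # e.g. [("NY", 120000), ("CA", 95000)]

# 3. Historical/security metadata -- the part still being designed.
@dataclass
class AuditEntry:
    who: str
    action: str            # e.g. "CREATE", "ALTER", "GRANT"
    when: str              # ISO-8601 timestamp

@dataclass
class TableMetadata:
    name: str
    columns: list                                   # list of ColumnDef
    stats: dict = field(default_factory=dict)       # column name -> ColumnStats
    history: list = field(default_factory=list)     # list of AuditEntry
```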
As I see it:
- Historical information about data answers questions in the realm of “Who did what to which data when?”
- Security information about data answers questions around “Who may do what to which data in the future?”
- They overlap because:
- They rely on closely related schemes for assessing roles and identity.
- Audit trails, a key aspect of security and compliance, could logically be viewed as falling in the realm of “history”.
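Here’s a tiny sketch of that overlap, with all names hypothetical: a single identity/role scheme drives both the permission check (security) and the audit trail (history), and the audit trail records the permission checks themselves.

```python
from datetime import datetime, timezone

# One identity/role scheme serves both history and security.
ROLES = {"alice": {"analyst"}, "bob": {"admin"}}
GRANTS = {("analyst", "read"), ("admin", "read"), ("admin", "write")}

audit_trail = []  # the "history" side: who did what to which data when

def authorize(user, action, dataset):
    """Security: may this user do this in the future? Every decision is
    also appended to the audit trail, which is itself part of history."""
    allowed = any((role, action) in GRANTS for role in ROLES.get(user, ()))
    audit_trail.append({
        "who": user, "what": action, "which": dataset,
        "when": datetime.now(timezone.utc).isoformat(),
        "allowed": allowed,
    })
    return allowed

authorize("alice", "write", "sales")  # denied -- but still recorded
```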
I surely don’t know all the issues that people care about in data lineage, provenance, and history, but here are some of the ones that come to mind.
- Who created this data structure? When?
- Who and what modified it? How? When?
- Where is the raw log file or transaction log file from which this record was created? What’s the metadata for it?
- What were all the inputs that went into creating this record? And by the way, could we please have all the same metadata about each of them?
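To show what answering those questions might require per record, here’s a hypothetical sketch. Note the recursion implied by the last question: each input should carry the same metadata structure as the record it feeds.

```python
from dataclasses import dataclass, field

@dataclass
class Lineage:
    created_by: str       # who created this data structure?
    created_at: str       # when?
    modifications: list = field(default_factory=list)  # who/what/how/when
    source_log: str = ""  # raw or transaction log this came from
    inputs: list = field(default_factory=list)         # recursive: more Lineage

raw = Lineage(created_by="web_server",
              created_at="2012-01-15T00:00:00Z",
              source_log="/logs/clickstream/2012-01-15.log")
record = Lineage(created_by="etl_job_42",
                 created_at="2012-01-16T03:00:00Z",
                 inputs=[raw])  # and raw's inputs, and so on up the chain
```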
For each of those questions there are two parts to an answer:
- Keeping the raw information.
- Making it available.
Keeping the raw information is tantamount to storing:
- Every previous state of the database (or enough information to recreate it)
- Every action that was taken to change it
- The end database
- Any intermediate databases along the way
- Any associated administrative databases (e.g. those covering rules and permissions)
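Put differently, this is essentially event sourcing: if you keep every action, every previous state is recoverable by replay. A minimal sketch, with made-up operations:

```python
# Event-sourcing sketch: keep every action; recreate any prior state by replay.
log = [
    {"op": "insert", "key": "cust_1", "value": {"name": "Acme"}},
    {"op": "update", "key": "cust_1", "value": {"name": "Acme Corp"}},
    {"op": "delete", "key": "cust_1", "value": None},
]

def state_as_of(log, n):
    """Recreate the database after the first n logged actions."""
    db = {}
    for entry in log[:n]:
        if entry["op"] == "delete":
            db.pop(entry["key"], None)
        else:
            db[entry["key"]] = entry["value"]
    return db

assert state_as_of(log, 2) == {"cust_1": {"name": "Acme Corp"}}
assert state_as_of(log, 3) == {}  # the end database
```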
That actually isn’t crazy; it just boils down to keeping all your logs. Or, in cases where your logs are so big that keeping them all really is crazy, you can do some initial data reduction, and then expect accurate history only from the point of that reduction onward. So the bigger problems lie not in keeping the metadata, but in making it accessible. The two greatest theoretical challenges may be:
- Summarizing all that information so that it can be accessed quickly, without sacrificing (too much) accuracy.
- Figuring out a good data model for it. (The most natural model feels like a very multi-valued graph, or perhaps a hybrid between a graph and a NoSQL column store.)
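On the data-model point, here is one way such a graph might look in miniature (hypothetical throughout): nodes are datasets, edges are derivations, and both carry open-ended attribute maps, which is where the column-store flavor comes in. Multiple edges between the same pair of nodes, e.g. one per ETL run, are what make it “multi-valued.”

```python
# Multi-valued lineage graph sketch. Assumes the derivation graph is acyclic.
nodes = {
    "raw_clicks": {"owner": "web_team", "format": "log"},
    "sessions":   {"owner": "etl_team", "format": "table"},
}
edges = [
    # (source, target, attributes) -- multiple edges per node pair allowed
    ("raw_clicks", "sessions", {"job": "sessionize_v3", "run": "2012-01-16"}),
    ("raw_clicks", "sessions", {"job": "sessionize_v3", "run": "2012-01-17"}),
]

def upstream(dataset):
    """All datasets this one was derived from, transitively."""
    direct = {src for src, dst, _ in edges if dst == dataset}
    return direct | {u for d in direct for u in upstream(d)}

print(upstream("sessions"))  # {'raw_clicks'}
```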
The biggest practical problem may be:
- Figuring out a suitable format …
- … without being delayed by a conventional standards-setting process, …
- … or getting locked into some vendor-proprietary format …
- … or settling for something so simplistic it doesn’t do the job.
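For instance, and purely as an illustration rather than a proposed standard, a small self-describing record might hit a middle ground between simplistic and over-engineered:

```python
import json

# Purely illustrative lineage interchange record -- not any real standard.
record = {
    "format_version": "0.1",
    "dataset": "sessions",
    "produced_by": {"job": "sessionize_v3", "user": "etl_team"},
    "produced_at": "2012-01-16T03:00:00Z",
    "inputs": [{"dataset": "raw_clicks", "as_of": "2012-01-15T23:59:59Z"}],
    # An open-ended extension area keeps the format from being too
    # simplistic without requiring a heavyweight standards process up front.
    "extensions": {},
}
print(json.dumps(record, indent=2))
```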
I don’t know what the state of theory (or practice) is about all that — which is exactly why I’ve written this post.
One last point — there’s a whole other level of metadata I’m ignoring in this post, because:
- I think it’s too new to judge its importance, or lack thereof, accurately.
- It seems straightforward to add to the rest, without imposing new burdens on design or performance.