To a first approximation, HCatalog is the thing that will make Hadoop play nicely with all kinds of ETLT (Extract/Transform/Load/Transform). However, HCatalog is both less and more than that:
- Less, because some of the basic features people expect in HCatalog may well wind up in other projects instead.
- More, in that HCatalog could (also) be regarded as a manifestation of efforts to make Hadoop/Hive more like a DBMS.
The base use case for HCatalog is:
- You have an HDFS file, and know its record format.
- You write DDL (Data Definition Language) to create a virtual table whose columns look a lot like the HDFS file’s record fields. The language for this is HiveQL, Hive’s SQL-like language.
- You interact with that table in any of a variety of ways (batch operations only), including but not limited to Hive queries, Pig scripts, and MapReduce jobs.
Major variants on that include:
- The file may not exist until you write the DDL.
- Conversely, the file may exist in binary format, and you may need to filter it before you have enough information to proceed.
- The file can be partitioned and so on.
- The whole thing can be based on a LOAD from some external structure.
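A minimal sketch of the DDL step above, assuming a tab-delimited HDFS file of web-log records (the table name, column names, and path are hypothetical):

```sql
-- Hypothetical example: expose an existing tab-delimited HDFS file
-- as a partitioned virtual table. EXTERNAL means Hive/HCatalog does
-- not take ownership of the underlying file.
CREATE EXTERNAL TABLE web_logs (
  ip     STRING,
  ts     TIMESTAMP,
  url    STRING,
  bytes  BIGINT
)
PARTITIONED BY (log_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/web_logs';
```

Because the table is EXTERNAL, dropping it later removes only the metadata, not the file, which is usually what you want when the data predates the DDL.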
I gather that most of the above is shipping today, and the rest is coming along nicely.
A key point is that you can change the file format, remap it to the virtual tables, and have your applications run unaltered. This is part of what I meant by making Hadoop “more like a DBMS.”
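As a sketch of what that remapping looks like, assuming the hypothetical web_logs table from before migrates from delimited text to RCFile, one of Hive's binary columnar formats:

```sql
-- Remap the (hypothetical) web_logs table to a new storage format;
-- jobs that read the table through HCatalog keep working unchanged,
-- since they see columns rather than the raw file layout.
ALTER TABLE web_logs SET FILEFORMAT RCFILE;
```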
As Informatica in particular has pointed out, more metadata is needed in at least two areas:
- Statistics about the data — data profiling and so on.
- History of the data — notably, the stuff you need to meet compliance requirements.
The statistics lack has a clear path to being fixed, in that:
- The HCatalog team has gotten the message that whatever statistics are available should be surfaced in HCatalog.
- The Hive team, for reasons of query optimization, is working to capture a lot of statistics.*
The ETL and Hadoop communities need to talk more than they already have, but basically things seem to be on track.
*That’s another part of what I mean by saying that Hadoop/Hive are becoming more like a DBMS.
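Concretely, Hive already has a statement for gathering such statistics, which is the sort of thing HCatalog could surface; a sketch, with table and partition names hypothetical:

```sql
-- Gather partition-level statistics (row counts, file sizes, etc.)
-- that the query optimizer, and eventually HCatalog consumers,
-- can read back out of the metastore.
ANALYZE TABLE web_logs PARTITION (log_date='2012-01-15') COMPUTE STATISTICS;
```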
The history lack is a different matter. Information could be surfaced from Hadoop about many kinds of artifacts:
- Data and files
- Storage and blocks
for use in many kinds of time frames:
- Real-time operation (automated)
- Human real-time
If the data integration and compliance communities want their needs met any time soon, they may need to step up with some resources, energy, and even leadership.
In that vein, please see my companion post.
One last note — at this time, the integration of HBase into all this is less than one might think. Under development is the ability to let you make HCatalog “tables” based on HBase tables, as an alternative to basing them on HDFS files. But a manual process will be involved. And of course you could run into problems if your HBase tables have multi-valued fields.
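As a hedged sketch of what that HBase-backed route looks like in Hive today (all names hypothetical; the column mapping is the manual part):

```sql
-- Map a Hive/HCatalog table onto an existing HBase table.
-- The hbase.columns.mapping property manually pairs each Hive column
-- with an HBase column family/qualifier; this is where multi-valued
-- fields get awkward.
CREATE EXTERNAL TABLE hbase_users (
  user_id  STRING,
  name     STRING,
  visits   BIGINT
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,info:name,stats:visits')
TBLPROPERTIES ('hbase.table.name' = 'users');
```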