HCatalog — yes, it matters
To a first approximation, HCatalog is the thing that will make Hadoop play nicely with all kinds of ETLT (Extract/Transform/Load/Transform). However, HCatalog is both less and more than that:
- Less, because some of the basic features people expect in HCatalog may well wind up in other projects instead.
- More, in that HCatalog could (also) be regarded as a manifestation of efforts to make Hadoop/Hive more like a DBMS.
The base use case for HCatalog is:
- You have an HDFS file, and know its record format.
- You write DDL (Data Definition Language) to create a virtual table whose columns look a lot like the HDFS file’s record fields. The language for this is HiveQL, Hive’s SQL-like language.
- You interact with that table in any of a variety of ways (batch operations only), including but not limited to:
- Hive.
- Pig.
- JDBC/ODBC.
- ETL tools such as Talend.
- DBMS integrations such as Teradata Aster’s SQL-H.
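That base case can be sketched in HiveQL. The table name, columns, and HDFS path below are all hypothetical, and the file is assumed to be tab-delimited text that already exists:

```sql
-- Map an existing tab-delimited HDFS file to a virtual table.
CREATE EXTERNAL TABLE page_views (
  user_id   BIGINT,
  url       STRING,
  view_time STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/page_views';
```

Once that DDL is in place, Hive queries, Pig scripts (via HCatalog’s HCatLoader), and the other tools listed above can all work from the same table definition.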
Major variants on that include:
- The file may not exist until you write the DDL.
- Conversely, the file may exist in binary format, and you may need to filter it before you have enough information to proceed.
- The file can be partitioned and so on.
- The whole thing can be based on a LOAD from some external structure.
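Two of those variants — partitioning and LOAD — look like this in HiveQL, again with hypothetical names:

```sql
-- A table partitioned by date; each partition maps to its own HDFS directory.
CREATE TABLE page_views_by_day (
  user_id BIGINT,
  url     STRING
)
PARTITIONED BY (view_date STRING)
STORED AS TEXTFILE;

-- Move an already-staged HDFS file into a specific partition.
LOAD DATA INPATH '/staging/2012-08-08.tsv'
INTO TABLE page_views_by_day
PARTITION (view_date = '2012-08-08');
```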
I gather that most of the above is shipping today, and the rest is coming along nicely.
A key point is that you can change the file format, remap it to the virtual tables, and have your applications run unaltered. This is part of what I meant by making Hadoop “more like a DBMS.”
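Hive’s DDL already allows that kind of remapping. A minimal sketch, assuming a hypothetical partitioned table named page_views_by_day:

```sql
-- Partitions created from now on will be written as RCFile;
-- existing text-format partitions keep their old format and
-- remain readable, so queries run unaltered.
ALTER TABLE page_views_by_day SET FILEFORMAT RCFILE;
```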
As Informatica in particular has pointed out, more metadata is needed in at least two areas:
- Statistics about the data — data profiling and so on.
- History of the data — notably, the stuff you need to meet compliance requirements.
The statistics lack has a clear path to being fixed, in that:
- The HCatalog team has gotten the message that whatever statistics are available should be surfaced in HCatalog.
- The Hive team, for reasons of query optimization, is working to capture a lot of statistics.*
The ETL and Hadoop communities need to talk more than they already have, but basically things seem to be on track.
*That’s another part of what I mean by saying that Hadoop/Hive are becoming more like a DBMS.
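The explicit way to capture such statistics in Hive today is the ANALYZE TABLE command; a sketch, with the table and partition names being hypothetical:

```sql
-- Gather row counts, file counts, and raw data sizes for one partition.
ANALYZE TABLE page_views_by_day
PARTITION (view_date = '2012-08-08')
COMPUTE STATISTICS;
```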
The history lack is a different matter. Information could be surfaced from Hadoop about many kinds of artifacts:
- Data and files
- Storage and blocks
- Nodes
- Processes
- Etc.
for use in many kinds of time frame:
- Real-time operation (automated)
- Human real-time
- Batch
If the data integration and compliance communities want their needs met any time soon, they may need to step up with some resources, energy, and even leadership.
In that vein, please see my companion post.
One last note — at this time, the integration of HBase into all this is less than one might expect. Under development is the ability to create HCatalog “tables” based on HBase tables, as an alternative to basing them on HDFS files. But a manual process will be involved. And of course you could run into problems if your HBase tables have multi-valued fields.
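The manual process will presumably resemble Hive’s existing HBase storage handler, where you spell out the column mapping yourself. A hedged sketch — table names, column family, and qualifiers here are all hypothetical:

```sql
-- Each Hive column is hand-mapped to an HBase column family:qualifier;
-- ':key' maps to the HBase row key.
CREATE EXTERNAL TABLE hbase_page_views (
  row_key STRING,
  url     STRING,
  views   BIGINT
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,stats:url,stats:views')
TBLPROPERTIES ('hbase.table.name' = 'page_views');
```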
Comments
One interesting trend is NoSQL offerings that now include SQL interfaces. Hadoop and Cassandra do this. I attended a talk at the recent Cassandra Summit 2012 where the main developer of CQL (Cassandra’s SQL interface) discussed the motivations for using SQL on Cassandra. In a nutshell, he said SQL (including language bindings/APIs) is robust, well thought-out, and does what a lot of people want, so why reinvent something else?
The overriding theme is that this is another example of the various “efforts to make Hadoop/Hive more like a DBMS”, which makes a lot of sense. Long may it continue.
So, as of now, is it possible to integrate HCatalog with HBase?
Do IBM, Oracle, and Hadapt offer HCatalog integration?