August 8, 2012

HCatalog — yes, it matters

To a first approximation, HCatalog is the thing that will make Hadoop play nicely with all kinds of ETLT (Extract/Transform/Load/Transform). However, HCatalog is both less and more than that:

- Less, because some of the basic features people expect in HCatalog may well wind up in other projects instead.
- More, in that HCatalog could (also) be regarded as a manifestation of efforts to make Hadoop/Hive more like a DBMS.

The base use case for HCatalog is:

- You have an HDFS file, and know its record format.
- You write DDL (Data Definition Language) to create a virtual table whose columns look a lot like the HDFS file’s record fields. The language for this is HiveQL, Hive’s SQL-like language.
- You interact with that table in any of a variety of ways (batch operations only), including but not limited to:
  - Hive.
  - Pig.
  - JDBC/ODBC.
  - ETL tools such as Talend.
  - DBMS integrations such as Teradata Aster’s SQL-H.
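
To make the DDL step concrete, here is a minimal sketch in Python that assembles a HiveQL `CREATE EXTERNAL TABLE` statement from a known record format. The table name `weblogs`, its three fields, and the path `/data/weblogs` are hypothetical, and the sketch assumes a tab-delimited text file; a real file might need a different row format or storage clause.

```python
from typing import List, Tuple


def external_table_ddl(
    table: str,
    fields: List[Tuple[str, str]],
    location: str,
    delimiter: str = "\\t",
) -> str:
    """Build a HiveQL CREATE EXTERNAL TABLE statement for a delimited text file.

    `fields` is a list of (column name, Hive type) pairs that mirror the
    file's record fields. `delimiter` is written into the DDL as-is, so the
    default is the two-character escape \\t that HiveQL reads as a tab.
    """
    columns = ",\n".join(f"  {name} {hive_type}" for name, hive_type in fields)
    return (
        f"CREATE EXTERNAL TABLE {table} (\n"
        f"{columns}\n"
        f")\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{location}';"
    )


# Hypothetical record format, for illustration only.
WEBLOG_FIELDS = [("ts", "STRING"), ("user_id", "BIGINT"), ("url", "STRING")]

ddl = external_table_ddl("weblogs", WEBLOG_FIELDS, "/data/weblogs")
print(ddl)
```

The point of `EXTERNAL` plus `LOCATION` is that the table is only a description laid over a file that already exists; dropping the table would leave the file in place.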

Major variants on that include:

- The file may not exist until you write the DDL.
- Conversely, the file may exist in binary format, and you may need to filter it before you have enough information to proceed.
- The file can be partitioned and so on.
- The whole thing can be based on a LOAD from some external structure.

I gather that most of the above is shipping today, and the rest is coming along nicely.

A key point is that you can change the file format, remap it to the virtual tables, and have your applications run unaltered. This is part of what I meant by making Hadoop “more like a DBMS.”
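
The indirection behind that claim can be illustrated with a toy model. The sketch below is not HCatalog's API; `ToyCatalog`, `read_tsv`, `read_json_lines`, and `hits_per_user` are all hypothetical names, and the data is held in strings rather than HDFS files. It only shows the idea: the application addresses a table by name, so the catalog entry can be remapped to a different file format without touching the application.

```python
import csv
import io
import json
from typing import Callable, Dict, Iterator, List, Tuple

Row = Dict[str, str]
Reader = Callable[[str, List[str]], Iterator[Row]]


def read_tsv(text: str, columns: List[str]) -> Iterator[Row]:
    """Parse tab-separated records into rows keyed by column name."""
    for values in csv.reader(io.StringIO(text), delimiter="\t"):
        yield dict(zip(columns, values))


def read_json_lines(text: str, columns: List[str]) -> Iterator[Row]:
    """Parse one JSON object per line into rows keyed by column name."""
    for line in text.splitlines():
        if line.strip():
            record = json.loads(line)
            yield {column: str(record[column]) for column in columns}


class ToyCatalog:
    """Maps a table name to (columns, reader, data). Purely illustrative."""

    def __init__(self) -> None:
        self._tables: Dict[str, Tuple[List[str], Reader, str]] = {}

    def register(self, name: str, columns: List[str], reader: Reader, data: str) -> None:
        self._tables[name] = (columns, reader, data)

    def scan(self, name: str) -> Iterator[Row]:
        columns, reader, data = self._tables[name]
        return reader(data, columns)


def hits_per_user(catalog: ToyCatalog, table: str) -> Dict[str, int]:
    """The 'application': it knows the table name and columns, not the format."""
    counts: Dict[str, int] = {}
    for row in catalog.scan(table):
        counts[row["user_id"]] = counts.get(row["user_id"], 0) + 1
    return counts


catalog = ToyCatalog()
columns = ["user_id", "url"]

# First mapping: the table sits on tab-separated text.
catalog.register("weblogs", columns, read_tsv, "1\t/a\n2\t/b\n1\t/c\n")
before = hits_per_user(catalog, "weblogs")

# Remap the same table name to JSON lines; the application code is unchanged.
json_data = '{"user_id": 1, "url": "/a"}\n{"user_id": 2, "url": "/b"}\n{"user_id": 1, "url": "/c"}\n'
catalog.register("weblogs", columns, read_json_lines, json_data)
after = hits_per_user(catalog, "weblogs")

print(before == after)  # True
```

Only the `register` call changes between the two runs, which is the property a DBMS user takes for granted.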

As Informatica in particular has pointed out, more metadata is needed in at least two areas:

- Statistics about the data (data profiling and so on).
- History of the data, notably the stuff you need to meet compliance requirements.

The lack of statistics has a clear path to being fixed, in that:

- The HCatalog team has gotten the message that whatever statistics are available should be surfaced in HCatalog.
- The Hive team, for reasons of query optimization, is working to capture a lot of statistics.*

The ETL and Hadoop communities need to talk more than they already have, but basically things seem to be on track.

*That’s another part of what I mean by saying that Hadoop/Hive are becoming more like a DBMS.

The lack of history is a different matter. Information could be surfaced from Hadoop about many kinds of artifacts:

- Data and files
- Storage and blocks
- Nodes
- Processes
- Etc.

for use in many kinds of time frames:

- Real-time operation (automated)
- Human real-time
- Batch

If the data integration and compliance communities want their needs met any time soon, they may need to step up with some resources, energy, and even leadership.

In that vein, please see my companion post.

One last note — at this time, the integration of HBase into all this is less than one might think. Under development is the ability to let you make HCatalog “tables” based on HBase tables, as an alternative to basing them on HDFS files. But a manual process will be involved. And of course you could run into problems if your HBase tables have multi-valued fields.

Categories: Hadoop, HBase, Hortonworks, Informatica


Comments

12 Responses to “HCatalog — yes, it matters”

What kinds of metadata are important anyway? | DBMS 2 : DataBase Management System Services on August 8th, 2012 7:27 am

[…] today’s post about HCatalog, I noted that the Hadoop/HCatalog community didn’t necessarily understand all the kinds of […]
Hcatalog – making Hadoop and Hive more like a database? « Olipa kerran Bigdata on August 9th, 2012 2:04 am

[…] Hcatalog – making Hadoop and Hive more like a database? […]
Robert Hodges on August 10th, 2012 2:50 pm

One interesting trend is NoSQL offerings that now include SQL interfaces. Hadoop and Cassandra do this. I attended a talk at the recent Cassandra Summit 2012 where the main developer of CQL (Cassandra’s SQL interface) discussed the motivations for using SQL on Cassandra. In a nutshell he said SQL (including language bindings/APIs) is robust, well thought-out, and does what a lot of people want, so why reinvent something else?
What kinds of metadata are important anyway? « Another Word For It on August 11th, 2012 7:18 pm

[…] the post: In today’s post about HCatalog, I noted that the Hadoop/HCatalog community didn’t necessarily understand all the kinds of […]
Paul Johnson on August 13th, 2012 5:02 am

The over-riding theme is that this is another example of the various “efforts to make Hadoop/Hive more like a DBMS”, which makes a lot of sense. Long may it continue.
shash on January 9th, 2013 1:46 am

So, as of now, is it possible to integrate Hcatalog with Hbase?
Hadoop distributions | DBMS 2 : DataBase Management System Services on February 27th, 2013 6:41 am

[…] Hortonworks is still focused on Hadoop 1 (without YARN and so on), because that’s what’s regarded as production-ready. But Hortonworks does like HCatalog. […]
Hortonworks, Hadoop, Stinger and Hive | DBMS 2 : DataBase Management System Services on August 7th, 2013 2:53 am

[…] of the problem being slowness in the metadata store. (I hope that that’s already improved in HCatalog, but I didn’t think to ask.) Hortonworks thinks 100 milliseconds would be a better […]
Hem on October 30th, 2013 3:42 am

Do IBM, Oracle and Hadapt offer HCatalog integration?
Distinctions in SQL/Hadoop integration | DBMS 2 : DataBase Management System Services on February 9th, 2014 1:50 pm

[…] HCatalog? […]
Spark on fire | DBMS 2 : DataBase Management System Services on April 30th, 2014 6:41 am

[…] Spark works with the Hive metastore, née HCatalog. […]
Notes on HCatalog, Hive and Scalable Science | :: NickBurns on July 2nd, 2015 12:09 am

[…] “One interesting trend is NoSQL offerings that now include SQL interfaces. Hadoop and Cassandra do this. I attended a talk at the recent Cassandra Summit 2012 where the main developer of CQL (Cassandra’s SQL interface) discussed the motivations for using SQL on Cassandra. In a nutshell he said SQL (including language bindings/APIs) is robust, well thought-out, and does what a lot of people want, so why reinvent something else?” Robert Hodges (2012). Source: http://www.dbms2.com/2012/08/08/hcatalog-yes-it-matters/ […]
