Ever more products try to integrate SQL with Hadoop, and discussions of them seem confused, in line with Monash’s First Law of Commercial Semantics. So let’s draw some distinctions, starting with (and these overlap):
- Are the SQL engine and Hadoop:
- Necessarily on the same cluster?
- Necessarily or at least most naturally on different clusters?
- How, if at all, is Hadoop invoked by the SQL engine? Specifically, what is the role of:
- HDFS (Hadoop Distributed File System)?
- Hadoop MapReduce?
- How, if at all, is the SQL engine invoked by Hadoop?
- If something is called a “connector”, then Hadoop and the SQL engine are most likely on separate clusters. Good features include (but these can partially contradict each other):
- A way of making data transfer maximally parallel.
- Query planning that is smart about when to process on the SQL engine and when to use Hadoop’s native SQL (Hive or otherwise).
- If something is called “SQL-on-Hadoop”, then Hadoop and the SQL engine are or should be on the same cluster, using the same nodes to store and process data. But while that’s a necessary condition, I’d prefer that it not be sufficient.
Let’s go to some examples.
Hive is the closest example of SQL/Hadoop integration known. Hive executes a somewhat low-grade dialect of SQL — HQL (Hive Query Language) — via very standard Hadoop: Hadoop MapReduce, all HDFS file formats, etc. HCatalog is an enhancement/replacement for the Hive metadata store. HQL is just another language that can be used to write (parts of) Hadoop jobs.
Impala is Cloudera’s replacement for Hive. Impala is and/or is planned to be much like Hive, but much better, for example in performance and in SQL functionality. Impala has its own custom execution engine, including a daemon on every Hadoop data node, and seems to run against a variety of but not all HDFS file formats.
Stinger is Hortonworks’ (and presumably also Apache’s) answer to Impala, but is more of a Hive upgrade than an outright replacement. In particular, Stinger’s answer to the new Impala engine is a port of Hive to the new engine Tez.
Teradata SQL-H is an RDBMS-Hadoop connector that uses HCatalog, and plans queries across the two clusters. Microsoft Polybase is like SQL-H, but it seems more willing than Teradata or Teradata Aster to (optionally) coexist on the same nodes as Hadoop.
Hadapt runs on the Hadoop cluster, putting PostgreSQL* and other software on each Hadoop data node. It has two query engines, one that invokes Hadoop MapReduce (the original one, still best for longer-running queries) and one that doesn’t (more analogous to Impala). When last I looked, Hadapt didn’t query or update against the HDFS API, but there was an interesting future in preloading data from HDFS into Hadapt PostgreSQL tables, and I think that Hadapt’s PostgreSQL tables are technically HDFS files. I don’t think Hadapt makes much use of HCatalog.
*Hacked to allow Hadapt to offer more than just SQL/Hadoop integration.
Splice Machine is a new entrant (public beta is imminent) that has put Apache Derby over an HBase back end. (Apache Derby is the former Cloudscape, an embeddable Java RDBMS that was acquired by Informix and hence later by IBM.) Splice Machine runs on your Hadoop nodes as an HBase coprocessor. Its relationship to non-HBase parts of Hadoop is arm’s-length. I wish this weren’t called “SQL-on-Hadoop”.
- Dan Abadi and Dave Dewitt opined last June about how to categorize Hadapt and Polybase.
- My most detailed discussions of Impala and Stinger were last June and August, respectively.