February 9, 2014

Distinctions in SQL/Hadoop integration

Ever more products try to integrate SQL with Hadoop, and discussions of them seem confused, in line with Monash’s First Law of Commercial Semantics. So let’s draw some distinctions, starting with (and these overlap):

Are the SQL engine and Hadoop:
- Necessarily on the same cluster?
- Necessarily or at least most naturally on different clusters?
How, if at all, is Hadoop invoked by the SQL engine? Specifically, what is the role of:
- HDFS (Hadoop Distributed File System)?
- Hadoop MapReduce?
- HCatalog?
How, if at all, is the SQL engine invoked by Hadoop?

In particular:

If something is called a “connector”, then Hadoop and the SQL engine are most likely on separate clusters. Good features include (but these can partially contradict each other):
- A way of making data transfer maximally parallel.
- Query planning that is smart about when to process on the SQL engine and when to use Hadoop’s native SQL (Hive or otherwise).
If something is called “SQL-on-Hadoop”, then Hadoop and the SQL engine are or should be on the same cluster, using the same nodes to store and process data. But while that’s a necessary condition, I’d prefer that it not be sufficient.

Let’s go to some examples.

Hive is the closest example of SQL/Hadoop integration known. Hive executes a somewhat low-grade dialect of SQL — HQL (Hive Query Language) — via very standard Hadoop: Hadoop MapReduce, all HDFS file formats, etc. HCatalog is an enhancement/replacement for the Hive metadata store. HQL is just another language that can be used to write (parts of) Hadoop jobs.

Impala is Cloudera’s replacement for Hive. Impala is and/or is planned to be much like Hive, but much better, for example in performance and in SQL functionality. Impala has its own custom execution engine, including a daemon on every Hadoop data node, and seems to run against a variety of but not all HDFS file formats.

Stinger is Hortonworks’ (and presumably also Apache’s) answer to Impala, but is more of a Hive upgrade than an outright replacement. In particular, Stinger’s answer to the new Impala engine is a port of Hive to the new engine Tez.

Teradata SQL-H is an RDBMS-Hadoop connector that uses HCatalog, and plans queries across the two clusters. Microsoft Polybase is like SQL-H, but it seems more willing than Teradata or Teradata Aster to (optionally) coexist on the same nodes as Hadoop.

Hadapt runs on the Hadoop cluster, putting PostgreSQL* and other software on each Hadoop data node. It has two query engines, one that invokes Hadoop MapReduce (the original one, still best for longer-running queries) and one that doesn’t (more analogous to Impala). When last I looked, Hadapt didn’t query or update against the HDFS API, but there was an interesting future in preloading data from HDFS into Hadapt PostgreSQL tables, and I think that Hadapt’s PostgreSQL tables are technically HDFS files. I don’t think Hadapt makes much use of HCatalog.

*Hacked to allow Hadapt to offer more than just SQL/Hadoop integration.

Splice Machine is a new entrant (public beta is imminent) that has put Apache Derby over an HBase back end. (Apache Derby is the former Cloudscape, an embeddable Java RDBMS that was acquired by Informix and hence later by IBM.) Splice Machine runs on your Hadoop nodes as an HBase coprocessor. Its relationship to non-HBase parts of Hadoop is arm’s-length. I wish this weren’t called “SQL-on-Hadoop”.

Related links

Dan Abadi and Dave Dewitt opined last June about how to categorize Hadapt and Polybase.
My most detailed discussions of Impala and Stinger were last June and August, respectively.

Categories: Cloudera, Data integration and middleware, EAI, EII, ETL, ELT, ETLT, Hadapt, Hadoop, HBase, Hortonworks, MapReduce, Microsoft and SQL*Server, NewSQL, PostgreSQL, SQL/Hadoop integration, Teradata

Subscribe to our complete feed!

Comments

36 Responses to “Distinctions in SQL/Hadoop integration”

Dong Jiang on February 9th, 2014 4:25 pm

How does HAWQ and Presto fit in?
Curt Monash on February 9th, 2014 5:08 pm

I get the impression that Presto is comparable to Impala.

As for Hawq, I don’t know a lot past http://www.dbms2.com/2013/02/25/greenplum-hawq-pivotal-hd/
Matt Brandwein on February 9th, 2014 10:51 pm

Hi Curt,

A couple of relevant distinctions re: Stinger (and I think my colleagues at Hortonworks would agree):

1) Stinger is not a technology. It is Hortonworks’ brand name for their roadmap/project to improve Hive, inclusive of Tez and other work. The Hive community, in this case, includes Hortonworks employees as well as many others who work on Hive, including several Cloudera developers.

2) re: “(and presumably also Apache’s)” This statement, as written, isn’t valid. The Apache Software Foundation (ASF) is basically a governance organization for collaborative development; it does not endorse or drive any particular Apache project. The market at-large and the developer community choose standards and preferred technology.

re: Impala and Hive, I feel compelled to mention that Impala complements Hive, rather than replaces it (today). Put simply:
* Impala is the preferred solution for fast, interactive SQL.
* Hive remains a preferred solution for batch SQL.
* We see Spark emerging as an attractive new option for fast batch processing and analytics.

Further, Hive is not disappearing from CDH. It’s an important capability that our customers rely on and we will continue supporting it as an enterprise software vendor.
Mark Callaghan on February 9th, 2014 11:05 pm

Would be nice to read your opinion of Impala vs Presto.
Curt Monash on February 9th, 2014 11:15 pm

It would be nice to have an opinion about Impala vs Presto. However, I’ve never talked with anybody about Presto, nor otherwise gotten much information about it.
Marcel Kornacker on February 10th, 2014 2:30 am

Mark, we have run some benchmark workloads against Presto and will post the results shortly.
Hyeongchae, Lee on February 10th, 2014 11:06 am

Hi Curt,

How do you think about Tajo?

http://tajo.incubator.apache.org/
Curt Monash on February 10th, 2014 11:12 am

Sorry, but nobody has convinced me I should care about Tajo.

Meanwhile, I realize I forgot to mention my clients at Citus Data. In my defense, I haven’t chatted with them for a few months to catch up on the latest, but to a first approximation we can think of them as somewhat like Hadapt.
John on February 10th, 2014 1:22 pm

Very good coverage as always Curt. Sounds like Splicemachine will be more SQL on Hbase?

The bigger question is what problem does SQL on hadoop really solve?

thanks
Vlad Rodionov on February 10th, 2014 1:32 pm

>> Would be nice to read your opinion of Impala vs Presto.

Presto level of Hive compatibility is much lower at this time. Presto is moving in its own direction and this may be not a right decision.
Curt Monash on February 10th, 2014 2:19 pm

Vlad,

Is Presto’s problem Hive compatibility or ANSI SQL compatibility? If it’s only the former, I fail to see why that’s bad.
Curt Monash on February 10th, 2014 2:21 pm

Correct, John. Splice Machine is “on Hadoop” only insofar as it is on HBase.

That certainly saves the vendor some resources, but I don’t know what it does directly for the customer.

Perhaps in the future they’ll imitate other DBMS running on Hadoop clusters, by saying “Well, if we’re there anyway, let’s see how we can upgrade our interoperability.”
Vlad Rodionov on February 10th, 2014 5:42 pm

>>Is Presto’s problem Hive compatibility or ANSI >>SQL compatibility? If it’s only the former, I >>fail to see why that’s bad.
Both, I think. I have not spent much time comparing HiveQL and PrestoSQL though. HiveQL is becoming standard de-facto for SQL-on-Hadoop. UDF, UDAF, UDTF in Java and embedded MapReduce are not part of SQL specification, but are supported by HiveQL. Impala recently introduced UDF, UDAF support. Presto, to my best knowledge, do not support these features yet.
Monte Zweben on February 10th, 2014 6:45 pm

HBase enables real-time, small updates to records stored on Hadoop. By leveraging, HBase and some of its newer features like co-processors, Splice Machine is able to bring a complete parallelized ANSI SQL to the Hadoop community with full CRUD operations and ACID transactions. Instead of being just a batch repository for data science — the Splice Machine database can power real-time applications. For example, one customer is replacing Oracle in a third-party campaign management application with Spice Machine. This requires both OLAP and OLTP features.
Curt Monash on February 10th, 2014 7:02 pm

Monte,

The more good scale-out NewSQL offerings, the better. What I’m still not convinced of, however, is why the “Hadoop community” should care any more about Splice Machine than the rest of the world does.

Anyhow, I look forward to understanding your pitch as to why Splice Machine is a good NewSQL offering. We didn’t get around to that part when we talked.
Glen Sheffield on February 13th, 2014 10:34 am

Hi Curt, just to mention that IBM has Big SQL which ships as part of the BigInsights Hadoop distribution. This is an SQL on Hadoop implementation based on your distinction. It has pretty good SQL support with elements of SQL 92, SQL 99, SQL 2003. It uses Hive metadata, and any existing Hive tables are immediately queryable. More info here
http://www.ibm.com/developerworks/library/bd-bigsql/index.html?S_TACT=105AGX99&S_CMP=CP
John on February 14th, 2014 11:58 am

Curt, Interesting discussion. Would love to know your thoughts on the above comment you made “More Newsql offerings, the better”. Why do you think more is better for Newsql and is Splice solving similar problem to Newsql vendors and can it match the performance of Newsql on Hadoop?
Curt Monash on February 14th, 2014 6:09 pm

John,

Splice Machine offers a new SQL RDBMS that they think is not-just-analytic, and which scales out. (And so far as I can tell nothing else.) That’s dead center in the NewSQL category.

The NewSQL vendors actually haven’t accomplished all that much to date, collectively, at least when compared with the NoSQL community. So if somebody wants to try to do better, bring it on. I’m skeptical, because of the difficulties — but it’s not my time nor money. So knock yourself out!
Hyunsik Choi on February 14th, 2014 6:31 pm

Hi Curt,

Tajo is different from Hadapt. Tajo does not use another underlying DBMS system. It is developed in a from-the-scratch manner. It was an academic prototype. You can find its origin at http://www.computer.org/csdl/proceedings/icde/2013/4909/00/06544934-abs.html. Now, Tajo as one of Apache incubation projects. It has been actively developed with various team members.

I’m going to explain briefly its architecture in terms of local execution engine and distributed execution engine.

Tajo has its own high thoughput execution engine specialized for analytical processing. It is implemented in Jara. It is very similar to RDBMS’s one and includes various physical executors, such as hash join, merge join, sort aggregation, hash aggregation, and external sort. Each local execution engine must be placed on the same physical node at which HDFS data node is running. Like MapReduce, It directly reads HDFS, and makes use of the locality of HDFS blocks.

Even though it is a Hadoop-based system, Tajo does not use MapReduce and Tez. Tajo has its own distributed execution specialized for relational processing. The distributed execution engine internally manages a running query as a DAG of query fragments. It provides hash/range repartitioning for exchaning data between nodes.

Interestingly, some key features of distributed layer follows the philosophy of MapReduce. it mainly uses pull-based data transmission for repartitions. It includes rescheduling based fault-tolerant and dynamic scheduling which MapReduce does.

When it comes to workloads, Tajo is not specified for a certain workload like interactive or batch. I daringly assert that Tajo works well for both low latency and long-time batch queries. Tajo guys see that it is just a query optimization problem from various optimization opportunities. They believe that Tajo can do.
Thomas V on February 15th, 2014 5:53 am

Hi,

There is another SQL-on-HBase initiative, Apache Phoenix. It relies on coprocessors too, turning HBase (+ the jdbc client) into a parallel DB. There is support for secondary indices and “virtual columns” – columns that are filled by other processes and are thus not registered into Phoenix’s metadata.

@Vlad: ANSI SQL support is still sought for because it’s key to integrating with traditional BI tools that are not fully Hadoop-aware. As an example, take IBM SPSS: they have just (dec. 2013) released a version with native support for Hive. Too bad, it only supports Hadoop 1.x and they release a new version every year. By the time it supports 2.x, the YARN/Tez/Spark/whatever landscape will have changed once or twice, again.
Asaf Birenzvieg on February 17th, 2014 12:32 pm

Hi Curt,
I recommend to take a look at http://jethrodata.com
John on February 18th, 2014 5:29 pm

Curt,

Curious to know if you will start covering NoSQL vendors also?
Curt Monash on February 18th, 2014 5:33 pm

John,

Not likely, as I would have to stop before I could re-start, and I don’t plan on stopping any time soon.

See for example http://www.dbms2.com/category/database-theory-practice/nosql/, or the specific sections on MongoDB, Cassandra, DataStax, Couchbase et al.
Cloudera, Impala, data warehousing and Hive | DBMS 2 : DataBase Management System Services on May 1st, 2014 11:37 pm

[…] survey of SQL/Hadoop integration (February, […]
Sachin Raghunath on May 13th, 2014 9:54 am

Is there a way where I can use some kind of abstraction layer to write a query. If the latest 1 year data is in teradata and previous 9 years to it is in Hadoop, then in such a case, can a query is possible to first check for data in teradata and if not found, then run the same query on Hadoop to get the data. Are such kind of abstractions available.
Curt Monash on May 13th, 2014 11:13 am

Sure. The only questions are performance and SQL functionality coverage. Most DBMS have some sort of External Tables capability. There also are third-party query brokers such as Cirro.

In the specific case of Teradata and Hadoop, there’s SQL-H.
Madhavan on August 18th, 2014 1:19 pm

Curt,

From what I understand, fixed schema systems are good analytic DBs but not good for polystructured data. Dynamic schema systems are good for poly structured data but not good for analytics (but only for short-requests).

Have I got that right?

If so, is SQL on Hadoop the only answer to analytics on dynamic schema, polystructured data?

Is there room for a NoSQL DB that can handle dynamic schema but also function as a full-steck analytics DB? Is that even practically possible?
Curt Monash on August 18th, 2014 1:35 pm

Madhavan,

I try to avoid stating generalities that absolutely. 🙂 But I would say that if your analytics involve a lot of joins, then for performance you will likely need something that bears a reasonable resemblance to an analytic RDBMS.

Anyhow, the Wibidata/Kiji port to Cassandra — which also now has Spark in the mix — would be an example of an analytics stack that involves neither a tabular DBMS nor Hadoop. And if you check my posts on schema-on-need, it doesn’t assume Hadoop (consider for example Vertica).
Madhavan on August 18th, 2014 2:19 pm

Thanks Curt.

Here is my problem with the whole Schema-on-Read business. Apart from query performance, there is that issue of not having fact tables to help me understand data. Materializing columns comes after schema has been read, which to me is a big pain point.

That is why I think a physical, NoSQL analytic DB makes whole lot more sense for polystructured, dynamic schema data than virtual schema-on-read/ need systems. Sure, in some cases with deep nested structures, flattening it would lead to a large number of columns. But for most enterprises, that would seem to be an outlier.

What do you think?
Curt Monash on August 19th, 2014 8:39 pm

Madhavan,

I suspect you of thinking that physical and logical architectures need to be more tightly correlated than they actually do.

To put it another way — I don’t understand what you’re saying, and I suspect that’s because you’re making assumptions I don’t share, and what I said above points to a guess as to what those assumptions might be.
Madhavan on August 28th, 2014 12:35 pm

Let me try to explain:

In RDBMS or in SQL-on-Hadoop, one has to define a bounded/ fixed logical model (or schema) to describe data before analytics can be done. (IMO, SQL on Hadoop just defers the pain of schema definition).

Query optimization is directly linked to schema and stats about data in the schema (hence why logical and physical architectures are not as divorced as one would like).

To me, this approach is (1) laborious and (2) limited to the data that the schema is capable of serving.

I am thinking of a DB that accommodates dynamic schema – like NoSQL and is capable of highly optimized joins, like SQL.

Holy grail, much?
Curt Monash on August 30th, 2014 11:04 am

If you want to use SQL, you surely will want to include FROM clauses. But what those FROM clauses point at can be rather different in different systems.

Performance, I think, generally depends upon speed of evaluation of WHERE clauses, and that speed can vary wildly depending upon optimizations.

OK. What else are you asking or saying?
Madhavan on September 1st, 2014 8:42 am

That is a nice way of putting it.

Here is what I am saying: In all DW constructs today, structure of data (where is what) & logic of data (what are entities and how are they related) are tightly coupled. Some are optimized for handling dynamic structures (like NoSQL) and others for dynamic logic (like RDBMS). It is an either/or scenario and I think this is because of this tight coupling between structure and logic. (SQL on Hadoop tries to optimize both and falls short in both – because you still need to discover/ define a bounded structure & logic at query time plus you don’t have the advantage of well defined query plans).

How about separating the two? Imagine a system that is structure-aware at ingestion but not burdened with relational logic yet.

Define relational queries on the fly because there is enough structure to support FROM operations and to guide WHERE operations. ALSO, use the knowledge of structure to optimize queries with intelligent/ dynamic query routing.
Tim Fellows on November 7th, 2014 2:24 am

@Madhaven

Per your Holy Grail comment. Apache Drill project may be what you’re looking for.
Antonio Romero on December 4th, 2014 2:37 pm

Sachin Raghunath: What you’re looking for is at a basic level federated queries… or at a more complete level, data virtualization. One mature solution for your particular use case is Cisco Information Server, formerly Composite Information Server (disclosure: I’m one of the PMs, so I’m partial). There are others that I won’t name in detail (since they don’t feed me) but a little Googling will quickly reveal the contenders in this area. Wikipedia has a good article on DV: http://en.wikipedia.org/wiki/Data_virtualization
Curt Monash on December 4th, 2014 3:12 pm

I’m going to jump in here and remind folks that terminology around “data virtualization” can be a real mess — http://www.dbms2.com/2013/01/05/database-virtualization-data/

Leave a Reply

Search our blogs and white papers

Monash Research blogs

DBMS 2 covers database management, analytics, and related technologies.
Text Technologies covers text mining, search, and social software.
Strategic Messaging analyzes marketing and messaging strategy.
The Monash Report examines technology and public policy issues.
Software Memories recounts the history of the software industry.

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.

Links
- Monash Research
- White Papers
Admin
- Log in

Distinctions in SQL/Hadoop integration

Comments

Search our blogs and white papers

Monash Research blogs

User consulting

Vendor advisory

Monash Research highlights

Recent posts

Categories

Date archives

Admin