What I wrote before about Cloudera Impala was quite incomplete. After a followup call, I now feel I have a better handle on the whole thing.
First, some basics:
- Impala is open source code, developed to date entirely by Cloudera people, which adds analytic DBMS capabilities to Hadoop as an alternative to Hive.
- Impala is in public beta, and is targeted for general availability Q1 2013 or so.
- Cloudera plans to get paid for Impala by providing support, and by offering Impala management through its proprietary Cloudera Manager.
- Impala has been under development for about 2 years. A team of 7 or so developers has been mainly in place for over a year. Furthermore, …
- … notwithstanding that it’s best viewed as a Hive alternative, Impala actually reuses a lot of Hive.
The general technical idea of Impala is:
- It’s an additional daemon that runs on each of your Hadoop nodes.
- Thus, Impala is not subject to Hadoop MapReduce’s latency in starting up Java processes or in storing intermediate result sets to disk.
- Impala operates as a distributed parallel analytic DBMS.*
- Impala works with a variety of Hadoop storage options, each with its own implications for latency or performance.
*With no “fat head”.
Impala is of course a young system, and very much a work in progress. It has a variety of limitations in functionality, performance, and so on, many (all?) of which are slated to be addressed down the road. While different individuals may espouse different views at different times, I think it’s not too misleading to summarize Cloudera’s strategic positioning for Impala as:
- A core use case for Hadoop is to process or transform data. SQL can help with that, and hence so can Impala.
- A core use case for Hadoop is machine learning. SQL can help with that, and hence so can Impala.
- Due both to its Hadoop integration and to its other features, HBase is getting significant usage. You might want to do SQL against your HBase data. Impala can help with that.
- Some enterprises choose to have much larger clusters for Hadoop than they do for their relational DBMS. For them, Impala may give pretty good analytic SQL performance, by throwing hardware at the problem.
Thinking about Impala performance is confusing, on any level of detail beyond:
- Impala is going to be (much) faster than Hive …
- … but slower than a serious and more mature analytic RDBMS.
But let’s try anyway.
As of the initial Impala release(s):
- Impala will run against a variety of storage managers, choices among which will have different performance implications. HDFS (Hadoop Distributed File System) and HBase will both be supported. Multiple HDFS formats will be supported, both row-based and columnar. (See the Trevni comments in my first Impala post.)
- In the simplest of scanning scenarios, Impala can read row-based data at near the theoretically optimum speed, while Hive runs at 1/3 of that.
- Initially, all Impala joins will be (distributed) hash joins. These seem to start at 10X Hive’s performance and go up from there.
- Even the fastest Impala queries take more than 1 second.
- One test showed Impala surviving a load of 100 concurrent queries. Another test showed Impala running 10 cloned copies of a query with 25%ish performance degradation.
- Impala will have MicroStrategy support on Day 1, so it obviously can handle fairly complex SQL. (Also Pentaho, Tableau, and QlikView.)
- Column statistics and the like are under active development, which will help in query optimization. A true cost-based optimizer is, of course, further off.
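To make the "(distributed) hash join" idea in the list above a bit more concrete, here is a minimal single-process sketch in Python. It is not Impala's actual implementation; the function names, the `num_nodes` parameter, and the sample tables are all hypothetical, and the "nodes" are just in-memory lists. The point is only the shape of the technique: hash-partition both inputs on the join key so matching rows land on the same node, then build and probe a hash table locally on each node.

```python
from collections import defaultdict


def partition(rows, key, num_nodes):
    """Route each row to a (simulated) node by hashing its join key."""
    parts = [[] for _ in range(num_nodes)]
    for row in rows:
        parts[hash(row[key]) % num_nodes].append(row)
    return parts


def local_hash_join(build_rows, probe_rows, key):
    """Classic in-memory hash join: build a table on one side, probe with the other."""
    table = defaultdict(list)
    for row in build_rows:
        table[row[key]].append(row)
    joined = []
    for row in probe_rows:
        for match in table.get(row[key], []):
            joined.append({**match, **row})
    return joined


def distributed_hash_join(left, right, key, num_nodes=4):
    """Partition both inputs on the join key, then join each partition pair.

    In a real cluster each partition pair would be joined on a different
    machine in parallel; here the loop stands in for that (an assumption
    made purely for illustration).
    """
    left_parts = partition(left, key, num_nodes)
    right_parts = partition(right, key, num_nodes)
    result = []
    for build_rows, probe_rows in zip(left_parts, right_parts):
        result.extend(local_hash_join(build_rows, probe_rows, key))
    return result


if __name__ == "__main__":
    customers = [{"cust_id": 1, "name": "a"}, {"cust_id": 2, "name": "b"}]
    orders = [{"cust_id": 1, "amount": 10}, {"cust_id": 3, "amount": 7}]
    print(distributed_hash_join(customers, orders, "cust_id"))
```

Because rows with equal keys always hash to the same partition, no node needs to see another node's data after the initial shuffle, and nothing has to be written to disk between steps.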
Cloudera’s marketing name for Impala will be “Real Time Query”, but that seems a dubious match to early-release Impala reality.
In many cases, the best Impala performance — and indeed the best Hadoop performance overall — will probably come over Trevni, which Cloudera believes will be 30% or so faster than the current columnar option RCFile. This led me to inquire how data would get into Trevni, presuming that it’s initially loaded into some other format. Cloudera is hoping to have a background process for that available Day 1, but I have no details about it. (The other alternative would be to do a batch MapReduce job.) Cloudera also points out that both Flume and HBase can get data into Hadoop with very low latency.
Given the obvious potential synergy between Impala — a specialized alternative to MapReduce — and YARN, Cloudera has redoubled its efforts to (help) get YARN up to production quality.
Finally, there’s the question of what Impala actually does. In its initial release, it will support a large, strict subset of Hive functionality. That helps with reusing a lot of Hive infrastructure and connectivity, of course. But it also means that you don’t have real updates; rather, you load in bulk. Similarly, there’s a lot of analytic SQL functionality that’s not directly supported. Down the road, it’s reasonable to expect Impala functionality to extend in (at least) two directions:
- More SQL capability.
- Dremel-like capability to handle nested data structures.
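For readers who haven't looked at Dremel, the rough flavor of "handling nested data structures" in a columnar way can be sketched as follows. This is only an illustration under simplifying assumptions: the helper names `flatten` and `to_columns` and the sample records are made up, and the sketch deliberately omits the repetition and definition levels that Dremel stores alongside each value, so it cannot reassemble the original records. It shows just the first step, turning nested records into one column per field path.

```python
def flatten(record, prefix=""):
    """Yield (column_path, value) pairs for every leaf value in a nested record."""
    if isinstance(record, dict):
        for name, value in record.items():
            path = f"{prefix}.{name}" if prefix else name
            yield from flatten(value, path)
    elif isinstance(record, list):
        # Repeated field: every element goes into the same column.
        for item in record:
            yield from flatten(item, prefix)
    else:
        yield prefix, record


def to_columns(records):
    """Collect leaf values from all records into one list per field path."""
    columns = {}
    for record in records:
        for path, value in flatten(record):
            columns.setdefault(path, []).append(value)
    return columns


if __name__ == "__main__":
    docs = [
        {"id": 1, "links": {"forward": [20, 40]}},
        {"id": 2, "links": {"forward": []}},
    ]
    print(to_columns(docs))
```

A query that touches only `id` can then read that one column and skip everything under `links`, which is the basic appeal of combining nested data with columnar storage.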