November 1, 2012

More on Cloudera Impala

What I wrote before about Cloudera Impala was quite incomplete. After a followup call, I now feel I have a better handle on the whole thing.

First, some basics:

Impala is open source code, developed to date entirely by Cloudera people, which adds analytic DBMS capabilities to Hadoop as an alternative to Hive.
Impala is in public beta, and is targeted for general availability Q1 2013 or so.
Cloudera plans to get paid for Impala by providing support, and by offering Impala management through its proprietary Cloudera Manager.
Impala has been under development for about 2 years. A team of 7 or so developers has been mainly in place for over a year. Furthermore, …
… notwithstanding that it’s best viewed as a Hive alternative, Impala actually reuses a lot of Hive.

The general technical idea of Impala is:

It’s an additional daemon that runs on each of your Hadoop nodes.
Thus, Impala is not subject to Hadoop MapReduce’s latency in starting up Java processes or in storing intermediate result sets to disk.
Impala operates as a distributed parallel analytic DBMS.*
Impala works with a variety of Hadoop storage options, each with its own implications for latency or performance.

*With no “fat head”.

Impala is of course a young system, and very much a work in progress. It has a variety of limitations in functionality, performance, and so on, many (all?) of which are slated to be addressed down the road. While different individuals may espouse different views at different times, I think it’s not too misleading to summarize Cloudera’s strategic positioning for Impala as:

A core use case for Hadoop is to process or transform data. SQL can help with that, and hence so can Impala.
A core use case for Hadoop is machine learning. SQL can help with that, and hence so can Impala.
Both due to its Hadoop integration and other features, HBase is getting significant usage. You might want to do SQL against your HBase data. Impala can help with that.
Some enterprises choose to have much larger clusters for Hadoop than they do for their relational DBMS. For them, Impala may give pretty good analytic SQL performance, by throwing hardware at the problem.

Thinking about Impala performance is confusing, on any level of detail beyond:

Impala is going to be (much) faster than Hive …
… but slower than a serious and more mature analytic RDBMS.

But let’s try anyway.

As of the initial Impala release(s):

Impala will run against a variety of storage managers, choices among which will have different performance implications. HDFS (Hadoop Distributed File System) and HBase will both be supported. Multiple HDFS formats will be supported, both row-based and columnar. (See the Trevni comments in my first Impala post.)
In the simplest of scanning scenarios, Impala can read row-based data at near the theoretically optimum speed, while Hive runs at 1/3 of that.
Initially, all Impala joins will be (distributed) hash joins. These seem to start at 10X Hive’s performance and go up from there.
The fastest Impala queries take > 1 second.
One test showed Impala surviving a load of 100 concurrent queries. Another test showed Impala running 10 cloned copies of a query with 25%ish performance degradation.
Impala will have MicroStrategy support on Day 1, so it obviously can handle fairly complex SQL. (Also Pentaho, Tableau, and QlikView.)
Column statistics and the like are under active development, which will help in query optimization. A true cost-based optimizer is, of course, further off.
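
To make the "(distributed) hash joins" point above a little more concrete, here is a minimal sketch of one common way to distribute a hash join: hash-partition both inputs on the join key so that matching rows land on the same node, then run an ordinary build-and-probe hash join on each node. This is an illustration only, not Impala's code. The function names, the dict-based rows, and the choice of repartitioning (rather than, say, broadcasting the smaller table) are all assumptions made for the example, and the "nodes" are simulated as lists in one process.

```python
from collections import defaultdict


def partition(rows, key, num_nodes):
    """Route each row to a (simulated) node by hashing its join key."""
    parts = [[] for _ in range(num_nodes)]
    for row in rows:
        parts[hash(row[key]) % num_nodes].append(row)
    return parts


def local_hash_join(build_rows, probe_rows, key):
    """Classic hash join: build a table on one input, probe with the other."""
    table = defaultdict(list)
    for row in build_rows:
        table[row[key]].append(row)
    joined = []
    for probe in probe_rows:
        for match in table.get(probe[key], []):
            joined.append({**match, **probe})
    return joined


def distributed_hash_join(build_rows, probe_rows, key, num_nodes=4):
    """Partition both inputs the same way, then join each partition pair."""
    build_parts = partition(build_rows, key, num_nodes)
    probe_parts = partition(probe_rows, key, num_nodes)
    result = []
    # In a real cluster each pair would be joined on a different node in parallel.
    for build_part, probe_part in zip(build_parts, probe_parts):
        result.extend(local_hash_join(build_part, probe_part, key))
    return result


# Hypothetical sample data, purely for illustration.
customers = [{"cust_id": 1, "name": "a"}, {"cust_id": 2, "name": "b"}]
orders = [
    {"cust_id": 1, "amount": 10},
    {"cust_id": 2, "amount": 20},
    {"cust_id": 1, "amount": 5},
    {"cust_id": 3, "amount": 7},
]
joined_rows = distributed_hash_join(customers, orders, "cust_id", num_nodes=3)
```

Because both inputs are partitioned by the same hash of the join key, no node ever needs rows held by another node once the shuffle is done, and intermediate results can stay in memory rather than being written to disk between stages.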

Cloudera’s marketing name for Impala will be “Real Time Query”, but that seems a dubious match to early-release Impala reality.

In many cases, the best Impala performance — and indeed the best Hadoop performance overall — will probably come over Trevni, which Cloudera believes will be 30% or so faster than the current columnar option RCFile. This led me to inquire how data would get into Trevni, presuming that it’s initially loaded into some other format. Cloudera is hoping to have a background process for that available Day 1, but I have no details about it. (The other alternative would be to do a batch MapReduce job.) Cloudera also points out that both Flume and HBase can get data into Hadoop with very low latency.
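
As background for why a columnar format can matter at all, the toy sketch below contrasts a row-based layout with a columnar one for a query that aggregates a single column. It does not model the actual on-disk layout of Trevni or RCFile (both of which are columnar), and it says nothing about the 30% figure above. The table contents, function names, and the "values touched" counter are assumptions invented for the illustration.

```python
# Hypothetical three-column table, stored row by row.
rows = [
    {"user_id": 1, "country": "US", "revenue": 10.0},
    {"user_id": 2, "country": "DE", "revenue": 20.0},
    {"user_id": 3, "country": "US", "revenue": 5.5},
]


def to_columns(row_store):
    """Pivot a list of rows into one list of values per column."""
    columns = {}
    for row in row_store:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_row_store(row_store, column):
    """Sum one column from a row layout; every field of every row is read."""
    total = 0.0
    touched = 0
    for row in row_store:
        touched += len(row)  # the whole row comes off disk together
        total += row[column]
    return total, touched


def sum_column_store(column_store, column):
    """Sum one column from a columnar layout; only that column is read."""
    values = column_store[column]
    return sum(values), len(values)


columns = to_columns(rows)
row_total, row_touched = sum_row_store(rows, "revenue")
col_total, col_touched = sum_column_store(columns, "revenue")
```

Both layouts give the same answer, but the row layout reads every field to get at one of them. The flip side is that data usually arrives a row at a time, so something has to do the pivot that `to_columns` stands in for here.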

Given the obvious potential synergy between Impala — a specialized alternative to MapReduce — and YARN, Cloudera has redoubled its efforts to (help) get YARN up to production quality.

Finally, there’s the question of what Impala actually does. In its initial release, it will support a large, strict subset of Hive functionality. That helps with reusing a lot of Hive infrastructure and connectivity, of course. But it also means that you don’t have real updates; rather, you load in bulk. Similarly, there’s a lot of analytic SQL functionality that’s not directly supported. Down the road, it’s reasonable to expect Impala functionality to extend in (at least) two directions:

More SQL capability.
Dremel-like capability to handle nested data structures.

Categories: Cloudera, Data models and architecture, Data warehousing, Hadoop, HBase, MapReduce, Open source, Predictive modeling and advanced analytics, SQL/Hadoop integration


Comments

12 Responses to “More on Cloudera Impala”

Notes and comments — October 31, 2012 | DBMS 2 : DataBase Management System Services on November 1st, 2012 7:13 am

[…] 4. Stay tuned for more on Cloudera Impala. (Edit: Now posted.) […]
Quick notes on Impala | DBMS 2 : DataBase Management System Services on November 1st, 2012 7:17 am

[…] There is now a follow-up post on Cloudera Impala with substantially more […]
Al DeLosSantos on November 1st, 2012 11:02 am

Thanks again to you and your audience for very helpful posts and comments Curt. I also found this historical post and discussion thread useful when I was searching for additional reference material:
http://www.dbms2.com/2010/07/29/how-should-somebody-teach-themselves-programming-skills/
Regards,
Al D.
Patrick McFadin on November 3rd, 2012 12:14 am

Do you have any sense of how this will stack up against Apache Drill? It’s clear that Impala is way down the development path in comparison, but I’m wondering if they will end up in different places.
Curt Monash on November 3rd, 2012 11:16 am

Patrick,

I don’t know as much about Drill/Dremel as I should. More later.
Paper Trail » Blog Archive » Cloudera Impala on November 4th, 2012 9:12 pm

[…] Monash has a writeup (although he does make it sound like no query will return in under one second, which isn’t […]
Shark: Real-time queries and analytics for big data - Strata on November 27th, 2012 12:24 pm

[…] Hadoop. There have been many good articles written about Impala since its release (see here & here), so I won’t go into its design details. I will highlight the impressive performance numbers […]
Introduction to Spark, Shark, BDAS and AMPLab | DBMS 2 : DataBase Management System Services on December 13th, 2012 5:54 pm

[…] think of this as a big deal in complex query execution, for example as an aspect of the design of Impala or Hadapt. But it’s perhaps even more important in iterative machine learning algorithms, […]
DBMS development and other subjects | DBMS 2 : DataBase Management System Services on March 18th, 2013 1:31 am

[…] aren’t exceptions to the cardinal rules of DBMS development. That applies to Impala (Cloudera), Stinger (Hortonworks), and Hadapt, among others. Fortunately, the relevant vendors seem to be well […]
SQL-Hadoop architectures compared | DBMS 2 : DataBase Management System Services on January 31st, 2014 9:05 am

[…] SQL-H and Hadapt (October, 2012) […]
Shark zoom zoom | Velankani Information Systems, Inc on January 31st, 2014 8:02 pm

[…] There have been many good articles written about Impala since its release (see here & here), so I won’t go into its design details. I will highlight the impressive performance numbers put […]
Teradata bought Hadapt and Revelytix | DBMS 2 : DataBase Management System Services on July 23rd, 2014 4:29 am

[…] after the announcement of Cloudera Impala, Hadapt’s SQL-on-Hadoop positioning didn’t work out. Indeed, Hadapt laid off most or […]
