November 1, 2012

More on Cloudera Impala

What I wrote before about Cloudera Impala was quite incomplete. After a follow-up call, I now feel I have a better handle on the whole thing.

First, some basics:

The general technical idea of Impala is:

*With no “fat head”.

Impala is of course a young system, and very much a work in progress. It has a variety of limitations in functionality, performance, and so on, many (all?) of which are slated to be addressed down the road. While different individuals may espouse different views at different times, I think it’s not too misleading to summarize Cloudera’s strategic positioning for Impala as:

Thinking about Impala performance is confusing, on any level of detail beyond:

But let’s try anyway.

As of the initial Impala release(s):

Cloudera’s marketing name for Impala will be “Real Time Query”, but that seems a dubious match to early-release Impala reality.

In many cases, the best Impala performance — and indeed the best Hadoop performance overall — will probably come over Trevni, which Cloudera believes will be 30% or so faster than the current columnar option RCFile. This led me to inquire how data would get into Trevni, presuming that it’s initially loaded into some other format. Cloudera is hoping to have a background process for that available Day 1, but I have no details about it. (The other alternative would be to do a batch MapReduce job.) Cloudera also points out that both Flume and HBase can get data into Hadoop with very low latency.

Given the obvious potential synergy between Impala — a specialized alternative to MapReduce — and YARN, Cloudera has redoubled its efforts to (help) get YARN up to production quality.

Finally, there’s the question of what Impala actually does. In its initial release, it will support a large, strict subset of Hive functionality. That helps with reusing a lot of Hive infrastructure and connectivity, of course. But it also means that you don’t have real updates; rather, you load in bulk. Similarly, there’s a lot of analytic SQL functionality that’s not directly supported. Down the road, it’s reasonable to expect Impala functionality to extend in (at least) two directions:

Comments

12 Responses to “More on Cloudera Impala”

  1. Notes and comments — October 31, 2012 | DBMS 2 : DataBase Management System Services on November 1st, 2012 7:13 am

    […] 4. Stay tuned for more on Cloudera Impala. (Edit: Now posted.) […]

  2. Quick notes on Impala | DBMS 2 : DataBase Management System Services on November 1st, 2012 7:17 am

    […] There is now a follow-up post on Cloudera Impala with substantially more […]

  3. Al DeLosSantos on November 1st, 2012 11:02 am

    Thanks again to you and your audience for very helpful posts and comments, Curt. I also found this historical post and discussion thread useful when I was searching for additional reference material:
    http://www.dbms2.com/2010/07/29/how-should-somebody-teach-themselves-programming-skills/
    Regards,
    Al D.

  4. Patrick McFadin on November 3rd, 2012 12:14 am

    Do you have any sense of how this will stack up against Apache Drill? It’s clear that Impala is way down the development path in comparison, but I’m wondering if they will end up in different places.

  5. Curt Monash on November 3rd, 2012 11:16 am

    Patrick,

    I don’t know as much about Drill/Dremel as I should. More later.

  6. Paper Trail » Blog Archive » Cloudera Impala on November 4th, 2012 9:12 pm

    […] Monash has a writeup (although he does make it sound like no query will return in under one second, which isn’t […]

  7. Shark: Real-time queries and analytics for big data - Strata on November 27th, 2012 12:24 pm

    […] Hadoop. There have been many good articles written about Impala since its release (see here & here), so I won’t go into its design details. I will highlight the impressive performance numbers […]

  8. Introduction to Spark, Shark, BDAS and AMPLab | DBMS 2 : DataBase Management System Services on December 13th, 2012 5:54 pm

    […] think of this as a big deal in complex query execution, for example as an aspect of the design of Impala or Hadapt. But it’s perhaps even more important in iterative machine learning algorithms, […]

  9. DBMS development and other subjects | DBMS 2 : DataBase Management System Services on March 18th, 2013 1:31 am

    […] aren’t exceptions to the cardinal rules of DBMS development. That applies to Impala (Cloudera), Stinger (Hortonworks), and Hadapt, among others. Fortunately, the relevant vendors seem to be well […]

  10. SQL-Hadoop architectures compared | DBMS 2 : DataBase Management System Services on January 31st, 2014 9:05 am

    […] SQL-H and Hadapt (October, 2012) […]

  11. Shark zoom zoom | Velankani Information Systems, Inc on January 31st, 2014 8:02 pm

    […] There have been many good articles written about Impala since its release (see here & here), so I won’t go into its design details. I will highlight the impressive performance numbers put […]

  12. Teradata bought Hadapt and Revelytix | DBMS 2 : DataBase Management System Services on July 23rd, 2014 4:29 am

    […] after the announcement of Cloudera Impala, Hadapt’s SQL-on-Hadoop positioning didn’t work out. Indeed, Hadapt laid off most or […]
