Edit: There is now a follow-up post on Cloudera Impala with substantially more detail.
In my world it’s possible to have a hasty 2-hour conversation, and that’s exactly what I had with Cloudera last week. We touched on hardware and general adoption, but much of the conversation was about Cloudera Impala, announced today. Like Hive, Impala turns Hadoop into a basic analytic RDBMS, with SQL/Hadoop integration benefits similar to those of Hadapt. In particular:
- Impala is Hive-compatible in query language (HQL, which is a whole lot like SQL), metadata, JDBC/ODBC drivers, etc.
- Unlike Hive, Impala does not work through Hadoop MapReduce.
- Unlike Hadoop MapReduce and hence Hive, Impala does not persist intermediate results to disk. This is good for performance, but on extremely long-running queries it increases the risk you’ll have a node failure and have to restart the query from scratch.
- Impala in its first version is missing some Hive syntax, notably in support for UDFs (User-Defined Functions).
- Impala is free open-source code.
- Not in time for the Impala beta, but planned for Impala’s general availability, is a column-oriented storage option called Trevni. Impala’s best performance will generally come over Trevni.
- Trevni has a variety of block-level compression options. (64 KB block size.) Columnar compression, especially dictionary encoding, is a roadmap item.
- Support for nested data structures is a roadmap item, via both Trevni and Avro, except that some limited support may be available via Trevni at GA.
- It obviously will be quite a while before either Impala or Hadapt has a cost-based optimizer (as opposed to a rule-based/heuristic one). My unsubstantiated guess is that this is more of a problem for complex queries than simple ones.
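To put rough numbers on the restart risk mentioned above, here is a back-of-the-envelope sketch in Python. Everything in it is an assumption for illustration: the function name, the 100-node cluster, the 3-year mean time between failures, and the simplification that nodes fail independently with exponentially distributed lifetimes. None of it comes from Cloudera.

```python
import math


def restart_probability(nodes: int, query_hours: float, node_mtbf_hours: float) -> float:
    """Chance that at least one node fails while the query is running.

    Assumes independent nodes with exponentially distributed time to failure,
    and that any single node failure forces the whole query to restart.
    """
    expected_failures = nodes * query_hours / node_mtbf_hours
    return 1.0 - math.exp(-expected_failures)


# Hypothetical cluster: 100 nodes, each failing on average once every 3 years.
MTBF_HOURS = 3 * 365 * 24

for hours in (0.01, 1, 24):
    p = restart_probability(100, hours, MTBF_HOURS)
    print(f"{hours:>6} h query: {p:.4%} chance of at least one node failure")
```

Under those assumptions the risk is negligible for queries that finish in seconds and only becomes noticeable for queries that run for many hours, which matches the point that this tradeoff matters mainly for extremely long-running queries.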
On the whole, Impala seems less mature or capable than Hadapt. But Impala does have a few countervailing advantages:
- It’s one less thing to pay for.
- It’s one less thing to administer. (Assuming its integration into Hadoop is tight enough to make that true.)
- It could be faster in some use cases (because it will have columnar storage sooner).
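For readers who want to see why columnar storage can speed up some queries, here is a toy Python sketch. It is not Trevni’s actual file format; the table, the column names, and the use of zlib are all stand-ins chosen for illustration. The only detail taken from above is the 64 KB block size with block-level compression.

```python
import struct
import zlib

BLOCK_SIZE = 64 * 1024  # uncompressed bytes per block, per the figure quoted above


def encode_column(values):
    """Pack 64-bit integers and compress them in independent fixed-size blocks."""
    raw = struct.pack(f"<{len(values)}q", *values)
    return [zlib.compress(raw[i:i + BLOCK_SIZE]) for i in range(0, len(raw), BLOCK_SIZE)]


def decode_column(blocks):
    """Decompress every block of one column and unpack the integers."""
    raw = b"".join(zlib.decompress(block) for block in blocks)
    return list(struct.unpack(f"<{len(raw) // 8}q", raw))


# Hypothetical table with three integer columns.
rows = [(i, i % 7, i * 3) for i in range(50_000)]

# Column-oriented layout: each column is stored as its own list of compressed blocks.
columns = {
    name: encode_column([row[idx] for row in rows])
    for idx, name in enumerate(("id", "weekday", "amount"))
}

# Equivalent of SELECT SUM(amount): only the "amount" blocks are touched.
total = sum(decode_column(columns["amount"]))
bytes_read = sum(len(block) for block in columns["amount"])
bytes_all = sum(len(block) for blocks in columns.values() for block in blocks)
print(f"sum={total}, read {bytes_read} of {bytes_all} compressed bytes")
```

A query that needs one column out of three reads only that column’s blocks, whereas a row-oriented layout would have to scan every row in full. That is the kind of use case where having columnar storage sooner could matter.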