I visited Cloudera Friday for, among other things, a chat about Impala with Marcel Kornacker and colleagues. Highlights included:
- Impala is meant to someday be a competitive MPP (Massively Parallel Processing) analytic RDBMS.
- At the moment, it is not one. For example, Impala lacks any meaningful form of workload management or query optimization.
- While Impala will run against any HDFS (Hadoop Distributed File System) file format, claims of strong performance assume that the data is in Parquet …
- … which is the replacement for the short-lived Trevni …
- … and which for most practical purposes is true columnar.
- Impala is also meant to be more than an RDBMS; Parquet and presumably in the future Impala can accommodate nested data structures.
- Just as Impala runs against most or all HDFS file formats, Parquet files can be used by most Hadoop execution engines, and of course by Pig and Hive.
- The Impala roadmap includes workload management, query optimization, data skipping, user-defined functions, hash distribution, two turtledoves, and a partridge in a pear tree.
Data gets into Parquet via batch jobs only — one reason it’s important that Impala run against multiple file formats — but background format conversion is another roadmap item. A single table can be split across multiple formats — e.g., the freshest data could be in HBase, with the rest in Parquet.
To understand Parquet, it helps to recall that HDFS has both big blocks and ordinary blocks. The big blocks are the 1-gigabyte units that HDFS manages; they are also, at this time, the closest thing HDFS has to specific storage locations that systems — e.g., a DBMS execution engine such as Impala — can refer to. Within these big blocks, Parquet is PAX-like; i.e., it stores entire rows in the same big block, but does so a column at a time. However, the more ordinary-sized blocks that are the units of I/O should each contain data from only a single column; hence, in most cases it should be possible to retrieve only the specific columns that you want. (A toy sketch of this layout follows the list below.) Parquet’s compression scheme is:
- One big block at a time.
- By default, dictionary encoding up to a certain cardinality (I forgot to get the figure) …
- … and RLE (Run-Length Encoding) after that.
- Bit-packing as another option. (Dictionary encoding and RLE are sketched in toy code below.)
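First, here is a toy Python sketch of the PAX-style idea described above (my own illustration, not Parquet’s actual format; all names are made up): rows are batched into row groups, and within each group every column is stored contiguously, so a scan can pull one column without reading the others.

```python
def to_row_groups(rows, columns, group_size):
    # Batch rows into fixed-size "row groups"; within each group,
    # store every column contiguously (column-at-a-time).
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        groups.append({col: [row[col] for row in chunk] for col in columns})
    return groups

def scan_column(groups, col):
    # A column scan touches only that column's storage in each group.
    for group in groups:
        yield from group[col]

rows = [{"id": i, "region": "EU" if i % 2 else "US", "qty": i * 10}
        for i in range(6)]
groups = to_row_groups(rows, ["id", "region", "qty"], group_size=3)
print(list(scan_column(groups, "region")))
# -> ['US', 'EU', 'US', 'EU', 'US', 'EU']
```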
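And here is a similarly toy sketch of the dictionary and RLE ideas from the list above (again my own illustration; Parquet’s real encodings are hybrid and more involved than this):

```python
def dict_encode(values):
    # Dictionary encoding: store each distinct value once, and replace
    # the column with small integer ids into the dictionary.
    dictionary, index, ids = [], {}, []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        ids.append(index[v])
    return dictionary, ids

def rle_encode(ids):
    # Run-length encoding of the id stream: (id, run_length) pairs.
    runs = []
    for i in ids:
        if runs and runs[-1][0] == i:
            runs[-1][1] += 1
        else:
            runs.append([i, 1])
    return runs

col = ["US", "US", "US", "EU", "EU", "US"]
dictionary, ids = dict_encode(col)
print(dictionary, rle_encode(ids))
# -> ['US', 'EU'] [[0, 3], [1, 2], [0, 1]]
```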
I forgot to ask whether Impala can operate on compressed data, but based on its compression scheme I’m guessing the answer is no.
In addition to ordinary tables, Parquet can handle nested data structures, à la Dremel. That is, a field can be array-valued, a cell in the array can itself be array-valued, and so on, with arrays all the way down. (Cloudera told me that Twitter’s data is nested 9 levels deep.) If I understood correctly, none of this interferes with single-valued cells being stored in a columnar way; not coincidentally, I got the impression that at least within each big block, there’s a consistent schema.
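The Dremel trick is to keep nested fields columnar by storing a flat column of values plus “repetition levels” that record where each record’s nesting starts. A toy one-level Python version (my own illustration; real Parquet also needs definition levels to handle nulls and empty lists, which this skips):

```python
def encode_repeated(records):
    # Flatten lists into a flat value column plus repetition levels:
    # level 0 marks the first value of a new record, 1 a continuation.
    values, rep_levels = [], []
    for rec in records:
        for i, v in enumerate(rec):
            values.append(v)
            rep_levels.append(0 if i == 0 else 1)
    return values, rep_levels

def decode_repeated(values, rep_levels):
    # Rebuild the nested records from the two flat columns.
    records = []
    for v, r in zip(values, rep_levels):
        if r == 0:
            records.append([])
        records[-1].append(v)
    return records

tags = [["a", "b"], ["c"], ["d", "e", "f"]]
values, reps = encode_repeated(tags)
print(values, reps)  # -> ['a', 'b', 'c', 'd', 'e', 'f'] [0, 1, 0, 0, 1, 1]
print(decode_repeated(values, reps) == tags)  # -> True
```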
As for Impala joins and so on:
- Impala does in-memory hash joins …
- … on either a broadcast or partitioned (shuffle) basis. (Both are sketched in toy code after this list.)
- Choosing between join algorithms is the one thing Impala’s optimizer can do. (The rest is waiting on better stats/metadata in the base files.)
- Joins are fully parallelized when it makes sense. Any Impala daemon/node can be in charge of client communication for a particular query; beyond that, there’s no special “head” node.
- Different tables can have different replication factors. So in particular, you can replicate a small table to every node.
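To make the broadcast-vs-partitioned distinction concrete, here is a minimal single-process Python sketch (my own illustration, not Impala code): broadcast ships the whole small build table to every node, while partitioned redistributes both inputs by a hash of the join key so that matching rows land on the same node.

```python
from collections import defaultdict

def hash_join(build_rows, probe_rows, build_key, probe_key):
    # Classic in-memory hash join: build a hash table on the (smaller)
    # build side, then stream the probe side through it.
    table = defaultdict(list)
    for row in build_rows:
        table[row[build_key]].append(row)
    for row in probe_rows:
        for match in table.get(row[probe_key], []):
            yield {**match, **row}

def broadcast_join(small, big_partitions, small_key, big_key):
    # Broadcast: every node gets the whole small table and joins it
    # against its local partition of the big table; nothing is shuffled.
    for part in big_partitions:
        yield from hash_join(small, part, small_key, big_key)

def partitioned_join(left, right, left_key, right_key, n_nodes):
    # Partitioned (shuffle): both inputs are redistributed by a hash of
    # the join key, so matching rows co-locate on the same node.
    lparts = [[] for _ in range(n_nodes)]
    rparts = [[] for _ in range(n_nodes)]
    for row in left:
        lparts[hash(row[left_key]) % n_nodes].append(row)
    for row in right:
        rparts[hash(row[right_key]) % n_nodes].append(row)
    for lp, rp in zip(lparts, rparts):
        yield from hash_join(lp, rp, left_key, right_key)

dim = [{"region": "EU", "name": "Europe"}, {"region": "US", "name": "States"}]
facts = [{"region": "EU", "qty": 5}, {"region": "US", "qty": 7}]
print(list(broadcast_join(dim, [facts], "region", "region")))
print(list(partitioned_join(dim, facts, "region", "region", n_nodes=2)))
```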
Other notes on Impala and Parquet include:
- Cloudera said that a total of ~1300 organizations have downloaded Impala, and at least ~50 of them are showing strong evidence of some kind of use (e.g., filing tickets about it).
- The main contributors to Parquet to date are Cloudera, Twitter and a French firm called Criteo.
- Impala does most of SQL-92, but not correlated subqueries, for which there isn’t much user demand anyway. (I think of correlated subqueries as something you need to run SAP or PeopleSoft enterprise apps.)
- Impala compiles queries into “assembly language” (i.e., machine code generated via LLVM) at run time. (A toy analogue of such runtime compilation follows this list.)
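For flavor, here is a toy Python analogue of that last point (my own illustration; Impala actually emits native code via LLVM): a filter predicate is compiled into a function once at query time, rather than being re-interpreted for every row.

```python
def compile_predicate(expr):
    # Generate and compile source for a row-filter function at "query
    # time", instead of interpreting the expression per row.
    src = "def pred(row):\n    return " + expr + "\n"
    namespace = {}
    exec(compile(src, "<query>", "exec"), namespace)
    return namespace["pred"]

pred = compile_predicate("row['qty'] > 10 and row['region'] == 'EU'")
rows = [{"qty": 12, "region": "EU"}, {"qty": 5, "region": "EU"}]
print([r for r in rows if pred(r)])  # -> [{'qty': 12, 'region': 'EU'}]
```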
And finally: When I wrote about how hard it is to develop a new DBMS, Impala was the top example I had in mind. I continue to think that Cloudera has a good understanding of the generalities of what it needs to add to Impala, as demonstrated by its allowing me to list some of the many Impala roadmap items above. But I also think Cloudera has an incomplete appreciation of just how hard some of those development efforts will turn out to be.
Related links:
- Dan Abadi and Dave DeWitt recently contributed observations about SQL-Hadoop architectures.
- DBMS concurrency is a classic case of Bottleneck Whack-A-Mole.