June 23, 2013

Impala and Parquet

I visited Cloudera Friday for, among other things, a chat about Impala with Marcel Kornacker and colleagues. Highlights included:

Impala is meant to someday be a competitive MPP (Massively Parallel Processing) analytic RDBMS.
At the moment, it is not one. For example, Impala lacks any meaningful form of workload management or query optimization.
While Impala will run against any HDFS (Hadoop Distributed File System) file format, claims of strong performance assume that the data is in Parquet …
- … which is the replacement for the short-lived Trevni …
- … and which for most practical purposes is true columnar.
Impala is also meant to be more than an RDBMS; Parquet and presumably in the future Impala can accommodate nested data structures.
Just as Impala runs against most or all HDFS file formats, Parquet files can be used by most Hadoop execution engines, and of course by Pig and Hive.
The Impala roadmap includes workload management, query optimization, data skipping, user-defined functions, hash distribution, two turtledoves, and a partridge in a pear tree.

Data gets into Parquet via batch jobs only — one reason it’s important that Impala run against multiple file formats — but background format conversion is another roadmap item. A single table can be split across multiple formats — e.g., the freshest data could be in HBase, with the rest is in Parquet.

To understand Parquet, it helps to recall that in HDFS there are big blocks, and then there also are ordinary blocks. The big blocks are the 1 gigabyte units that HDFS manages. These are also at this time the closest thing HDFS has to specific storage locations that systems — e.g. a database management execution engine such as Impala — can refer to. Within these big blocks, Parquet is PAX-like; i.e., it stores entire rows in the same big block, but does so a column at a time. However, the more ordinary-sized blocks that are units of I/O should contain data only from single columns; hence, in most cases it should be possible to retrieve only the specific columns that you want. Parquet’s compression scheme is:

One big block at time.
By default, dictionary up to a certain cardinality (I forgot to get the figure) …
… and RLE (Run-Length Encoding) after that.
Bit-packing as another option.

I forgot to ask whether Impala can operate on compressed data, but based on its compression scheme I’m guessing the answer is no.

In addition to ordinary tables, Parquet can handle nested data structures, ala Dremel. That is, a field can be array-valued, a cell in the array can itself be array-valued, and so on, with arrays all the way down. (Cloudera told me that Twitter’s data is nested 9 levels deep.) If I understood correctly, none of this interferes with single-valued cells being stored in a columnar way; not coincidentally, I got the impression that at least within each big block, there’s a consistent schema.

As for Impala joins and so on:

Impala does in-memory hash joins …
… on either a broadcast or partition (sort/merge) basis.
Choosing between join algorithms is the one thing Impala’s optimizer can do. (The rest is waiting on better stats/metadata in the base files.)
Joins are fully parallelized when it makes sense. Any Impala daemon/node can be in charge of client communication for a particular query; otherwise, there’s no special “head” node.
Different tables can have different replication factors. So in particular, you can replicate a small table to every node.

Other notes on Impala and Parquet include:

Cloudera said that a total of ~1300 organizations have downloaded Impala, and at least ~50 of them are showing strong evidence of some kind of use (e.g., filing tickets about it).
The main contributors to Parquet to date are Cloudera, Twitter and a French firm called Criteo.
Impala does most of SQL 92, but not correlated sub-queries, for which there isn’t much user demand anyway. (I think of correlated sub-queries as something you need to run SAP or PeopleSoft enterprise apps.)
Impala compiles queries into “assembly language” at run time.

And finally: When I wrote about how hard it is to develop a new DBMS, Impala was the top example I had in mind. I continue to think that Cloudera has a good understanding of the generalities of what it needs to add to Impala, as is demonstrated by them allowing me to list some of the many Impala roadmap items above. But I also think Cloudera has an incomplete appreciation of just how hard some of those development efforts will turn out to be.

Related links

Dan Abadi and Dave DeWitt recently contributed observations about SQL-Hadoop architectures.
DBMS concurrency is a classic case of Bottleneck Whack-A-Mole.

Categories: Cloudera, Columnar database management, Data models and architecture, Data warehousing, Database compression, Hadoop, HBase, MapReduce, Market share and customer counts, Specific users, SQL/Hadoop integration, Workload management

Subscribe to our complete feed!

Comments

11 Responses to “Impala and Parquet”

Hadoop news and rumors, June 23, 2013 | DBMS 2 : DataBase Management System Services on June 23rd, 2013 11:51 pm

[…] a variety of HDFS file formats. While some formats may be optimized for specific engines – e.g. Parquet for Impala – anything can work with more or less […]
Mark Callaghan on June 24th, 2013 11:59 am

I see more than a few correlated subqueries run on MySQL, I assume these are more popular elsewhere and those queries will benefit from a good optimizer. How many of the HDFS-based query processing frameworks need a lot of optimizer work?
Marcel Kornacker on June 24th, 2013 11:58 pm

Hi Mark,

I agree that to some extent the absence of a lot of correlated subqueries is caused by the lack of support for them. However, I also want to point out that this is not high on the list of features that are frequently requested by our customers.

Regarding your (rhetorical?) question about optimizers: cost-based optimization techniques would no doubt also be useful in the context of Hadoop query engines; more of that is certainly on the roadmap for Impala.

But the absence of this feature also has something to do with the absence of traditional table statistics. Gathering those in a Hadoop environment is a bit more challenging than in your average RDBMS, because there is no single gatekeeper through which all incoming data needs to be funneled. This is not an insurmountable problem, but it illustrates that we need to do more than just “reinvent the wheel”.
Marcel Kornacker on June 25th, 2013 2:27 am

Curt, you start out by saying that “Impala is meant to someday be a competitive MPP analytic RDBMS”, but that is not why Impala was created. The goal of Impala is to provide general SQL querying capability for the Hadoop ecosystem in an efficient manner. To that end, it already exceeds the capabilities of your average analytic DBMS in some aspects: you can combine data stored in multiple physical formats into a single logical table and query it efficiently with Impala. This matters, because it doesn’t force users to arrange everything around a single, central DBMS; instead users can create data using whatever framework and storage format is most appropriate for the task at hand. Having Parquet available means that your data can eventually be transformed into the most efficient physical format, but you can start querying it as soon as it shows up as, say, a csv file written by your web app.

Whether Impala in its current shape is “immature” is a matter of opinion, since it depends on a frame of reference. But the fact that we have more than just handful of customers who are actively submitting support tickets for it shows that it already provides useful functionality.
Curt Monash on June 25th, 2013 2:44 am

Marcel,

I get and support the Innovator’s Dilemma suggestion that you’re doing something different from the MPP RDBMS it will take you a long time to catch up with. And I thank you for spelling it out so emphatically.

Evsn so, I stand by what I wrote. 🙂
Hortonworks, Hadoop, Stinger and Hive | DBMS 2 : DataBase Management System Services on August 7th, 2013 2:52 am

[…] Providing a Hive-friendly HDFS file format, called ORC. To a first approximation, ORC sounds a lot like Cloudera Impala’s preferred format Parquet. […]
Layering of database technology & DBMS with multiple DMLs | DBMS 2 : DataBase Management System Services on January 31st, 2014 9:04 am

[…] — Hive, Impala, Stinger, Shark and so on (including […]
Spark and Databricks | DBMS 2 : DataBase Management System Services on February 2nd, 2014 4:09 pm

[…] Impala-loving Cloudera doesn’t plan to support Shark. Duh. […]
Cloudera, Impala, data warehousing and Hive | DBMS 2 : DataBase Management System Services on April 30th, 2014 10:03 pm

[…] Impala is meant to be good at what Hive is bad at – i.e., fast-response query. (Cloudera mentioned reliable 100 millisecond response times for at least one user.) […]
marees on June 4th, 2015 6:35 am

Does Impala support ORC file format? why not? What is the difference between ORC & Parquet?

Latest version of HIVE (0.14 & 1.0.0) supports updates/deletes if the HIVE table is in ORC format.
software development on July 1st, 2024 7:17 pm

It’s hard to find educated people for this topic, but you sound like you
know what you’re talking about! Thanks

Leave a Reply

Search our blogs and white papers

Monash Research blogs

DBMS 2 covers database management, analytics, and related technologies.
Text Technologies covers text mining, search, and social software.
Strategic Messaging analyzes marketing and messaging strategy.
The Monash Report examines technology and public policy issues.
Software Memories recounts the history of the software industry.

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.

Links
- Monash Research
- White Papers
Admin
- Log in

Impala and Parquet

Comments

Search our blogs and white papers

Monash Research blogs

User consulting

Vendor advisory

Monash Research highlights

Recent posts

Categories

Date archives

Admin