June 6, 2013

Dave DeWitt responds to Daniel Abadi

A few days ago I posted Daniel Abadi’s thoughts in a discussion of Hadapt, Microsoft PDW (Parallel Data Warehouse)/PolyBase, Pivotal/Greenplum Hawq, and other SQL-Hadoop combinations. This is Dave DeWitt’s response. Emphasis mine.

A key point that Daniel seems to be making is that parallel relational database systems are significantly faster than those that rely on the use of MapReduce for their query engines. I totally agree. In fact, several of us have been making the same point for years now (starting with the blog posts that Mike Stonebraker and I wrote more than 5 years ago). Time and time again relational database systems have been shown to be significantly faster. Last year we published a paper comparing PDW (w/o PolyBase) to Hive on two identical clusters (http://vldb.org/pvldb/vol5/p1712_avriliafloratou_vldb2012.pdf). We found that PDW was 3-10 times faster than Hive when executing the TPC-H benchmark.

Cloudera (Impala) and Pivotal (Hawq) seem to have come around to this same viewpoint. While both systems use HDFS for storage, neither uses MapReduce for executing query plans of relational operators. The Impala query engine was written from scratch in C++, and Hawq (apparently) uses a version of the Greenplum query engine adapted to read data directly from HDFS.

Hadapt, like Impala and Hawq, assumes (in general) that there is a single cluster in play and that they (as the DBMS vendor) can elect to use the resources (CPU, memory, and disk) on each node of the cluster in the way they think is best. For example, all three use the CPU and memory resources of the cluster to run a relational DB engine on each node. While Impala and Hawq leave the data in HDFS, Hadapt has concluded that they can get better performance by loading the data into PostgreSQL tables before executing most queries. In the case of Hadapt, one conceptually could think of PostgreSQL instances on each Hadoop “datanode” as a special type of “file format” that has the capability of not only storing local data, but also performing some query execution on that local chunk of data. Otherwise, the overall global execution is based on the traditional MapReduce engine accessing the data either from HDFS or from PostgreSQL. It is interesting that all three systems have, to varying degrees, concluded that many of their customers are willing to sacrifice the fault tolerance and ultimate scalability that MapReduce provides for performance.

Unlike Hadapt, Impala, and Hawq, in designing and building PolyBase our goal was not to build a general purpose scalable data warehousing solution. For that we already have SQL Server PDW. Rather, our goal was to extend the capabilities of PDW by allowing customers to use inexpensive Hadoop clusters while preserving the same T-SQL interface (which is used by a large number of third party applications and BI tools) to easily query and combine data regardless of where it lives and what format it is in. We expect that customers will primarily use the Hadoop cluster for their “cold” data or as their “digital shoebox”.

In addition, unlike Hadapt, Hawq, and Impala, PDW with PolyBase does not make a single-cluster assumption. Users may have two or more clusters, or two or more regions of the same cluster dedicated to different types of data. A deliberate design decision that we made at the beginning of the project was not to assume any control over a customer’s Hadoop cluster. Rather, PolyBase only assumes that it (1) can read and write HDFS files and (2) can submit MapReduce jobs to the cluster for execution. PolyBase is agnostic about what operating system the nodes of the Hadoop cluster are running, whose Hadoop distribution the cluster is running (Hortonworks, Cloudera, etc.), or whether the cluster is on-premise or in the cloud. We deliberately adopted this approach as we felt that it gave our customers the degree of flexibility that they need to be successful for a wide range of applications. In PolyBase, clusters are treated as first-class citizens by the system; the decision as to where to process parts of a query is determined by the system’s parallel query optimizer, based on the location of the data and capabilities of the cluster (e.g., # of nodes, CPU, memory of the cluster, load, etc.). In some situations, PDW data may be pushed into a Hadoop cluster to do the processing there instead. Furthermore, PolyBase can also be used to seamlessly query Hadoop-only data from two or more distinct clusters (e.g., one on-premise and another in the cloud) without combining with any RDBMS data.


5 Responses to “Dave DeWitt responds to Daniel Abadi”

  1. SQL-Hadoop architectures compared | DBMS 2 : DataBase Management System Services on June 6th, 2013 12:11 am

    [...] Dave DeWitt’s response to this post (June, 2013) [...]

  2. Tracking the progress of large-scale Query Engines - Strata on June 6th, 2013 10:04 am

    [...] of how PolyBase and Hadapt differ. (Update, 6/6/2013: Dave Dewitt of Microsoft Research, on the design of PolyBase.) (5) To thoroughly compare different systems, a generic benchmark such as the one just released, [...]

  3. Hubi on June 7th, 2013 10:02 pm

    Man those Microsoft folks are smart. Here I am all excited about what I can do with basic reporting and analytics out of the Reporting Services and Analysis Services and they’re bringing Hadoop to the masses.

  4. aaron on June 10th, 2013 9:56 am

    Sarcasm aside, the biggest issue in limiting analytics has been SW license cost. Microsoft disrupted that for reporting, ETL, and in SS2008 socket CPU license for DB, increasing overall analytics. Lightweight tools are better than no tools in many circumstances.

    Granting that PDW is not valuable to most, and Dewitt embraced more than he extended. This is limited, but expect downmarket migration both of PDW and MS Hadoop and more features in this over time.

  5. vstrien on July 17th, 2013 9:23 am

    One thing Abadi argues though is that “Polybase is much more similar to (..) Aster’s integration with Hadoop (in the way it uses external tables to connect to Hadoop) than it is to Hadapt.”

    In this response, dr. Dewitt only looks at the difference with Hadapt though. What are the main differences between Aster, Greenplum and Polybase (other than the seamless integration)?

Leave a Reply

Feed: DBMS (database management system), DW (data warehousing), BI (business intelligence), and analytics technology Subscribe to the Monash Research feed via RSS or email:


Search our blogs and white papers

Monash Research blogs

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.