A few days ago I posted Daniel Abadi’s thoughts in a discussion of Hadapt, Microsoft PDW (Parallel Data Warehouse)/PolyBase, Pivotal/Greenplum Hawq, and other SQL-Hadoop combinations. This is Dave DeWitt’s response. Emphasis mine.
A key point that Daniel seems to be making is that parallel relational database systems are significantly faster than those that rely on the use of MapReduce for their query engines. I totally agree. In fact, several of us have been making the same point for years now (starting with the blog posts that Mike Stonebraker and I wrote more than 5 years ago). Time and time again relational database systems have been shown to be significantly faster. Last year we published a paper comparing PDW (w/o PolyBase) to Hive on two identical clusters (http://vldb.org/pvldb/vol5/p1712_avriliafloratou_vldb2012.pdf). We found that PDW was 3-10 times faster than Hive when executing the TPC-H benchmark.
Cloudera (Impala) and Pivotal (Hawq) seem to have come around to this same viewpoint. While both systems use HDFS for storage, neither uses MapReduce for executing query plans of relational operators. The Impala query engine was written from scratch in C++, and Hawq (apparently) uses a version of the Greenplum query engine adapted to read data directly from HDFS.
Hadapt, like Impala and Hawq, assumes (in general) that there is a single cluster in play and that they (as the DBMS vendor) can elect to use the resources (CPU, memory, and disk) on each node of the cluster in the way they think is best. For example, all three use the CPU and memory resources of the cluster to run a relational DB engine on each node. While Impala and Hawq leave the data in HDFS, Hadapt has concluded that they can get better performance by loading the data into PostgreSQL tables before executing most queries. In the case of Hadapt, one could conceptually think of the PostgreSQL instances on each Hadoop “datanode” as a special type of “file format” that has the capability of not only storing local data, but also performing some query execution on that local chunk of data. Otherwise, the overall global execution is based on the traditional MapReduce engine accessing the data either from HDFS or from PostgreSQL. It is interesting that all three systems have, to varying degrees, concluded that many of their customers are willing to sacrifice, in exchange for performance, the fault tolerance and ultimate scalability that MapReduce provides.
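One way to picture the “PostgreSQL as a smart file format” idea: each datanode’s local engine can filter (and partially aggregate) its own chunk of data before the global layer combines the results. The following is a minimal Python sketch of that split, not Hadapt code; the function names, data, and predicate are invented for illustration, with the local engine standing in for a per-node PostgreSQL instance and the global step standing in for the MapReduce-based combine.

```python
# Conceptual sketch only (not Hadapt code). Each node-local engine stands in
# for a per-node PostgreSQL instance that can filter and partially aggregate
# its local chunk; the global step stands in for the MapReduce combine.

def local_scan(chunk, predicate):
    """Push the predicate down to the node-local engine: return only the
    matching rows plus a partial count, instead of shipping the raw chunk."""
    matching = [row for row in chunk if predicate(row)]
    return {"rows": matching, "partial_count": len(matching)}

def global_combine(partials):
    """Global execution merges per-node partial results (the MapReduce role)."""
    return sum(p["partial_count"] for p in partials)

# Three "datanodes", each holding a local chunk of (id, value) rows.
chunks = [
    [(1, 10), (2, 55)],
    [(3, 70), (4, 5)],
    [(5, 90)],
]
partials = [local_scan(c, lambda row: row[1] > 50) for c in chunks]
print(global_combine(partials))  # rows with value > 50 across all nodes -> 3
```

The point of the split is that only the (typically much smaller) filtered result crosses the boundary between the local engine and the global execution layer.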
Unlike Hadapt, Impala, and Hawq, in designing and building PolyBase our goal was not to build a general purpose scalable data warehousing solution. For that we already have SQL Server PDW. Rather, our goal was to extend the capabilities of PDW by allowing customers to use inexpensive Hadoop clusters while preserving the same T-SQL interface (which is used by a large number of third party applications and BI tools) to easily query and combine data regardless of where it lives and what format it is in. We expect that customers will primarily use the Hadoop cluster for their “cold” data or as their “digital shoebox”.
In addition, unlike Hadapt, Hawq, and Impala, PDW with PolyBase does not make a single-cluster assumption. Users may have two or more clusters, or two or more regions of the same cluster dedicated to different types of data. A deliberate design decision that we made at the beginning of the project was not to assume any control over a customer’s Hadoop cluster. Rather, PolyBase only assumes that it (1) can read and write HDFS files and (2) can submit MapReduce jobs to the cluster for execution. PolyBase is agnostic about what operating system the nodes of the Hadoop cluster are running, whose Hadoop distribution the cluster is running (Hortonworks, Cloudera, etc.), or whether the cluster is on-premises or in the cloud. We deliberately adopted this approach as we felt that it gave our customers the degree of flexibility that they need to be successful for a wide range of applications. In PolyBase, clusters are treated as first-class citizens by the system; the decision as to where to process parts of a query is determined by the system’s parallel query optimizer, based on the location of the data and capabilities of the cluster (e.g., # of nodes, CPU, memory of the cluster, load, etc.). In some situations, PDW data may be pushed into a Hadoop cluster to do the processing there instead. Furthermore, PolyBase can also be used to seamlessly query Hadoop-only data from two or more distinct clusters (e.g., one on-premises and another in the cloud) without combining with any RDBMS data.
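To make the optimizer’s placement decision concrete, here is a deliberately simplified Python sketch of a cost-based choice between pushing a query fragment to the Hadoop cluster and importing the matching data into PDW. This is a hypothetical toy model, not the actual PolyBase optimizer: the `Cluster` class, throughput figures, and cost formulas are all invented for illustration.

```python
# Hypothetical illustration, NOT the actual PolyBase optimizer: a toy cost
# model for the split-query-processing decision -- run a fragment as
# MapReduce where the data lives, or ship the matching rows into PDW and
# finish there. All names, numbers, and formulas are invented.
from dataclasses import dataclass

@dataclass
class Cluster:
    nodes: int
    gb_per_sec_per_node: float  # scan/filter throughput of one node
    load: float                 # current utilization, 0.0 (idle) to 1.0 (busy)

def pushdown_cost(data_gb: float, hadoop: Cluster) -> float:
    """Estimated seconds to run the fragment as MapReduce on the Hadoop cluster."""
    effective = hadoop.nodes * hadoop.gb_per_sec_per_node * (1.0 - hadoop.load)
    return data_gb / effective

def import_cost(data_gb: float, selectivity: float,
                transfer_gb_per_sec: float, pdw: Cluster) -> float:
    """Estimated seconds to ship the matching rows into PDW and finish there."""
    shipped = data_gb * selectivity
    return (shipped / transfer_gb_per_sec
            + shipped / (pdw.nodes * pdw.gb_per_sec_per_node))

def choose_plan(data_gb, selectivity, hadoop, pdw, transfer_gb_per_sec=1.0):
    """Pick the cheaper location, as a cost-based optimizer would."""
    if pushdown_cost(data_gb, hadoop) < import_cost(
            data_gb, selectivity, transfer_gb_per_sec, pdw):
        return "pushdown-to-hadoop"
    return "import-to-pdw"

hadoop = Cluster(nodes=16, gb_per_sec_per_node=0.5, load=0.0)
pdw = Cluster(nodes=8, gb_per_sec_per_node=1.0, load=0.0)

# A fragment that keeps half the data is cheaper to run where the data lives;
# a highly selective fragment is cheap to ship and finish in PDW.
print(choose_plan(1000, 0.5, hadoop, pdw))   # -> pushdown-to-hadoop
print(choose_plan(1000, 0.01, hadoop, pdw))  # -> import-to-pdw
```

The sketch captures only the shape of the trade-off: data location, cluster capacity, and load all enter the cost estimate, so the same query can be placed differently as conditions change.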