The genesis of this post is:
- Dave DeWitt sent me a paper about Microsoft Polybase.
- I argued with Dave about the differences between Polybase and Hadapt.
- I asked Daniel Abadi for his opinion.
- Dan agreed with Dave, in a long email …
- … that he graciously permitted me to lightly edit and post.
I love my life.
Per Daniel (emphasis mine):
I basically agree with what MSFT says in the paper about how Polybase differs from Hadapt. Obviously at a high level they are similar — both systems can access data stored in Hadoop and split query processing between the DBMS and Hadoop. But once you start to get into the details, the systems diverge dramatically. Polybase is much more similar to Greenplum and Aster’s integration with Hadoop (in the way it uses external tables to connect to Hadoop) than it is to Hadapt. In fact, I would summarize Polybase as a more sophisticated version of SQL-H (but without the HCatalog integration), with more query processing pushdown to Hadoop, and a cost-based optimizer that rigorously decides when and how much query processing to push to Hadoop.
The basic difference between Polybase and Hadapt is the following. With Polybase, the primary interface for the user is the MPP database software (and DBMS storage) that Microsoft is selling. Hadoop is viewed as a secondary source of data — if you have a dataset stored inside Hadoop instead of the database system for whatever reason, then the database system can access that Hadoop data on the fly and include it in query processing alongside data that is already stored inside the database system. However, the user must know in advance that she might want to query the data in Hadoop — she must register this Hadoop data with the MPP database through an external table definition (and ideally statistics should be generated in advance to help the optimizer). Furthermore, the Hadoop data must be structured, since the external table definition requires this (so you can’t really access arbitrary unstructured data in Hadoop). The same is true for SQL-H and Hawq — they can all access data in Hadoop (in particular data stored in HDFS), but some sort of structured schema must be defined for the database to understand how to access that data via SQL. So, bottom line: Polybase/SQL-H/Hawq let you dynamically get at data in Hadoop/HDFS that could theoretically have been stored in the DBMS all along, but for some reason is being stored in Hadoop instead of the DBMS.
HOWEVER, take a look at page 10 of the paper. There are 6 graphs on this page (the same trend shows up in every graph in the paper, but it is most obvious in the 6 graphs on page 10). The graphs break down query time by where the time is spent — green is time spent in the DBMS, blue is time spent in Hadoop, and red is time spent importing the data into the DBMS from Hadoop. It is immediately obvious that the vast majority of time in all these graphs goes to the (red) import stage. The obvious conclusion is that while Polybase lets you access data in Hadoop, the user would have been far better off if the data had been in the DBMS all along (so that this data importing would not be necessary). The same is true for SQL-H and Hawq — although they do not refer to the process of getting data out of HDFS and into their execution engines as a ‘data import’, there is still a data movement process (with its associated overhead), and it is well known that the original Aster and Greenplum DBMSs are faster than the versions of Aster and Greenplum that access data stored in HDFS.
With Hadapt, the picture is completely different. If the data is structured enough to fit in the DBMS, it is loaded into the DBMS storage on each Hadoop/HDFS node. We pay this load cost in advance (obviously, with invisible loading, the story gets more interesting, but let’s ignore that for now to make the comparison easier). Therefore Hadapt has a one-time load cost, but then every subsequent query does not have to worry about paying the high data import cost that you see in the Microsoft paper.
Hence, you will end up seeing Hadapt and Polybase/SQL-H/Hawq used for very different use cases. Polybase/SQL-H/Hawq are appropriate for accessing data in Hadoop in an exploratory/one-off fashion. If the data is not accessed frequently from DBMS queries, then it is appropriate to leave it in Hadoop and access it via Polybase/SQL-H/Hawq those few times that it is needed. However, if the data is repeatedly accessed from the DBMS, it is clearly far better to get it out of Hadoop and into the MPP database system.
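To make that tradeoff concrete, here is a back-of-the-envelope cost model (my illustration — the cost figures and function names are hypothetical, not taken from the paper or from Dan’s email). The external-table approach pays a data movement cost on every query, while the load-in-advance approach pays it once:

```python
# Back-of-the-envelope model of the repeated-import vs. one-time-load tradeoff.
# All cost numbers are hypothetical, in arbitrary time units.

def total_cost_external(queries: int, import_cost: float, query_cost: float) -> float:
    """Polybase/SQL-H/Hawq style: pay the HDFS-to-DBMS movement cost on every query."""
    return queries * (import_cost + query_cost)

def total_cost_preloaded(queries: int, load_cost: float, query_cost: float) -> float:
    """Hadapt style: pay a one-time load into DBMS storage, then query at DBMS speed."""
    return load_cost + queries * query_cost

IMPORT = 100.0   # per-query data movement out of HDFS (hypothetical)
LOAD = 150.0     # one-time load into DBMS storage (hypothetical)
QUERY = 10.0     # in-DBMS execution cost per query (hypothetical)

# Up-front loading wins as soon as queries * IMPORT exceeds LOAD —
# here that is from the second query onward.
for n in (1, 2, 5):
    print(n, total_cost_external(n, IMPORT, QUERY), total_cost_preloaded(n, LOAD, QUERY))
```

With these (made-up) numbers, external access is cheaper for a single exploratory query, but the preloaded path wins from the second query on — which is exactly the exploratory/one-off vs. repeated-access split described above.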
With Hadapt, the distinction of whether data should be stored in raw HDFS or the DBMS storage on each HDFS node is not based on how often it is accessed, but rather on whether the data is structured/semi-structured or “truly unstructured” (I have to use the modifier “truly” for unstructured data now that DeWitt redefined the meaning of “unstructured” in his paper to mean any data in Hadoop :D). Structured/semi-structured data goes in the DBMS storage and (truly) unstructured data goes in raw HDFS. Through Hadapt’s SQL extensions for unstructured data (including full text search), queries that span HDFS and the DBMS are not simply about importing data from HDFS into the DBMS, as in the Polybase paper, but rather about true structured/unstructured query integration.
Hadapt does not use MapReduce for the first part of query processing in order to reduce data import cost, the way Polybase/SQL-H do. Rather, it uses MapReduce to manage long-running queries, in order to get dynamic query scheduling and runtime fault tolerance. Even if all the data a query accesses is already in DBMS storage, Hadapt might still use the MapReduce engine if the query is predicted to be long-running. (For short queries, as you already know, Hadapt doesn’t use MapReduce, but instead uses the new IQ engine.) This is fundamentally different from Polybase/SQL-H/Hawq — none of which use Hadoop’s MapReduce engine if all the input data is already in the DBMS.
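A purely illustrative sketch of that dispatch rule — the engine names, the threshold, and the predictor interface are my invention, not Hadapt’s actual API. The point is only that the routing decision is driven by predicted runtime, not by where the input data lives:

```python
# Illustrative sketch of runtime-based engine selection, per the description above.
# The threshold value and function names are hypothetical.

THRESHOLD_SECONDS = 60.0  # assumed cutoff between "short" and "long-running"

def choose_engine(predicted_runtime_s: float) -> str:
    """Pick an execution engine based on predicted query runtime.

    Long-running queries go to MapReduce, for dynamic scheduling and
    runtime fault tolerance; short queries go to the interactive engine
    (the "IQ engine" in the text) — even when all the input data is
    already in DBMS storage.
    """
    return "mapreduce" if predicted_runtime_s >= THRESHOLD_SECONDS else "iq"

print(choose_engine(5.0))     # short exploratory query
print(choose_engine(3600.0))  # hour-long analytical query
```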
Related links:
- Dave DeWitt’s response to this post (June, 2013)
- SQL-H and Hadapt (October, 2012)
- Cloudera Impala (November, 2012)
- Dan Abadi regarding Hawq (February, 2013)