February 25, 2013

Greenplum HAWQ

My former friends at Greenplum no longer talk to me, so in particular I wasn’t briefed on Pivotal HD and Greenplum HAWQ. Pivotal HD seems to be yet another Hadoop distribution, with the idea that you use Greenplum’s management tools. Greenplum HAWQ seems to be Greenplum tied to HDFS.

The basic idea seems to be much like what I mentioned a few days ago  — the low-level file store for Greenplum can now be something else one has heard of before, namely HDFS (Hadoop Distributed File System, which is also an option for, say, NuoDB). Beyond that, two interesting quotes in a Greenplum blog post are:

When a query starts up, the data is loaded out of HDFS and into the HAWQ execution engine.

and

In addition, it has native support for HBase, supporting HBase predicate pushdown, hive[sic] connectivity, and offering a ton of intelligent features to retrieve HBase data.

The first sounds like the invisible loading that Daniel Abadi wrote about last September on Hadapt’s blog. (Edit: Actually, see Daniel’s comment below.) The second sounds like a good idea that, again, would also be a natural direction for vendors such as Hadapt.
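
For readers who haven’t run into the term, “predicate pushdown” just means evaluating filters where the data lives, so that only matching rows are shipped to the query engine. A minimal sketch of the idea in plain Python; the names are hypothetical and nothing here is actual HAWQ or HBase code:

    # Toy model of a remote HBase region holding many rows.
    REGION = [{"row_key": "u%d" % i, "country": ("US" if i % 3 else "DE"), "spend": i}
              for i in range(10000)]

    def scan_no_pushdown(region, predicate):
        # Every row crosses the network; the query engine filters afterwards.
        shipped = list(region)
        return [r for r in shipped if predicate(r)], len(shipped)

    def scan_with_pushdown(region, predicate):
        # The predicate is shipped to the storage layer; only matches cross the network.
        shipped = [r for r in region if predicate(r)]
        return shipped, len(shipped)

    wants = lambda r: r["country"] == "US" and r["spend"] > 9000

    rows_a, moved_a = scan_no_pushdown(REGION, wants)
    rows_b, moved_b = scan_with_pushdown(REGION, wants)
    assert rows_a == rows_b  # same answer either way
    print("without pushdown: %d rows shipped; with pushdown: %d" % (moved_a, moved_b))

The answer is identical either way; the difference is how much data has to move before the engine ever sees it, which is where pushing predicates into HBase pays off.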

Comments

14 Responses to “Greenplum HAWQ”

  1. Sébastien Derivaux on February 25th, 2013 5:25 pm

    HAWQ is not really another Hadoop distribution. The big part of the product is not HDFS but Greenplum/PostgreSQL. I would say that the Hadoop part is mainly marketing. Still, how many other products allow you to use window functions on Hadoop?

    It will probably be a bit less effective than a straight Greenplum DB, but with seamless integration with Hadoop’s other storage formats.

    Still, a lot of questions remain open. Selling a probably expensive analytic DBMS in a market where things should be cheap seems quite counterintuitive.

  2. Curt Monash on February 25th, 2013 6:15 pm

    Greenplum says there is ALSO a Hadoop distribution.

    Hadapt is the obvious alternative for full-featured SQL over/through Hadoop on the same cluster but in different files. Differences start with Greenplum’s MPP Postgres being more mature than Hadapt’s — including a columnar option! — but Hadapt’s being better integrated with Hadoop MapReduce.

    If Greenplum actually works over HBase files earlier than Hadapt does, that’s another difference.

  3. GregW on February 26th, 2013 3:44 am

    I’m curious whether stored procedures are supported in Hawq. They weren’t in Hadapt last time I checked.

    (The visualization infrastructure we use on top of SQL implicitly favors their use.)

  4. Daniel Abadi on February 26th, 2013 10:14 am

    Hey Curt,

    First of all, I strongly agree with the direction the industry is moving. When my lab wrote the HadoopDB paper 5 years ago (which became Hadapt), we were already arguing that it makes more sense to bring SQL to Hadoop than to have two systems with a connector between them. However, even just one year ago, I was still the only one arguing against the connector approach (http://hadapt.com/why-database-to-hadoop-connectors-are-flawed/). Then, finally, Cloudera agreed and released Impala, and now Greenplum is agreeing and releasing Hawq. The SQL-directly-on-Hadoop approach is definitely the way forward, and connectors will eventually die away.

    However, I have to disagree with your statement that the Greenplum approach is similar to invisible loading. The invisible loading algorithm “invisibly” rearranges data from a slow, file-system-style data layout to a fast, relational data layout. That is, if you issue queries over the same data set repeatedly, you should see faster performance over time. Hawq’s approach is fundamentally different. The same data is pulled from HDFS into Greenplum’s execution engine every time — i.e. if it’s stored as a flat file in HDFS, then when the query is over, it will still be stored as a flat file in HDFS. Scott Yara (Greenplum’s founder) openly admits that their SQL-on-Hadoop approach is slower than a pure Greenplum MPP database (see his quote in the GigaOM piece on Hawq — gigaom.com/2013/02/25/emc-to-hadoop-competition-see-ya-wouldnt-wanna-be-ya/). If they had invisible loading, then the performance of queries would steadily increase until the performance of regular Greenplum MPP was reached.
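
    A toy way to see the difference (hypothetical names; this is not actual Hadapt or HAWQ code, just the shape of the two behaviors):

        import time

        def scan_flat_file_in_hdfs():
            # Stand-in for parsing a flat file out of HDFS: the slow, per-query cost.
            time.sleep(0.5)
            return [{"id": i, "value": i * i} for i in range(100000)]

        class ReloadEveryQuery:
            # Hawq-style: the same flat file is pulled into the engine on every query,
            # and when the query is over the data is still just a flat file in HDFS.
            def query(self, predicate):
                rows = scan_flat_file_in_hdfs()
                return [r for r in rows if predicate(r)]

        class InvisibleLoading:
            # Invisible loading: as queries touch the data, it migrates into a fast,
            # relational layout, so repeated queries over the same data speed up.
            def __init__(self):
                self._relational_copy = None
            def query(self, predicate):
                if self._relational_copy is None:
                    self._relational_copy = scan_flat_file_in_hdfs()  # only the first query pays
                return [r for r in self._relational_copy if predicate(r)]

        store = InvisibleLoading()
        store.query(lambda r: r["id"] < 10)  # slow: pays the HDFS scan
        store.query(lambda r: r["id"] < 10)  # fast: served from the relational copy

    (The real algorithm migrates data incrementally rather than all at once, but the visible effect is the same: repeat queries over the same data get faster, whereas re-reading the flat file on every query never does.)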

    GregW, with our most recent release, Hadapt provides an HDK (Hadapt development kit) with which you can develop arbitrary analytical procedures over a Hadapt DBMS (see http://hadapt.com/hadapt-accelerates-hadoops-move-to-production-with-interactive-applications/).

  5. DJ on March 3rd, 2013 8:26 pm

    It was not clear whether HAWQ is a read-only system or whether it can write back to Hadoop.

  6. HAWQ and Pivotal HD – Is it Hadoop? | Database Fog Blog on March 5th, 2013 9:24 am

    […] Greenplum HAWQ (dbms2.com) […]

  7. HAWQ Performance Marketing | Database Fog Blog on March 6th, 2013 10:32 am

    […] Greenplum HAWQ (dbms2.com) […]

  8. Mark Burnard on March 11th, 2013 1:32 am

    @DJ, it’s write-enabled for HDFS from an SQL console; that’s part of what HAWQ does. In fact, it looks to me like the next evolution of Greenplum’s “external tables” for HDFS, which support read/write access to HDFS but through an intermediate instance of Greenplum. Pivotal HD seems to just do away with the need to have an instance of Greenplum running, and the entire dataset lives in HDFS. I suppose therefore that it would support stored procs in standard SQL, but I’ve not read that far ahead yet…

  9. Krzysztof on March 18th, 2013 10:58 am

    @DA, @MB, this sounds like the recently implemented writeable foreign data wrappers in Postgres 9.3 (or any other RDBMS supporting writeable FDWs). If somebody writes an HDFS adapter and uses it on, say, Postgres-XC for distribution (once that’s synced with the 9.3 codebase), what would be the advantage of HAWQ, given that in both cases one must write some mapping logic, and some kind of on-the-fly pivoting is necessary?

  10. DJ on March 21st, 2013 3:34 pm

    @MB, thanks for the info. Another question: does HAWQ shard on HDFS? HDFS is, by itself, a distributed file system, so the bits and pieces are scattered across the cluster. If HAWQ shards/distributes like Greenplum, that would be a moot point, wouldn’t it? If HAWQ does not shard, how does it achieve MPP without moving massive bits and pieces across the HDFS cluster?

  11. Brian on September 15th, 2013 7:31 pm

    +1 for DJ’s question above; I would be very interested to know how the distribution of data is handled.

  12. ravi on April 22nd, 2014 5:42 am

    Can I use HAWQ to query data in a Greenplum database?

  13. Greenplum is being open sourced | DBMS 2 : DataBase Management System Services on February 18th, 2015 9:50 pm

    […] only other bit of newly open-sourced stuff I find interesting is HAWQ. Redis was already open source, and I’ve never been persuaded to care about […]

  14. SQL-Hadoop architectures compared | DBMS 2 : DataBase Management System Services on May 5th, 2015 1:39 am

    […] Dan Abadi regarding Hawq (February, 2013) […]
