July 7, 2009

Daniel Abadi has a theory about ParAccel

When I was at SIGMOD last week, ParAccel and its SIGMOD talk were mentioned several times, always in puzzled and at least slightly unflattering terms.  (Typical comment: “Why did they present a paper about that? We were doing the same thing in our company years ago.”) That doesn’t prove much per se, since most of the mentions were by competitors and/or Vertica-affiliated academics, and since my own unflattering ParAccel-related comments were rather fresh at the time.

But now Daniel Abadi has done a brilliant, detailed, speculative analysis of ParAccel’s publications.  Here’s the meat, emphasis mine:

(1) Why did they configure their TPC-H application with such a high amount of disk I/O throughput capability when they are a column-store? (Stonebraker’s question)
(2) Why did queries spend seemingly 6X more time doing I/O than a column-store should have to do?
(3) Why are they worried about queries with thousands of joins?
(4) Why do they think TPC-H/TPC-DS queries have 42 joins?

And then a theory that answers all four questions at the same time came to me.
Perhaps ParAccel directly followed my advice (see option 1) on “How to create a new column-store DBMS product in a week”. They’re not a column-store. They’re a vertically partitioned row-store (this is how column-stores were built back in the 70s before we knew any better). Each column is stored in its own separate table inside the row-store (PostgreSQL in ParAccel’s case). Queries over the original schema are then automatically rewritten into queries over the vertically partitioned schema and the row-store’s regular query execution engine can be used unmodified. But now, every attribute accessed by the query now adds an additional join to the query plan (since the vertical partitions for each column in a table have to be joined together).

This immediately explains why they are worried about queries with hundreds to thousands of joins (questions 3 and 4). But it also explains why they seem to be doing much more I/O than a native column-store. Since each vertical partition is its own table, then each tuple in a vertical partition (which contains just one value) is preceded by the row-store’s tuple header. In PostgreSQL this tuple header is on the order of 27 bytes. So if the column width is 4 bytes, then there is a factor of 7 extra space used up for the tuple header relative to actual user data. And if the implementation is super naive, they also will need an additional 4 bytes to store a tuple identifier for joining vertical partitions from the same original table with each other. This answers questions 1 and 2, as the factor of 6 worse I/O efficiency is now obvious.
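For readers who want to check the arithmetic in that theory, here is a rough back-of-the-envelope sketch. The 27-byte header, the 4-byte column, and the optional 4-byte tuple identifier are taken from the quoted text; the function names and the per-value byte model are my own illustrative assumptions, not anything ParAccel or PostgreSQL publishes, and the sketch says nothing about what ParAccel actually does.

```python
# Back-of-the-envelope model of the vertical-partitioning overhead in the quote.
# Byte counts come from the quoted text; names and structure are hypothetical.

TUPLE_HEADER_BYTES = 27  # approximate row-store tuple header, per the quote
TUPLE_ID_BYTES = 4       # extra row identifier in the "super naive" case


def bytes_per_value(column_width, with_tuple_id=False):
    """Bytes stored for one value when each column lives in its own table."""
    extra = TUPLE_ID_BYTES if with_tuple_id else 0
    return TUPLE_HEADER_BYTES + extra + column_width


def overhead_ratio(column_width, with_tuple_id=False):
    """Non-data bytes (header, optional tuple id) per byte of user data."""
    total = bytes_per_value(column_width, with_tuple_id)
    return (total - column_width) / column_width


def joins_to_reassemble(columns_accessed):
    """Joins needed to stitch one table's vertical partitions back together."""
    return max(columns_accessed - 1, 0)


print(overhead_ratio(4))                      # header only, 4-byte column
print(overhead_ratio(4, with_tuple_id=True))  # header plus tuple identifier
print(joins_to_reassemble(10))                # query touching 10 columns
```

With a 4-byte column the header alone works out to 6.75 bytes of overhead per byte of data, which is roughly the "factor of 7" in the quote, and a query touching 10 columns of a single table would need 9 extra joins under this model.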

It will be interesting to see whether ParAccel comments, but even if it does, I wouldn’t necessarily take ParAccel’s statements as dispositive.  For example (and illustrative of my view of ParAccel’s trustworthiness), I believe ParAccel’s competitors who tell me that ParAccel’s claim to have won or at least tied all POCs on performance is flat-out untrue.

Comments

30 Responses to “Daniel Abadi has a theory about ParAccel”

  1. Jerome Pineau on July 7th, 2009 7:20 pm

    I don’t think ParAccel has the option of not responding and addressing this news item. Maybe it turns out they are a hybrid shard/vertical system, but then that would make them a lot like XSPRADA and that would surprise me a GREAT deal! 🙂
    The ADBMS world seems split between those who do columns and those who don’t lately (with a strong bias towards the latter as evidenced by the Vertica marketing machine and SAP’s latest paper). I haven’t seen anyone laying claim to yet another way to architect these systems, besides our own architecture. As we say in French, “au pays des aveugles, les borgnes sont rois”, meaning in the land of the blind, the one-eyed man is king (but it sounds a lot nicer in French!).

  2. Curt Monash on July 7th, 2009 9:58 pm

    You can be sure they’ll address it in sales situations. Specifically, they’ll probably try to deflect it by pointing out Abadi’s affiliation with Vertica, by misrepresenting the nature of my falling out with them, and so on. And they’ll have salesmen saying “No, that’s not true” who honestly can’t be expected to back up their statements one way or the other.

    But I wouldn’t expect the real truth to be widely known until ParAccel’s next round of employee departures.

  3. Jerome Pineau on July 7th, 2009 11:38 pm

    Are you implying another round of departures (willing or not) are imminent at ParAccel? This would be surprising given their recent funding would it not?

  4. Jerome Pineau on July 7th, 2009 11:39 pm

    *is* imminent — sorry I’m so anal about grammar 🙂

  5. Curt Monash on July 8th, 2009 12:17 am

    No. I’m implying we may have to wait a while for the truth.

  6. Stavros Harizopoulos on July 8th, 2009 1:30 am

    Hi Curt,

    I am afraid Daniel misread the TPC-H results, hence the confusion. The real numbers show that ParAccel is indeed an efficient column store.

    For the benefit of your readers, I’m copying my reply to Daniel’s blog:

    Hi Daniel,

    I enjoyed your post but I am afraid you misread ParAccel’s TPC-H results. Your quoted number of 275 seconds for TPC-H Query 1 is from the “throughput” test that has 10 concurrent streams of queries. What you should be looking at is the “power” test, which is their “Stream ID 00” numbers. You can see that Query 1 takes 51.4 secs which is very close to your calculations. Therefore, it seems to me, they are indeed a column-store (and hopefully this should answer your question #2).

    About Stonebraker’s question (your question #1): When it comes to joining two (or more) tables where the projected columns do not fit in your RAM workspace, chances are that you will be using a 2-pass algorithm, and therefore you will be needing high disk I/O throughput. As you know, for these types of ad-hoc joins, column-stores can only achieve modest improvements over row-stores (after you factor out any compression-related benefits — we showed this in our Sigmod 2009 paper on SSDs and column page layouts http://bit.ly/SSD_sigmod09). Whether column-stores really need high disk I/O in real-life workloads, is, of course, a different question.

    Your question #4: From the paper, it is clear that they know that TPC-DS does not have 42 tables (it has 24). They explicitly state that they added tables to show the benefits of their algorithm. If you look at their results, they only perform better for a large number of tables, hence the modification of the original specs. I think this is fair in a paper. Is 42 tables realistic? Again, a different question. (And I think they only have 4 pages because it is an industrial paper, which required that they only submit an abstract for reviewing; it is hard to come up with 14 pages between notification and the camera-ready deadline, which are less than one month apart.)

    Finally, about your question #3, why they worry about 1000s of joins? Beats me.

    –stavros

  7. Wow, the TPC-H speculation continues! « Hype Cycles on July 8th, 2009 11:13 am

    […] yet done. Posts by Daniel Abadi (interesting analysis but it seems simplistic at first blush) and reflections on Curt Monash’s blog are proving to be amusing, to say the […]

  8. Daniel Abadi on July 8th, 2009 11:47 am

    Curt,

    Thanks for the kind words.

    FYI: I have responded to Stavros’ excellent comment in the thread on my blog.

  9. Anonymous on July 8th, 2009 4:09 pm

    I used to read your blog selectively as a way to weed out the facts. It’s been very helpful in the past (thank you!), but I’m now having my doubts regarding your integrity. Some of your recently stated opinions wrt certain topics/companies often sound like nothing more than venting.

    For instance, you clearly have a personal grudge due to some issues yet to be resolved w/ ParAccel employee(s), and any comments you make wrt ParAccel seem to become the real topic. You appeared to parrot DATAllegro without hesitation when they were paying you. Now Vertica pays you and you have only positive things to say about them. ParAccel does not pay you and now we see the result on a continual basis.

    Can you please just get back to blogging without all the noise? It’s becoming pretty exhausting and annoying to be quite honest.

  10. Curt Monash on July 8th, 2009 7:30 pm

    Perhaps you’re thinking of a different blog than mine? I’ve posted plenty of skeptical or negative things about DATAllegro, Vertica, and many other perennial clients. (DATAllegro never stopped being a client; they just turned into Microsoft. I sent advice email off to Stuart Frost just a few minutes ago.)

    And by the way, one of my clear biases is against cowards. At least I have the guts to attach my name to my opinions, some of which turn out to be correct and some of which don’t, and to disclose information that might help people judge my interests, biases, etc. None of that can be said of you, so what in the world gives you the right to comment on my integrity?

  11. Progress in figuring out what ParAccel is doing | DBMS2 -- DataBase Management System Services on July 8th, 2009 7:46 pm

    […] arguments for column stores. Barry’s second post, however, was in direct response to Daniel Abadi’s speculation about ParAccel’s architecture.  That post also promises a follow-up addressing the TPC-H in a more substantive […]

  12. Anonymous on July 9th, 2009 2:26 am

    There has already been a recent – last week – round of layoffs at Paraccel in the engineering group. Not in spite of the new funding, but BECAUSE of the new round of funding.

  13. Curt Monash on July 9th, 2009 2:41 am

    That seems implausible, and apt to be misleading even if by some chance it is technically true. “$22 million” for purposes that include “to expand development” may not mean exactly what it sounds like in the press release, but it probably also doesn’t mean “cut back on development.”

    The last time I heard a lot of fuss about ParAccel layoffs, it turned out to be only 4-5 people or so. At least, that was Kim Stanick’s story, and nobody ever disproved it, nor did the details of the information passed on to me really contradict what she said.

  14. Anonymous on July 9th, 2009 3:25 am

    It was 5 people. Fortunately, I was not one of them.

  15. Curt Monash on July 9th, 2009 4:39 am

    Are you saying the mid-2008 layoff was 5 people, or the current one was, or both?

    Thanks,

    CAM

  16. Anonymous on July 9th, 2009 11:06 am

    This current one – last week.

  17. Jerome Pineau on July 9th, 2009 3:04 pm

    Ok well that answers my question as to where that rumor came from. I take it Mr. Anonymous is a ParAccel engineer?

  18. Curt Monash on July 9th, 2009 4:30 pm

    That’s the implication. I haven’t checked IP addresses to confirm (no way I could disconfirm, of course).

  19. Curt Monash on July 9th, 2009 6:53 pm

    This tweet says pretty much the same thing about the layoff:

    http://twitter.com/i_integr8/status/2558132370

  20. Jerome Pineau on July 9th, 2009 7:21 pm

    Yeah I see that but this gentleman is based in Austin it seems (and working at Aruna?) so I’m not sure what he’s talking about and who that “CTO friend” he refers to might be.

    In either case, if they cut some engineering deadwood post-funding, that’s not really unusual (in my experience anyway) – I would imagine they’re now flying at a pretty stable flight level engineering wise. It’s on the sales/marketing/PR angle I suppose they will hit hard now. Of course this is all speculation on my part.

  21. Jerome Pineau on July 9th, 2009 7:23 pm

    Oh and 10% being 5 people, meaning they have 45 engineers left now? I’d say that’s probably more than sufficient – Of course I’m biased given what we accomplished at XSPRADA with a 1/6th of that! 🙂 — But then we didn’t have the deep pockets either, and that’s when a company gets reaaally creative.

  22. Curt Monash on July 9th, 2009 8:24 pm

    Anyhow, a small layoff in an apparently well-funded company isn’t a big deal.

  23. Ramakrishna Vedantam on July 10th, 2009 7:10 am

    Looks like ParAccel has taken these questions very seriously. They have answered them here: http://paraccel.com/data_warehouse_blog/?p=57

    Rama Krishna Venkata Vedantam
    Telecom Technology Excellence Group
    Tata Consultancy Services

  24. Curt Monash on July 10th, 2009 1:54 pm

  25. Chano Ochoa on July 11th, 2009 11:06 am

    En español – “en el valle de los ciegos, el tuerto es el rey”

  26. Faster or Free « Hype Cycles on September 22nd, 2009 10:25 am

    […] agree with him that too much has been made about whether a system is a columnar system or a truly columnar system or a vertically partitioned […]

  27. Witold Litwin on March 31st, 2010 3:40 pm

    If industrial folks took the time to publish research-quality papers about the internals behind their “fantastic” new results, this whole debate would not exist. We have a lot of conferences for that. The fact that industrial papers are refereed only by abstract, at even our best database conferences (the one mentioned above and others), is a clear anomaly, and rather shameful given the stringent criteria applied at the same conferences to research papers.

  28. Witold Litwin on March 31st, 2010 3:44 pm

    Folks, does anyone know how ParAccel technically manages data partitioning? Hash or range or k-d splits, static/dynamic…?

  29. Faster or Free | Pizza And Code on October 7th, 2011 4:25 am

    […] agree with him that too much has been made about whether a system is a columnar system or a truly columnar system or a vertically partitioned […]

  30. Wow, the TPC-H speculation continues! | Pizza And Code on October 7th, 2011 4:28 am

    […] yet done. Posts by Daniel Abadi (interesting analysis but it seems simplistic at first blush) and reflections on Curt Monash’s blog are proving to be amusing, to say the […]
