Daniel Abadi has a theory about ParAccel
When I was at SIGMOD last week, ParAccel and its SIGMOD talk were mentioned several times, always in puzzled and at least slightly unflattering terms. (Typical comment: “Why did they present a paper about that? We were doing the same thing in our company years ago.”) That doesn’t prove much per se, since most of the mentions were by competitors and/or Vertica-affiliated academics, and since my own unflattering ParAccel-related comments were rather fresh at the time.
But now Daniel Abadi has done a brilliant, detailed, speculative analysis of ParAccel’s publications. Here’s the meat, emphasis mine:
(1) Why did they configure their TPC-H application with such a high amount of disk I/O throughput capability when they are a column-store? (Stonebraker’s question)
(2) Why did queries spend seemingly 6X more time doing I/O than a column-store should have to do?
(3) Why are they worried about queries with thousands of joins?
(4) Why do they think TPC-H/TPC-DS queries have 42 joins?
And then a theory that answers all four questions at the same time came to me. Perhaps ParAccel directly followed my advice (see option 1) on “How to create a new column-store DBMS product in a week”. They’re not a column-store. They’re a vertically partitioned row-store (this is how column-stores were built back in the 70s before we knew any better). Each column is stored in its own separate table inside the row-store (PostgreSQL in ParAccel’s case). Queries over the original schema are then automatically rewritten into queries over the vertically partitioned schema and the row-store’s regular query execution engine can be used unmodified. But now every attribute accessed by the query adds an additional join to the query plan (since the vertical partitions for each column in a table have to be joined together).
This immediately explains why they are worried about queries with hundreds to thousands of joins (questions 3 and 4). But it also explains why they seem to be doing much more I/O than a native column-store. Since each vertical partition is its own table, each tuple in a vertical partition (which contains just one value) is preceded by the row-store’s tuple header. In PostgreSQL this tuple header is on the order of 27 bytes. So if the column width is 4 bytes, then there is a factor of 7 extra space used up for the tuple header relative to actual user data. And if the implementation is super naive, they also will need an additional 4 bytes to store a tuple identifier for joining vertical partitions from the same original table with each other. This answers questions 1 and 2, as the factor of 6 worse I/O efficiency is now obvious.
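To make the arithmetic in that theory concrete, here is a minimal Python sketch. The 27-byte tuple header, 4-byte column width, and 4-byte tuple identifier are the figures from Abadi's post; the function names are hypothetical and purely illustrative, and the join count assumes one extra join per additional column read from the same original table. None of this describes how ParAccel is actually built.

```python
def overhead_ratio(header_bytes: int, value_bytes: int, tuple_id_bytes: int = 0) -> float:
    """Bytes of per-tuple overhead for every byte of actual user data."""
    return (header_bytes + tuple_id_bytes) / value_bytes


def joins_to_reassemble(columns_accessed: int) -> int:
    """Joins needed to stitch single-column partitions of one table back together.

    Assumption: each column lives in its own table, so reading k columns
    of the same original table takes k - 1 joins.
    """
    return max(columns_accessed - 1, 0)


if __name__ == "__main__":
    HEADER = 27    # approximate row-store tuple header size quoted in the post
    VALUE = 4      # a 4-byte column value
    TUPLE_ID = 4   # extra identifier in the "super naive" variant

    print(f"header only:       {overhead_ratio(HEADER, VALUE):.2f}x overhead")
    print(f"header + tuple id: {overhead_ratio(HEADER, VALUE, TUPLE_ID):.2f}x overhead")
    print(f"joins for 5 columns of one table: {joins_to_reassemble(5)}")
```

With those inputs the header alone comes to 6.75 bytes of overhead per byte of data, which is the "factor of 7" in the quote once rounded.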
It will be interesting to see whether ParAccel comments, but even if it does, I wouldn’t necessarily take ParAccel’s statements as dispositive. For example, and as an illustration of my view of ParAccel’s trustworthiness, I believe the ParAccel competitors who tell me that ParAccel’s claim to have won or at least tied all POCs on performance is flat-out untrue.
Comments
30 Responses to “Daniel Abadi has a theory about ParAccel”
I don’t think ParAccel has the option of not responding and addressing this news item. Maybe it turns out they are a hybrid shard/vertical system, but then that would make them a lot like XSPRADA and that would surprise me a GREAT deal! 🙂
The ADBMS world seems split lately between those who do columns and those who don’t (with a strong bias towards the former, as evidenced by the Vertica marketing machine and SAP’s latest paper). I haven’t seen anyone laying claim to yet another way to architect these systems, besides our own architecture. As we say in French, “au pays des aveugles, les borgnes sont rois”, meaning in the land of the blind, the one-eyed man is king (but it sounds a lot nicer in French!).
You can be sure they’ll address it in sales situations. Specifically, they’ll probably try to deflect it by pointing out Abadi’s affiliation with Vertica, by misrepresenting the nature of my falling out with them, and so on. And they’ll have salesmen saying “No, that’s not true” who honestly can’t be expected to back up their statements one way or the other.
But I wouldn’t expect the real truth to be widely known until ParAccel’s next round of employee departures.
Are you implying another round of departures (willing or not) are imminent at ParAccel? This would be surprising given their recent funding would it not?
*is* imminent — sorry I’m so anal about grammar 🙂
No. I’m implying we may have to wait a while for the truth.
Hi Curt,
I am afraid Daniel misread the TPC-H results, hence the confusion. The real numbers show that ParAccel is indeed an efficient column store.
For the benefit of your readers, I’m copying my reply to Daniel’s blog:
Hi Daniel,
I enjoyed your post but I am afraid you misread ParAccel’s TPC-H results. Your quoted number of 275 seconds for TPC-H Query 1 is from the “throughput” test that has 10 concurrent streams of queries. What you should be looking at is the “power” test, which is their “Stream ID 00” numbers. You can see that Query 1 takes 51.4 secs which is very close to your calculations. Therefore, it seems to me, they are indeed a column-store (and hopefully this should answer your question #2).
About Stonebraker’s question (your question #1): When it comes to joining two (or more) tables where the projected columns do not fit in your RAM workspace, chances are that you will be using a 2-pass algorithm, and therefore you will be needing high disk I/O throughput. As you know, for these types of ad-hoc joins, column-stores can only achieve modest improvements over row-stores (after you factor out any compression-related benefits — we showed this in our Sigmod 2009 paper on SSDs and column page layouts http://bit.ly/SSD_sigmod09). Whether column-stores really need high disk I/O in real-life workloads, is, of course, a different question.
Your question #4: From the paper, it is clear that they know that TPC-DS does not have 42 tables (it has 24). They explicitly state that they added tables to show the benefits of their algorithm. If you look at their results, they only perform better for a large number of tables, hence the modification of the original specs. I think this is fair in a paper. Is 42 tables realistic? Again, a different question. (And I think they only have 4 pages because it is an industrial paper, which required that they only submit an abstract for reviewing. It is hard to come up with 14 pages between the notification and camera-ready deadlines, which are less than one month apart.)
Finally, about your question #3, why they worry about 1000s of joins? Beats me.
–stavros
[…] yet done. Posts by Daniel Abadi (interesting analysis but it seems simplistic at first blush) and reflections on Curt Monash’s blog are proving to be amusing, to say the […]
Curt,
Thanks for the kind words.
FYI: I have responded to Stavros’ excellent comment in the thread on my blog.
I used to read your blog selectively as a way to weed out the facts. It’s been very helpful in the past (thank you!), but I’m now having my doubts regarding your integrity. Some of your recently stated opinions wrt certain topics/companies often sound like nothing more than venting.
For instance, you clearly have a personal grudge due to some issues yet to be resolved w/ ParAccel employee(s), and that grudge seems to become the real topic of any comment you make wrt ParAccel. You appeared to parrot DATAllegro without hesitation when they were paying you. Now Vertica pays you and you have only positive things to say about them. ParAccel does not pay you and now we see the result on a continual basis.
Can you please just get back to blogging without all the noise? It’s becoming pretty exhausting and annoying to be quite honest.
Perhaps you’re thinking of a different blog than mine? I’ve posted plenty of skeptical or negative things about DATAllegro, Vertica, and many other perennial clients. (DATAllegro never stopped being a client; they just turned into Microsoft. I sent advice email off to Stuart Frost just a few minutes ago.)
And by the way, one of my clear biases is against cowards. At least I have the guts to attach my name to my opinions, some of which turn out to be correct and some of which don’t, and to disclose information that might help people judge my interests, biases, etc. None of that can be said of you, so what in the world gives you the right to comment on my integrity?
[…] arguments for column stores. Barry’s second post, however, was in direct response to Daniel Abadi’s speculation about ParAccel’s architecture. That post also promises a follow-up addressing the TPC-H in a more substantive […]
There has already been a recent – last week – round of layoffs at Paraccel in the engineering group. Not in spite of the new funding, but BECAUSE of the new round of funding.
That seems implausible, and apt to be misleading even if by some chance it is technically true. “$22 million” for purposes that include “to expand development” may not mean exactly what it sounds like in the press release, but it probably also doesn’t mean “cut back on development.”
The last time I heard a lot of fuss about ParAccel layoffs, it turned out to be only 4-5 people or so. At least, that was Kim Stanick’s story, and nobody ever disproved it, nor did the details of the information passed on to me really contradict what she said.
It was 5 people. Fortunately, I was not one of them.
Are you saying the mid-2008 layoff was 5 people, or the current one was, or both?
Thanks,
CAM
This current one – last week.
Ok well that answers my question as to where that rumor came from. I take it Mr. Anonymous is a ParAccel engineer?
That’s the implication. I haven’t checked IP addresses to confirm (no way I could disconfirm, of course).
This tweet says pretty much the same thing about the layoff:
http://twitter.com/i_integr8/status/2558132370
Yeah I see that but this gentleman is based in Austin it seems (and working at Aruna?) so I’m not sure what he’s talking about and who that “CTO friend” he refers to might be.
In either case, if they cut some engineering deadwood post-funding, that’s not really unusual (in my experience anyway) – I would imagine they’re now flying at a pretty stable flight level engineering wise. It’s on the sales/marketing/PR angle I suppose they will hit hard now. Of course this is all speculation on my part.
Oh, and 10% being 5 people means they have 45 engineers left now? I’d say that’s probably more than sufficient. Of course I’m biased, given what we accomplished at XSPRADA with 1/6th of that! 🙂 But then we didn’t have the deep pockets either, and that’s when a company gets reaaally creative.
Anyhow, a small layoff in an apparently well-funded company isn’t a big deal.
Looks like ParAccel has taken these questions very seriously. They have answered them here: http://paraccel.com/data_warehouse_blog/?p=57
Rama Krishna Venkata Vedantam
Telecom Technology Excellence Group
Tata Consultancy Services
Yep. Hence my post http://www.dbms2.com/2009/07/08/progress-in-figuring-out-what-paraccel-is-doing/
In Spanish: “en el valle de los ciegos, el tuerto es el rey” (in the valley of the blind, the one-eyed man is king).
[…] agree with him that too much has been made about whether a system is a columnar system or a truly columnar system or a vertically partitioned […]
If industrial folks took the time to publish research-quality papers about the internals behind their “fantastic” new results, this whole debate would not exist. We have a lot of conferences for that. The fact that industrial papers are refereed only by abstract, even at our best database conferences such as the one mentioned above, is a clear anomaly, and rather shameful given the stringent criteria the same conferences apply to research papers.
Also, folks, does anyone know how ParAccel technically manages data partitioning? Hash, range, or k-d splits? Static or dynamic?