July 14, 2011

An odd claim attributed to Mike Stonebraker

This post has a sequel.

Last week, Mike Stonebraker insulted MySQL and Facebook’s use of it, by implication advocating VoltDB instead. Kerfuffle ensued. To the extent Mike was saying that non-transparently sharded MySQL isn’t an ideal way to do things, he’s surely right. That still leaves a lot of options for massive short-request databases, however, including transparently sharded RDBMS, scale-out in-memory DBMS (whether or not VoltDB*), and various NoSQL options. If nothing else, Couchbase would seem superior to memcached/non-transparent MySQL if you were starting a project today.

*The big problem with VoltDB, last I checked, was its reliance on Java stored procedures to get work done.

Pleasantries continued in The Register, which got an amazing-sounding quote from Mike. If The Reg is to be believed — something I wouldn’t necessarily take for granted — Mike claimed that he (i.e. VoltDB) knows how to solve the distributed join performance problem. 

So, it’s Stonebraker against the web. And the difference of option is severe. In May, at a MongoDB developer conference in San Francisco, Mongo creator Dwight Merriman told his audience there was “no way” to do distributed joins in a way that really scales. “I’m not smart enough to do distributed joins that scale horizontally, widely, and are super fast. You have to choose something else. We have no choice but to not be relational,” he said

“You can do distributed transactions, but if you do them with no loss of generality and you do them across a thousand machines, it’s not going to be that fast.”

Stonebraker says precisely the opposite, and in typical fashion, he goes right for the jugular. “I reject what Merriman says out of hand,” he tells The Register. Merriman and his company, 10gen, declined to comment for this story. But Stonebaker says words don’t matter. As much as he likes to wield his opinions, he insists the debate will be decided elsewhere. “Let the bake-off begin,” he crows.

But when last I checked, VoltDB made nowhere near that claim. And well it shouldn’t have. In the fully general case, there’s no way to ensure super distributed join performance other than by throwing lots and lots of gear at the problem. But if you do that, many alternatives are fast. More specialized cases may be a different matter — but there are many fast alternatives for those too.

I imagine there will be use cases for which VoltDB sustains a lead as the truly fastest alternative, similarly-architected competitors perhaps excepted.* But what Mike supposedly said seems quite forward-leaning when compared to technical reality.

*The canonical VoltDB use case is e-commerce in virtual goods, the point of “virtual” being that physical inventory might necessitate costlier kinds of joins.

Comments

20 Responses to “An odd claim attributed to Mike Stonebraker”

  1. Ariel Weisberg on July 14th, 2011 8:07 am

    I don’t know what Mike was trying to say, but it is true that there isn’t any magic that can make distributed joins scale.

    What he might have meant is that not all joins are distributed and that it is is easy to distinguish between the two (what VoltDB and several other parallel/distributed relational systems do). Choosing the document model over the relational model shouldn’t be about eliminating distributed joins.

  2. Vlad Rodionov on July 14th, 2011 12:48 pm

    Web database and RDBMS are totally different technologies for totally different applications. Who wants (and needs) to do distributed joins during web request anyway?

    But this one:
    “The big problem with VoltDB, last I checked, was its reliance on Java stored procedures to get work done.” I did not get frankly. So, what is wrong with Java?

  3. J Martin on July 14th, 2011 2:50 pm

    Stonebraker and Merriman are probably talking about different things. There are obvious tricks for making distributed joins scale (bloom filters, duplicating lookup tables, prejoining/denormalizing under the covers, etc), perhaps that is what Stonebraker is talking about.

  4. Curt Monash on July 14th, 2011 3:49 pm

    Ariel, J,

    A problem with your theories — the article didn’t quote Dwight as being in line with what you’re suggesting makes sense, and Mike as being diametrically opposed.

    The kindest theory, perhaps, is that the reporter somehow misparaphrased Dwight’s quote to Mike, and that Mike jumped all over the imprecise version.

  5. Curt Monash on July 14th, 2011 3:50 pm

    Vlad,

    If you like doing your database work in stored procedures and in Java, then you’re in the minority, at least among people dealing with VoltDB target use cases.

  6. Soundbites: the Facebook/MySQL/NoSQL/VoltDB/Stonebraker flap, continued | DBMS 2 : DataBase Management System Services on July 15th, 2011 3:27 am

    [...] a follow-up to the latest Stonebraker kerfuffle, Derrick Harris asked me a bunch of smart followup questions. My responses and afterthoughts [...]

  7. Fred Holahan on July 15th, 2011 4:14 am

    @Curt

    Actually, we don’t get a ton of pushback on Java stored procedures. Developers looking for high throughput OLTP seem to accept the performance benefits of stored procedures, generally, and Java is rarely a point of friction.

    The guys who consistently balk about SPs are using ORM frameworks like Rails and Hibernate, which makes sense when you think about it. On the other side, though, developers using hot frameworks like Node.js (server-side Javascript) seem completely comfortable with high performance, asynchronous Java SPs.

    While it’s true that Java SPs are occasionally a sticking point, that’s not the case as often as you might think and it’s almost never a deal breaker.

  8. Curt Monash on July 15th, 2011 5:22 am

    Fred,

    I’d suggest that people are self-selecting away from seriously considering VoltDB over the programming model issue. (Issues, actually, because there’s also the relational/schema-free discussion.)

    But yeah — if somebody has the budget to go all-in-RAM, maybe they won’t insist on the most simplistic programming model either.

    And having said that, let me hasten to point out that there’s a big difference between having a big cache on the one hand, and on the other hand putting ALL data in RAM, and having one or more mirrors live in RAM as well.

  9. Fred Holahan on July 15th, 2011 7:37 am

    @Curt

    I suppose it depends on what need is most urgent. If the primary requirement is high throughput transactions at scale, VoltDB is increasingly on peoples’ short lists.

    If the primary need is to fit within a chosen programming model (to your point about relational vs. schemaless), then other options will be favored.

    Couldn’t agree more with your comments about caching vs. use of RAM for all data storage. A common misperception is that VoltDB is trying to be an uber-fast general purpose RDBMS. Although VoltDB can happily accommodate multi-terabyte sized databases (i.e., BIG transactional databases), it’s not really designed to be the only engine in the data tier. That’s why we put so much effort into spooling data to Hadoop and other popular analytic datastores.

    The best applications for VoltDB are ones that require very high performance management of transactions and high-value state data, and also actively use some kind of warehouse for deep analytics. Admittedly, there are far fewer of these use cases today than there are traditional (human-driven) OLTP systems, but we believe it’s a part of the database market that will experience significant growth over the next couple of years. I accept that you may not agree with this vision, but we’re pretty excited about it, and encouraged by VoltDB’s (still early) production installs.

  10. Curt Monash on July 15th, 2011 8:08 am

    Fred,

    I do believe you’re talking about the last part of http://www.dbms2.com/2011/03/30/short-request-and-analytic-processing/ :)

    But anyhow, I’m glad you turn out NOT to be arguing that what Facebook uses MySQL for should be on VoltDB instead.

    Given that — do you have a position as to what they SHOULD be using?

  11. Fred Holahan on July 15th, 2011 10:22 am

    @Curt

    The Facebook guys are dealing with performance, scale and durability challenges that few people (myself included) have ever encountered. I’ve followed Mark Callaghan from a distance for several years and believe him to be one of the most skilled, grounded database devops guys on the planet. I’d bet the same can be said for others on the FB team as well. They are much better qualified to speak to your question than I am.

  12. Mark Callaghan on July 16th, 2011 12:01 am

    I love credit but at Google and especially now at Facebook I have been fortunate to work with extremely talented people. We aren’t going to debate our infrastructure choice in public. It is kind of hard to debate something when only one side knows what is going on. Besides, there are far more interesting topics. While VoltDB might not have the best technical marketing, they have amazing technology. I think they will have durable logs soon which should expand their market. I know a bit about Tokutek even if I can’t follow the math behind fractal trees and that too is awesome technology. They have many performance improvements in progress that I will let them describe. Postgres now has integrated replication and will soon have some kind of sync replication. Finally, MySQL is making rapid progress and I really look forward to the 5.6 release which I hope will have parallel replication apply and even more InnoDB performance enhancements.

  13. Curt Monash on July 16th, 2011 12:23 am

    @Mark,

    I agree with you about VoltDB’s marketing; I don’t think they’ve identified a message that a whole lot of paying people will (or should) care about.

    That said, it’s not easy. There are a lot of contenders doing smart things for extreme scale and/or performance, and it’s hard to see how more than a few at most can wind up greatly prospering.

  14. Camuel Gilyadov on July 16th, 2011 9:18 am

    In my experience efficient-distributed-join “magic” comes only in one of two forms:

    1. Only one table is sharded/partitioned and all the rest are replicated in their entirety on every single node. Given, VoltDB is intended for OLTP my bet is that it is the case here. Some would argue, tough, that it is not really distributed join, just distributed single table scan with local join on each node.

    2. Choosing low-latency low-overhead high-throughput networking like Infiniband which also came down in price lately. Also in this case many will still argue that since networking is efficient it is really just one big NUMA server and no “distribution” is taking place so it is local join despite a thousand nodes involved :) It worth noting that for above solution to work first “parallel join” issues should be addressed which is not a simple task either.

  15. Curt Monash on July 16th, 2011 7:22 pm

    Basically correct, Camuel, but there also are other forms of co-location. E.g., Akiban tries to replicate the performance benefits of document-oriented NoSQL by co-locating whatever would be in one “document”, with the big difference being that they leave the logical relational model intact.

  16. Is Stonebraker right? Why SQL isn’t the choice du jour for many apps — Cloud Computing News on July 21st, 2011 7:16 pm

    [...] VoltDB, NimbusDB and the other NewSQL databases might not even be the best options. Monash actually takes a pretty harsh stance when it comes to [...]

  17. Is Stonebraker right? Why SQL isn’t the choice du jour for many apps | Information Technology | Technology News on July 21st, 2011 8:50 pm

    [...] VoltDB, NimbusDB and the other NewSQL databases might not even be the best options. Monash actually takes a pretty harsh stance[15] when it comes to [...]

  18. Barry Zane on July 29th, 2011 1:59 pm

    There are clearly a couple of conversations getting intertwined here, but the Merriman’s specific argument that “there’s no way to do distributed joins in a way that really scales” is just not true.

    A correctly architected parallel database performs distributed joins in a fully linear and scalable manner, on commodity hardware.

    ParAccel proved that in 2007 in its first public, audited TPCH benchmarks. The specific goal was to demonstrate performance across a 10:1 range of data size and a 3:1 range of server count.

    Don’t believe it? Do the math. In all three cases, the performance (Qphh) came out to be in the range 6,580 to 6,620 Qphh per server. In other words, there was less than a 1% variance in scaling.

    ParAccel’s software and commodity hardware have both improved; our 2010 benchmark was at 32,922 Qphh per server, running virtualized with VMWare.

    Sun (now part of Oracle) withdrew the 2007 benchmark last week, but the results are still available for 4 months at:

    http://www.tpc.org/tpch/results/tpch_result_detail_withdrawn.asp?id=107102901
    http://www.tpc.org/tpch/results/tpch_result_detail_withdrawn.asp?id=107102902
    http://www.tpc.org/tpch/results/tpch_result_detail_withdrawn.asp?id=107102903

    The newer 2010 result on 40 servers (80 VMs) is:

    http://www.tpc.org/tpch/results/tpch_result_detail.asp?id=110041101

    Scalability, manageability, and extreme performance from only 40 servers.

    I was at a major bank a few months ago, and they reported that their attempt to duplicate the TPCH with Hadoop was 40X slower. In other words, I suppose they would have needed 1600 servers to do the same task. I am curious if anybody else has comparative figures?

    Barry Zane, ParAccel CTO

  19. Curt Monash on July 29th, 2011 4:54 pm

    Barry,

    I think the argument is against the possibility of doing distributed joins in full generality, e.g. when there’s no way to distributed/partition the data on the join key.

    I further think the word “scalable” is being used in two different ways — both scaling out the number of servers and scaling up the number of concurrent requests. I.e., the claim is that if a join to bring back a result set requires requests across the network, and if you do LOTS of joins of that kind, bad stuff happens.

    This is about short-request processing, not an analytic.

  20. MySQL Ecosystem « IT Primer on October 9th, 2011 11:00 pm

    [...] a follow-up to the latest Stonebraker kerfuffle, Derrick Harris asked me a bunch of smart followup questions. My responses and afterthoughts [...]

Leave a Reply




Feed: DBMS (database management system), DW (data warehousing), BI (business intelligence), and analytics technology Subscribe to the Monash Research feed via RSS or email:

Login

Search our blogs and white papers

Monash Research blogs

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.