May 12, 2009

How much state is saved when an MPP DBMS node fails?

Mark Callaghan raised an interesting question in the comment thread to my recent Facebook/Hadoop/Hive post:

My question is about how commercial MPP RDBMS vendors recover from single or a small number of node failures during a long running SQL query. Do any of them save enough state to avoid starting the query over?

Honestly, I’d just be guessing at the answer.

Would any vendors or other knowledgeable folks care to take a crack at answering directly?


10 Responses to “How much state is saved when an MPP DBMS node fails?”

  1. Robert Morton on May 12th, 2009 3:44 pm

    A MPP RDBMS built by Chuck Norris never experiences node failures.

  2. Masroor Rasheed on May 12th, 2009 4:45 pm

    If a node fails in a clique there is a performance plenty of any query or load job. Once the node is recovered, it will join the MPP.
    It is important to know what level of fault tolerance is built in place.

  3. Curt Monash on May 12th, 2009 5:04 pm

    Steve Wooledge of Aster took a crack at this question over in the Facebook/Hadoop/Hive thread.

  4. Curt Monash on May 12th, 2009 5:04 pm

    LOL, Robert!

  5. Serge Rielau on May 13th, 2009 7:20 am

    As far as DB2 for LUW is concerned if a query fails in an DPF environment it needs to get restarted. Whether the reason is a node failure or something else is irrelevant.
    No intermediate results are rescued – with the obvious exception of a primed bufferpool of course.

  6. Joe Hellerstein on May 14th, 2009 4:41 am

    Roughly speaking, most MPP databases use dataflow pipelines via Graefe’s famous Exchange model. Those pipelines reflect an extremely optimistic view or reliability, and expensive restarts of deep dataflow pipelines in the case of even a single fault.

    By contrast, Hadoop (as per the Google MapReduce paper) is wildly pessimistic, checkpointing the output of every single Map or Reduce stage to disks, before reading it right back in. As a result, it’s easy to construct cases where a traditional MPP DB would do no more I/O than the scan of the inputs, whereas a Hadoop job might need to write and reread stages of the pipeline over and over. (I describe this to my undergrads as vomiting the data to disk just to immediately swallow it back up.)

    The best answer probably lies either in between, or in an entirely different approach. More on that question in this post.

  7. Daniel Weinreb on May 14th, 2009 10:19 am

    Joe, everything you’re saying sounds right.

    But why would Hadoop have to do reread “over and over”? I thought it only reads that data if it’s recovering from a crash.

    Also, is it possible that the Hadoop node writing the checkpoint could be doing the disk writing and its computation in parallel, which would reduce the cost of the checkpointing to at least some, and perhaps a great, degree?

  8. Joe Hellerstein on May 14th, 2009 11:01 am

    Google’s MR paper (which Hadoop folks follow closely) sez:

    1) Mappers write outputs to their local disk. (And some Map jobs are done redundantly on >1 machine to mitigate “stragglers”)
    2) Reducers fetch from mappers over the network some time later
    3) Reducers write their outputs to the distributed filesystem (triply replicated)

    So in terms of resource consumption (energy, disk utilization) that’s a lot of I/O.

    Your points are on target: some reads can be absorbed by filesystem cache, and there is overlap of CPU, disk I/O, and network I/O. The latter only affects completion time, not resource consumption.


  9. Omer Trajman on May 14th, 2009 8:52 pm

    Currently when a Vertica node fails the cluster will cancel any statements that are in flight and the user is immediately able to re-run them. Transactions that are in flight get preserved (i.e. we keep transaction state). The user receives a statement level error, similar to what they would get in the case of a lock timeout or if there is a policy based quota or resource rejection. Feedback from our users is that this model makes development very simple. They don’t need to handle any special cases that a node has failed. When a statement fails they can just re-run it.

    Of course if the machine you are connected to fails then the transactions that it initiated are rolled back and the user needs to connect to a new machine. Since all nodes in a Vertica cluster are peers, users can connect to any node and new connections are rerouted to a live node automatically when using load balancing software.

  10. Shawn Fox on May 27th, 2009 2:43 pm

    With Netezza when a SPU fails during query execution all select statements which have not started returning data get restarted from the beginning.

    Any data loads are killed and must be restarted manually.

    Select statements which have started returning data are killed and must be restarted manually.

    Fortunately failures are not very common so it is rarely an issue.

Leave a Reply

Feed: DBMS (database management system), DW (data warehousing), BI (business intelligence), and analytics technology Subscribe to the Monash Research feed via RSS or email:


Search our blogs and white papers

Monash Research blogs

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.