Comments on: How much state is saved when an MPP DBMS node fails?

By: Shawn Fox

Shawn Fox — Wed, 27 May 2009 18:43:03 +0000

With Netezza, when a SPU fails during query execution, all select statements that have not started returning data get restarted from the beginning.

Any data loads are killed and must be restarted manually.

Select statements which have started returning data are killed and must be restarted manually.

Fortunately, failures are not very common, so it is rarely an issue.

By: Omer Trajman

Omer Trajman — Fri, 15 May 2009 00:52:30 +0000

Currently, when a Vertica node fails, the cluster cancels any statements that are in flight, and the user is immediately able to re-run them. Transactions that are in flight are preserved (i.e. we keep transaction state). The user receives a statement-level error, similar to what they would get in the case of a lock timeout or a policy-based quota or resource rejection. Feedback from our users is that this model makes development very simple. They don’t need to handle node failure as a special case. When a statement fails, they can just re-run it.

Of course, if the machine you are connected to fails, then the transactions that it initiated are rolled back and the user needs to connect to a new machine. Since all nodes in a Vertica cluster are peers, users can connect to any node, and new connections are rerouted to a live node automatically when using load balancing software.
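
A minimal sketch of the "just re-run it" pattern described above. This is not Vertica's client API: `StatementError`, `run_with_retry`, and the `execute` callable are hypothetical names used only for illustration, and the sketch assumes the client surfaces a node failure as an ordinary statement-level exception while the transaction stays open.

```python
import time


class StatementError(Exception):
    """Hypothetical statement-level error (node failure, lock timeout, resource rejection)."""


def run_with_retry(execute, sql, max_attempts=3, delay_seconds=0.0):
    """Re-run `sql` through the caller-supplied `execute` until it succeeds.

    `execute` is an assumed callable that runs one statement and raises
    StatementError when the statement is cancelled. The last error is
    re-raised once `max_attempts` is exhausted.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return execute(sql)
        except StatementError as err:
            last_error = err
            if attempt < max_attempts:
                time.sleep(delay_seconds)  # optional pause before re-running
    raise last_error
```

The caller does not need to know why the statement failed, which is the point of the model: a node failure takes the same code path as a lock timeout. Reconnecting after the failure of the node the session is attached to is a separate concern and is not covered here.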

By: Joe Hellerstein

Joe Hellerstein — Thu, 14 May 2009 15:01:40 +0000

Daniel:
Google’s MR paper (which Hadoop folks follow closely) sez:

1) Mappers write outputs to their local disk. (And some Map jobs are done redundantly on >1 machine to mitigate “stragglers”)
2) Reducers fetch from mappers over the network some time later
3) Reducers write their outputs to the distributed filesystem (triply replicated)

So in terms of resource consumption (energy, disk utilization) that’s a lot of I/O.

Your points are on target: some reads can be absorbed by filesystem cache, and there is overlap of CPU, disk I/O, and network I/O. The latter only affects completion time, not resource consumption.
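
As a rough illustration of the three steps listed above, the tally below counts the bytes moved for one MapReduce job. It is a back-of-the-envelope sketch, not a measurement: the function name and parameters are made up for this example, and it assumes no filesystem cache hits, no redundant "straggler" tasks, and ignores the network traffic needed to replicate the final output.

```python
def mapreduce_job_io(map_output_bytes, reduce_output_bytes, replication=3):
    """Tally I/O for one job under the simplifying assumptions stated above."""
    # Step 1: mappers write their outputs to local disk.
    mapper_disk_write = map_output_bytes
    # Step 2: reducers fetch that data, so it is read back and shipped over the network.
    mapper_disk_read = map_output_bytes
    shuffle_network = map_output_bytes
    # Step 3: reducers write to the distributed filesystem, once per replica.
    dfs_disk_write = reduce_output_bytes * replication
    return {
        "disk_write": mapper_disk_write + dfs_disk_write,
        "disk_read": mapper_disk_read,
        "network_shuffle": shuffle_network,
    }


# Hypothetical sizes: 100 units of map output, 10 units of reduce output.
print(mapreduce_job_io(100, 10))
```

Overlapping CPU, disk, and network work shortens completion time but leaves these totals unchanged, which is the distinction drawn above.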

By: Daniel Weinreb

Daniel Weinreb — Thu, 14 May 2009 14:19:44 +0000

Joe, everything you’re saying sounds right.

But why would Hadoop have to reread “over and over”? I thought it only reads that data if it’s recovering from a crash.

Also, is it possible that the Hadoop node writing the checkpoint could be doing the disk writing and its computation in parallel, which would reduce the cost of the checkpointing to at least some, and perhaps a great, degree?

By: Joe Hellerstein

Joe Hellerstein — Thu, 14 May 2009 08:41:40 +0000

Roughly speaking, most MPP databases use dataflow pipelines via Graefe's famous Exchange model. Those pipelines reflect an extremely optimistic view of reliability, and require expensive restarts of deep dataflow pipelines in the case of even a single fault. By contrast, Hadoop (as per the Google MapReduce paper) is wildly pessimistic, checkpointing the output of every single Map or Reduce stage to disk before reading it right back in. As a result, it's easy to construct cases where a traditional MPP DB would do no more I/O than the scan of the inputs, whereas a Hadoop job might need to write and reread stages of the pipeline over and over. (I describe this to my undergrads as vomiting the data to disk just to immediately swallow it back up.) The best answer probably lies either in between, or in an entirely different approach. More on that question in this post.
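
To make the I/O contrast concrete, here is a toy comparison of the two extremes for a chain of stages. The function names and the sizes are invented for illustration, and the model is deliberately crude: it counts disk bytes only, assumes no failures, and leaves out the write of the final result in both cases.

```python
def pipelined_disk_io(input_bytes):
    """Fully pipelined plan: intermediate results stream between operators,
    so the only disk I/O counted is the scan of the inputs."""
    return input_bytes


def checkpointed_disk_io(input_bytes, intermediate_bytes):
    """Checkpoint-everything plan: each intermediate result is written to disk
    and then read back by the next stage, so it is counted twice."""
    return input_bytes + sum(2 * size for size in intermediate_bytes)


# Hypothetical sizes: a 100-unit input and three intermediate results.
stages = [80, 60, 40]
print(pipelined_disk_io(100))             # 100
print(checkpointed_disk_io(100, stages))  # 100 + 2 * (80 + 60 + 40) = 460
```

The other side of the trade-off is not modeled here: after a fault, the pipelined plan repeats the whole query, while the checkpointed plan can resume from the last stage written to disk.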

By: Serge Rielau

Serge Rielau — Wed, 13 May 2009 11:20:17 +0000

As far as DB2 for LUW is concerned, if a query fails in a DPF environment, it needs to get restarted. Whether the reason is a node failure or something else is irrelevant.
No intermediate results are rescued – with the obvious exception of a primed bufferpool of course.

By: Curt Monash

Curt Monash — Tue, 12 May 2009 21:04:30 +0000

LOL, Robert!

By: Curt Monash

Curt Monash — Tue, 12 May 2009 21:04:22 +0000

Steve Wooledge of Aster took a crack at this question over in the Facebook/Hadoop/Hive thread.

By: Masroor Rasheed

Masroor Rasheed — Tue, 12 May 2009 20:45:53 +0000

If a node fails in a clique, there is a performance penalty on any query or load job. Once the node is recovered, it will rejoin the MPP system.
It is important to know what level of fault tolerance is built in.

By: Robert Morton

Robert Morton — Tue, 12 May 2009 19:44:14 +0000

An MPP RDBMS built by Chuck Norris never experiences node failures.