MapReduce/Hadoop fans sometimes raise the question of query fault-tolerance. That is — if a node fails, does the query need to be restarted, or can it keep going? For example, Daniel Abadi et al. trumpet query fault-tolerance as one of the virtues of HadoopDB. Some of the scientists at XLDB spoke of query fault-tolerance as being a good reason to leave 100s or 1000s of terabytes of data in Hadoop-managed file systems.
- Hadoop generally has query fault-tolerance. Intermediate result sets are materialized, and data isn’t tied to nodes anyway. So if a node goes down, its work can be sent to another node.
- Hive actually did not have query fault-tolerance at that time, but it was on the roadmap. (Edit: Actually, it did within a single MapReduce job. But one Hive job can comprise several rounds of MapReduce.)
- Most DBMS vendors do not have query fault-tolerance. If a query fails, it gets restarted from scratch.
- Aster Data’s nCluster, however, does appear to have some kind of query fault-tolerance.
This raises an obvious (pair of) question(s) — why and/or when would anybody ever care about query fault-tolerance? To start answering that, let’s propose a simple metric for query execution time on a grid or cluster: node-hours = (duration of query) * (number of nodes it executes on). Then, to a first approximation, it would seem:
- If the expected number of node-hours for your queries is of the same order of magnitude as, or higher than, the MTTF (Mean Time To Failure) of a node, query fault-tolerance matters a lot, because without it you'll get a whole lot of retries.
- If the expected number of node-hours for your queries is an order of magnitude less than the node MTTF, query fault-tolerance still has a significant impact on performance, but you can do without it.
- If the expected number of node-hours is even lower than that, query fault-tolerance is pretty irrelevant.
- Using cheap/low-end/commodity hardware — as Hadoop fans like to do — increases the (node-hours)/MTTF ratio in two ways — it both increases the numerator and decreases the denominator.
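The impact of the (node-hours)/MTTF ratio can be made concrete with a little arithmetic. Here's a minimal sketch, assuming independent node failures with exponentially distributed lifetimes, so a query consuming a given number of node-hours completes without interruption with probability exp(-node-hours/MTTF); without query fault-tolerance, each failure forces a full restart, so the expected number of attempts follows a geometric distribution. (The function name and the exponential-failure assumption are mine, not from any particular vendor's model.)

```python
import math

def expected_attempts(duration_hours: float, num_nodes: int,
                      node_mttf_hours: float) -> float:
    """Expected number of full query attempts when any node failure
    forces a restart from scratch (no query fault-tolerance).

    Assumes independent, exponentially distributed node failures, so
    P(no failure during the query) = exp(-node_hours / MTTF).
    Expected attempts = 1 / P(success), the mean of a geometric
    distribution.
    """
    node_hours = duration_hours * num_nodes
    p_success = math.exp(-node_hours / node_mttf_hours)
    return 1.0 / p_success

# Node-hours comparable to MTTF (1000 vs. 1000): lots of retries.
print(expected_attempts(duration_hours=10, num_nodes=100,
                        node_mttf_hours=1000))  # e ~= 2.72 attempts

# Node-hours an order of magnitude below MTTF (100 vs. 1000):
# modest overhead, roughly a 10% expected-time penalty.
print(expected_attempts(duration_hours=1, num_nodes=100,
                        node_mttf_hours=1000))  # ~1.11 attempts
```

Note that this simple model actually understates the pain of restarts, since a failed attempt still burns the node-hours it consumed before failing; it's meant only to show how quickly retries pile up once node-hours approach the MTTF.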
Frankly, I’m not too clear on the use cases for which query fault-tolerance really matters. Still, as noted above, it is indeed coming up more and more. I’m not aware that it would be terribly hard for DBMS vendors to, as an option, let DBAs specify execution plans that force intermediate query materialization, and then use that materialization as the basis for query fault-tolerance. Perhaps doing so should be on some technical roadmaps.