This post started as a minor paragraph in another one I’m drafting. But it grew. Please also see the comment thread below.
Increasingly many data management systems store data in a cluster, putting several copies of data — i.e. “replicas” — onto different nodes, for safety and reliable accessibility. (The number of copies is called the “replication factor”.) But how do they know that the different copies of the data really have the same values? It seems there are three main approaches to immediate consistency, which may be called:
- Two-phase commit (2PC)
- Read-your-writes (RYW) consistency
- Prudent optimism
I shall explain.
Two-phase commit has been around for decades. Its core idea is:
- One node commands other nodes (and perhaps itself) to write data.
- The other nodes all reply “Aye, aye; we are ready and able to do that.”
- The first node broadcasts “Make it so!”
Unless a piece of the system malfunctions at exactly the wrong time, you’ll get your consistent write. And if there indeed is an unfortunate glitch — well, that’s what recovery is for.
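The three steps above can be sketched in a few lines. This is a toy model, not any real system's implementation: the `Node` class, its `prepare`/`commit`/`abort` methods, and the health flag are all illustrative stand-ins.

```python
# Toy two-phase commit. All names here are illustrative.

class Node:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.staged = None   # value held during phase 1
        self.value = None    # durable value after phase 2

    def prepare(self, value):
        # Phase 1: stage the write and vote "aye" only if able.
        if not self.healthy:
            return False
        self.staged = value
        return True

    def commit(self):
        # Phase 2: "Make it so!" -- promote the staged value.
        self.value = self.staged

    def abort(self):
        self.staged = None

def two_phase_commit(nodes, value):
    # Every replica must vote yes, or the whole write is aborted.
    if all(n.prepare(value) for n in nodes):
        for n in nodes:
            n.commit()
        return True
    for n in nodes:
        n.abort()
    return False

nodes = [Node("a"), Node("b"), Node("c")]
assert two_phase_commit(nodes, 42) is True
assert all(n.value == 42 for n in nodes)

# A single down node blocks the write entirely, even though
# the other replicas could have accepted it.
nodes[1].healthy = False
assert two_phase_commit(nodes, 99) is False
```

The final assertion is the blocking behavior at issue: one unhealthy replica vetoes the whole transaction.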
But 2PC has a flaw: If a node is inaccessible or down, then the write is blocked, even if other parts of the system were able to accept the data safely. So the NoSQL world sometimes chooses RYW consistency, which in essence is a loose form of 2PC:
- Writes are attempted on the various nodes a datum is to be replicated to.
- If sufficiently many nodes are good to go — and prove it by sending back the same value for the datum — the write succeeds.
- Later, when the data is read, a second vote is taken to see if the read should be accepted as valid.
There are more ways you can get errors in a RYW database than in a 2PC one; but in the meantime, you’re less likely to have your writes blocked.
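The usual arithmetic behind this scheme is the quorum rule W + R > N: with N replicas, a write needs W acknowledgments and a read polls R replicas, so the two sets must overlap and a read sees the latest write. A minimal sketch, with all names and structures invented for illustration:

```python
# Quorum-style read-your-writes sketch. W + R > N guarantees the
# read set overlaps the write set. All names are illustrative.

N, W, R = 3, 2, 2  # 2 + 2 > 3, so reads overlap writes

def quorum_write(replicas, value, version):
    # Attempt the write on every replica; count the acks.
    acks = 0
    for rep in replicas:
        if rep.get("healthy", True):
            rep["value"], rep["version"] = value, version
            acks += 1
    return acks >= W  # the write succeeds only with a quorum

def quorum_read(replicas):
    # Poll R live replicas; the freshest version wins the vote.
    polled = [r for r in replicas if r.get("healthy", True)][:R]
    if len(polled) < R:
        return None  # not enough replicas to validate the read
    return max(polled, key=lambda r: r["version"])["value"]

replicas = [{"value": None, "version": 0} for _ in range(N)]
assert quorum_write(replicas, "x", 1) is True
replicas[0]["healthy"] = False      # one replica drops out...
assert quorum_read(replicas) == "x"  # ...but the read still passes
```

Note that the write succeeds even with a node down, as long as W replicas answer; that is the looseness relative to 2PC.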
Another problem with 2PC, however, is shared by RYW consistency — both require a lot of network chatter. Consequently, some data management systems rely on a third alternative, which I’m calling prudent optimism. In that approach:
- Node health is monitored out-of-band from the data writing operation.
- Data is written to nodes that are believed to be up.
- If the belief was wrong, recovery should ensue.
Over the past week, I’ve asked how replica consistency is handled in a number of analytic data managers, namely DB2, Netezza, Vertica, Aster, and Hadoop. (My bad for not including Teradata Classic.) It turns out that many different approaches to replica consistency come into play. Specifically, as best I understand:
- DB2 uses 2PC.
- Aster uses a “heavily modified” form of 2PC.
- Vertica and Netezza use forms of prudent optimism in which data is written to all target nodes at once.
- Hadoop uses a form of prudent optimism in which data is streamed to the first node, from there to the second replica, and from there on to the next.
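The Hadoop-style chain can be sketched like this: the client hands the block to the first replica, which writes locally and forwards down the line. The names and dict layout are invented for illustration; this is the shape of pipelined replication, not HDFS's actual code.

```python
# Pipelined (chained) replication sketch, as described for Hadoop:
# each replica writes locally, then streams to the next in the chain.
# All names are illustrative.

def pipeline_write(chain, block):
    if not chain:
        return []
    head, rest = chain[0], chain[1:]
    head["blocks"].append(block)  # write locally first...
    # ...then forward the block to the next replica in the chain.
    return [head["name"]] + pipeline_write(rest, block)

replicas = [{"name": n, "blocks": []} for n in ("r1", "r2", "r3")]
written = pipeline_write(replicas, b"chunk-0")
assert written == ["r1", "r2", "r3"]
assert all(r["blocks"] == [b"chunk-0"] for r in replicas)
```

Compared with writing to all targets at once, the chain costs latency per hop but keeps the client's outbound bandwidth to a single stream.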
Meanwhile, I’m told that HBase doesn’t have a replication factor, which obviates the problem altogether.
Anyhow — while it’s too early to be sure, I suspect prudent optimism will win, at least for analytic use cases. Mixing network monitoring into a database transaction stream feels like a legacy technique.