August 26, 2012

How immediate consistency works

This post started as a minor paragraph in another one I’m drafting. But it grew. Please also see the comment thread below.

Increasingly many data management systems store data in a cluster, putting several copies of data — i.e. “replicas” — onto different nodes, for safety and reliable accessibility. (The number of copies is called the “replication factor”.) But how do they know that the different copies of the data really have the same values? It seems there are three main approaches to immediate consistency, which may be called:

Two-phase commit (2PC)
Read-your-writes (RYW) consistency
Prudent optimism 🙂

I shall explain.

Two-phase commit has been around for decades. Its core idea is:

One node commands other nodes (and perhaps itself) to write data.
The other nodes all reply “Aye, aye; we are ready and able to do that.”
The first node broadcasts “Make it so!”

Unless a piece of the system malfunctions at exactly the wrong time, you’ll get your consistent write. And if there indeed is an unfortunate glitch — well, that’s what recovery is for.

But 2PC has a flaw: If a node is inaccessible or down, then the write is blocked, even if other parts of the system were able to accept the data safely. So the NoSQL world sometimes chooses RYW consistency, which in essence is a loose form of 2PC:

Writes are attempted on the various nodes a datum is to be replicated to.
If sufficiently many nodes are good to go — and prove it by sending back the same value for the datum — the write succeeds.
Later, when the data is read, a second vote is taken to see if the read should be accepted as valid.

There are more ways you can get errors in a RYW database than a 2PC one; but in the mean time, you’re less likely to have your writes blocked.

Another problem with 2PC, however, is shared by RYW consistency — both require a lot of network chatter. Consequently, some data management systems rely on a third alternative, which I’m calling prudent optimism. In that approach:

Node health is monitored out-of-band from the data writing operation.
Data is written to nodes that are believed to be up.
If the belief was wrong, recovery should ensue.

Over the past week, I’ve asked how replica consistency is handled in a number of analytic data managers, namely DB2, Netezza, Vertica, Aster, and Hadoop. (My bad for not including Teradata Classic.) It turns out that many different approaches to replica consistency come into play. Specifically, as best I understand:

DB2 uses 2PC.
Aster uses a “heavily modified” form of 2PC.
Vertica and Netezza use forms of prudent optimism in which data is written to all target nodes at once.
Hadoop uses a form of prudent optimism in which data is streamed to the first node, from there to the second replica, and from there on to the next.

Meanwhile, I’m told that HBase doesn’t have a replication factor, which obviates the problem altogether.

Anyhow — while it’s too early to be sure, I suspect prudent optimism will win, at least for analytic use cases. Mixing network monitoring into a database transaction stream feels like a legacy technique.

Categories: Aster Data, Clustering, Hadoop, HBase, IBM and DB2, Netezza, NoSQL, Teradata, Vertica Systems

Subscribe to our complete feed!

Comments

11 Responses to “How immediate consistency works”

shrikanth shankar on August 27th, 2012 2:35 am

Doesnt HBase just write its logs to HDFS and as a result just inherit the same replica consistency mechanism ? I am ignoring HBase replication which is used for DR etc. Also AFAIK (and I know very little here) the HDFS NameNode plays a big deal in tracking the state of files and blocks which are accessible to readers. So the consistency protocol probably also involves a centralized coordinator and not just the nodes forming the pipeline.
Curt Monash on August 27th, 2012 2:50 am

Hi Shrikanth,

You’re not the first person to suggest that the HBase info is oversimplified at best.
Ewan on August 27th, 2012 3:47 am

Why do you need to start defining your own terms for various types of consistency? There is already a well defined set of definitions that should suffice for the introductory stuff you are talking about. See: http://en.wikipedia.org/wiki/Consistency_model
Curt Monash on August 27th, 2012 6:45 am

Ewan,

Even assuming we take Wikipedia as authoritative, I don’t see how that list suffices. Perhaps you could explain where I am mistaken, and how each of the three terms I used in fact is equivalent to an item on that list?

As you do that, please note that one of the terms on that list was in the title to this post, and the point of the post was to break down the different ways it could be achieved.

Thanks,

CAM
Alan Fekete on August 27th, 2012 5:23 pm

I believe that one should not use “immediate consistency” and “read-your-writes” as synonyms. In my understanding, immediate consistency means that any read sees the most recent completed write (except when a write overlaps the period of the read, in which case the read may return the old or new value [or sometimes even other values entirely]). In “Read-your-writes”, an essential word is “your”; this property deals with the case where the reader and writer are in the same thread or session; in that case the reader must see the value that was previously written. If we write in one thread and then read from another, r-y-w allows the read to return a value from an older write (rather than the most recent write), but this is not permitted with immediate consistency.

Two papers (now a bit dated) that explored various consistency models experimentally are http://www.cidrdb.org/cidr2011/Papers/CIDR11_Paper15.pdf and http://dx.doi.org/10.1145/2093185.2093186
Another project that simulates the quorum algorithms, using empirical data about the latency distributions, is http://vldb.org/pvldb/vol5/p776_peterbailis_vldb2012.pdf
Curt Monash on August 27th, 2012 6:30 pm

Alan,

There’s nothing inherent in RYW that makes it you-specific. The fundamental idea, oversimplified, is that a write is attempted, and over half the relevant nodes have to agree for it to go through. Later, a read is attempted, and again over half the relevant nodes have to agree. The point is that the only way such agreement can occur is if the value being returned is the correct one.

One caveat is that this just means you have a different kind of blocking condition than you have with 2PC. Another caveat would be the performance of voting-on-read; if those reads are of the bulk/scan type, RYW could cause a lot of performance issues. Or, if you optimize performance by skipping a voting step, then you increase the risk of incorrect answers.
Alan Fekete on August 28th, 2012 1:11 pm

Curt,

I believe that your response to my previous comment is confusing because it doesn’t distinguish between the property observed by clients of the storage, and the mechanism used inside the storage. The mechanism where each read quorum intersects each write quorum (usually done by reading R copies and writing W copies, where R+W>N) ensures that users see immediate consistency and therefore it also ensures the RYW property. However, there are other mechanisms (based on sending reads to the same replica used by a previous write from the same thread/session/process) that can provide the RYW property but provide eventual consistency rather than immediate consistency. That is, when the reader is in a different thread/session from the writer, the reader may see old values, but when reading is in the same thread as the writer, one sees the latest version.

This is discussed in detail in Vogel’s CACM paper on Eventual Consistency http://dx.doi.org/10.1145/1435417.1435432
Vogels says (on page 42) “Read-your-writes consistency. This is an important model where process A, after having updated a data item, always accesses the updated value and never sees an older value. This is a special case of the causal consistency model.” and (on p 44) “Whether or not read-your-write, session, and monotonic consistency can be achieved depends in general on the “stickiness” of clients to the server that executes the distributed protocol for them. If this is the same server every time, then it is relatively easy to guarantee read-your-writes and monotonic reads. This makes it slightly more dif- ficult to manage load balancing and fault tolerance, but it is a simple solution. Using sessions, which are sticky, makes this explicit and provides an exposure level that clients can reason about.
Sometimes the client implements read-your-writes and monotonic reads. By adding versions on writes, the client discards reads of values with versions that precede the last-seen version.”
Curt Monash on September 2nd, 2012 7:36 pm

Alan,

I think you’re right. I was conflating a couple of different things.

Still, quorum methods are at the heart of providing the appearance of consistency without all the pain that full consistency might entail.
Jerry Leichter on September 26th, 2012 7:53 am

Curt,

Your final paragraph in your response to Alan again misses his point. You fall back to a notion of “appearance of consistency vs. full consistency”, but I contend it would be impossible to define what that really means.

I’m not sure anyone makes such a thing, but imagine a hybrid disk drive which caches writes in battery-backed RAM and eventually flushes it to disk, with a guaranteed that in the event of a power failure it can always make it to disk. If your data is just in the cache, is it “on disk aka durable storage”?
This example may seem to be unrelated, but it’s fundamental: Every property of a system ultimately has to live at some level of abstraction within the system. Asking questions at the wrong level of abstraction leads no nonsense.

Consider the following algorithm for a database: There are N participating nodes, each of which has a clock. The resolution of the clocks is shorter than the cycle time of the processors, and the clocks are synchronized to at least the cycle time of the processors. Each node also has a unique id number. To write, a node simply writes its local copy, using its local clock to time-stamp the data. To read, a node gets the value from every other node, and chooses the value with the highest timestamp; in case two values have the same timestamp, it chooses the one that came from the node with the highest id number.

Such a database, as seen by a user, obeys the strongest consistency properties. But if you as an outside observer break though the layer of abstraction and look at the individual disks, they are badly *in*consistent. There could well be data items that show a different value at each of the N disks! The values on individual disks aren’t even eventually consistent – they have no nice properties at all.

Consistent? Or just appearing to be consistent?

Granted, this is not an algorithm anyone would be likely to use in practice. Writes are as fast as the local disk, but reads are slow. All nodes have to be up to be able to read anything – though, of course, you can continue to write as long as even a single node is up! You have no actual replication, so durability is that of a single node.

The point of the algorithm is not usefulness, but to show that replication, durability, and availability are independent of consistency. Further, consistency can be available at one layer of abstraction even as it’s routinely violated at another. Practical implementations can overload a mechanism to obtain multiple properties at the same time, but they don’t have to.

— Jerry
“Real-time” is getting real | DBMS 2 : DataBase Management System Services on September 7th, 2016 9:36 am

[…] except in scenarios where the transactions themselves are too slow. Questionably-consistent systems — commonly NoSQL — can usually meet performance requirements, but might have […]
SQLite vs MySQL: 16 Must-Know Differences Between The Two Leading Databases – PC TIPS ONE on March 5th, 2019 7:54 pm

[…] follows the concept of immediate consistency. SQLite doesn’t follow a consistency […]

Leave a Reply

Search our blogs and white papers

Monash Research blogs

DBMS 2 covers database management, analytics, and related technologies.
Text Technologies covers text mining, search, and social software.
Strategic Messaging analyzes marketing and messaging strategy.
The Monash Report examines technology and public policy issues.
Software Memories recounts the history of the software industry.

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.

Links
- Monash Research
- White Papers
Admin
- Log in

How immediate consistency works

Comments

Search our blogs and white papers

Monash Research blogs

User consulting

Vendor advisory

Monash Research highlights

Recent posts

Categories

Date archives

Admin