The pessimist thinks the glass is half-empty.
The optimist thinks the glass is half-full.
The engineer thinks the glass was poorly designed.
Most of what I wrote in Part 1 of this post was already true 15 years ago. But much gets added in the modern era, considering that:
- Clusters will have node hiccups more often than single nodes will. (Duh.)
- Networks are relatively slow even when uncongested, and furthermore congest unpredictably.
- In many applications, it’s OK to sacrifice even basic-seeming database functionality.
And so there’s been innovation in numerous cluster-related subjects, two of which are:
- Distributed query and update. When a database is distributed among many modes, how does a request access multiple nodes at once?
- Fault-tolerance in long-running jobs.When a job is expected to run on many nodes for a long time, how can it deal with failures or slowdowns, other than through the distressing alternatives:
- Start over from the beginning?
- Keep (a lot of) the whole cluster’s resources tied up, waiting for things to be set right?
Distributed database consistency
When a distributed database lives up to the same consistency standards as a single-node one, distributed query is straightforward. Performance may be an issue, however, which is why we have seen a lot of:
- Analytic RDBMS innovation.
- Short-request applications designed to avoid distributed joins.
- Short-request clustered RDBMS that don’t allow fully-general distributed joins in the first place.
But in workloads with low-latency writes, living up to those standards is hard. The 1980s approach to distributed writing was two-phase commit (2PC), which may be summarized as:
- A write is planned and parceled out to occur on all the different nodes where the data needs to be placed.
- Each node decides it’s ready to commit the write.
- Each node informs the others of its readiness.
- Each node actually commits.
Unfortunately, if any of the various messages in the 2PC process is delayed, so is the write. This creates way too much likelihood of work being blocked. And so modern approaches to distributed data writing are more … well, if I may repurpose the famous Facebook slogan, they tend to be along the lines of “Move fast and break things”,* with varying tradeoffs among consistency, other accuracy, reliability, functionality, manageability, and performance.
By the way — Facebook recently renounced that motto, in favor of “Move fast with stable infrastructure.” Hmm …
Back in 2010, I wrote about various approaches to consistency, with the punch line being:
A conventional relational DBMS will almost always feature RYW consistency. Some NoSQL systems feature tunable consistency, in which — depending on your settings — RYW consistency may or may not be assured.
The core ideas of RYW consistency, as implemented in various NoSQL systems, are:
- Let N = the number of copies of each record distributed across nodes of a parallel system.
- Let W = the number of nodes that must successfully acknowledge a write for it to be successfully committed. By definition, W <= N.
- Let R = the number of nodes that must send back the same value of a unit of data for it to be accepted as read by the system. By definition, R <= N.
- The greater N-R and N-W are, the more node or network failures you can typically tolerate without blocking work.
- As long as R + W > N, you are assured of RYW consistency.
That bolded part is the key point, and I suggest that you stop and convince yourself of it before reading further.
Eventually , Dan Abadi claimed that the key distinction is synchronous/asynchronous — is anything blocked while waiting for acknowledgements? From many people, that would simply be an argument for optimistic locking, in which all writes go through, and conflicts — of the sort that locks are designed to prevent — cause them to be rolled back after-the-fact. But Dan isn’t most people, so I’m not sure — especially since the first time I met Dan was to discuss VoltDB predecessor H-Store, which favors application designs that avoid distributed transactions in the first place.
One idea that’s recently gained popularity is a kind of semi-synchronicity. Writes are acknowledged as soon as they arrive at a remote node (that’s the synchronous part). Each node then updates local permanent storage on its own, with no further confirmation. I first heard about this in the context of replication, and generally it seems designed for replication-oriented scenarios.
Finally, let’s consider fault-tolerance within a single long-running job, whether that’s a big query or some other kind of analytic task. In most systems, if there’s a failure partway through a job, they just say “Oops!” and start it over again. And in non-extreme cases, that strategy is often good enough.
Still, there are a lot of extreme workloads these days, so it’s nice to absorb a partial failure without entirely starting over.
- Hadoop MapReduce, which stores intermediate results anyway, finds it easy to replay just the parts of the job that went awry.
- Spark, which is more flexible in execution graph and data structures alike, has a similar capability.
Additionally, both Hadoop and Spark support speculative execution, in which several clones of a processing step are executed at once (presumably on different nodes), to hedge against the risk that any one copy of the process runs slowly or fails outright. According to my notes, speculative execution is a major part of NuoDB’ architecture as well.
I’ve rambled on for two long posts, which seems like plenty — but this survey is in no way complete. Other subjects I could have covered include but are hardly limited to:
- Occasionally-connected operation, which for example is a design point of CouchDB, SQL Anywhere (sort of), and most kinds of mobile business intelligence.
- Avoiding planned downtime — i.e., operating despite self-inflicted wounds.
- Data cleaning and master data management, both of which exist in large part to fix errors people have made in the past.
- Uninterrupted DBMS operation (September, 2012)
- The cardinal rules of DBMS development (March, 2013)
- Bottleneck Whack-A-Mole (August, 2009)