Writing data management or analysis software is hard. This post and its sequel are about some of the reasons why.
When systems work as intended, writing and reading data is easy. Much of what’s hard about data management is dealing with the possibility — really the inevitability — of failure. So it might be interesting to survey some of the many ways that considerations of failure come into play. Some have been major parts of IT for decades; others, if not new, are at least newly popular in this cluster-oriented, RAM-crazy era. In this post I’ll focus on topics that apply to single-node systems; in the sequel I’ll emphasize topics that are clustering-specific.
Major areas of failure-aware design — and these overlap greatly — include:
- Backup and restore. In its simplest form, this is very basic stuff. That said — any decent database management system should let backups be made without blocking ongoing database operation, with the least performance impact possible.
- Logging, rollback and replay. Logs are essential to DBMS. And since they’re both ubiquitous and high-performance, logs are being used in ever more ways.
- Locking, latching, transactions and consistency. Database consistency used to be enforced in stern and pessimistic ways. That’s changing, big-time, in large part because of the requirements of …
- … distributed database operations. Increasingly, modern distributed database systems are taking the approach of getting work done first, then cleaning up messes when they occur.
- Redundancy and replication. Parallel computing creates both a need and an opportunity to maintain multiple replicas of data at once, in very different ways than the redundancy and replication of the past.
- Fault-tolerant execution. When one node is inoperative, inaccessible, overloaded or just slow, you may not want a whole long multi-node job to start over. A variety of techniques address this need.
In a single-server, disk-based configuration, techniques for database fault-tolerance start:
- Database changes (inserts, deletes, updates) are applied expeditiously to disk. Furthermore …
- … a log is kept of the (instructions for) changes. If a change is detected as not going through, it can be reapplied.
- Data is often kept in multiple copies automagically, whether that is governed by the storage systems or the DBMS itself, with background resyncing if one copy is known to go bad.
- These ideas can be pushed further into:
- High availability — the database is mirrored onto storage that is controlled by a second server, which runs the same software as the primary.
- Disaster recovery — same idea as HA, but off-site to protect against site disasters. DR often cuts various corners, as compared to same-site HA, in the speed and/or quality of service with which the remote instance would take over from the primary site.
Valuable though they are, none of these techniques protects against data corruption that occurs via software errors or security breaches, because in those cases the system will likely do a great job of making incorrect changes consistently across all copies of the data. Hence there also needs to be a backup/restore capability, key aspects of which start:
- You periodically create and lock down snapshots of the database.
- In case of comprehensive failure, you load the most recent snapshot you trust, and roll forward by re-applying changes memorialized in the log. (Of course, you avoid re-applying spurious changes that should not have occurred.)
For many users, it is essential that backups be online/continuous, rather than requiring the database to periodically be taken down.
None of this is restricted to what in a relational database would be single-row or at least single-table changes. Transaction semantics cover the case that several changes must either all change or all fail together. Typically all the changes are made; the system observes that they’ve all been made; only then is a commit issued which makes the changes stick. One complexity of this approach is that you need a way to quickly undo changes that don’t get committed.
Locks, latches and timestamps
If a database has more than one user, two worrisome possibilities arise:
- Two conflicting changes might be attempted against the same data.
- One user might try to read data at the moment another user’s request was changing it.
So it is common to lock the portion of the database that is being changed. A major area of competition in the early 1990s — and a big contributor to Sybase’s decline — was the granularity of locks (finer-grained is better, with row-level locking being the relational DBMS gold standard).
In-memory locks are generally called latches; I don’t know what the other differences between locks and latches are.
Increasingly many DBMS are designed with an alternative approach, MVCC (Multi-Version Concurrency Control); indeed, I’m now startled when I encounter one that makes a different choice. The essence of MVCC is that, to each portion (e.g. row) of data, the system appends very granular metadata — a flag for validation/invalidation and/or a timestamp. Except for the flags, the data is never changed until a cleanup operation; rather, new data is written, with later timestamps and validating flags. The payoff is that, while it may still be necessary to lock data against the possibility of two essentially simultaneous writes, reads can proceed without locks, at the cost of a minuscule degree of freshness. For if an update is in progress that affects the data needed for a read, the read just targets versions of the data slightly earlier in time.
MVCC has various performance implications — writes can be faster because they are append-mainly, reads may be slower, and of course the cleanup is needed. These tradeoffs seem to work well even for the query-intensive workloads of analytic RDBMS, perhaps because MVCC fits well with their highly sequential approach to I/O.
A DBMS that truly ran only in RAM would lose data each time the power turned off. Hence a memory-centric DBMS will usually also have some strategy for persisting data.