Tim Vincent of IBM talked me through DB2 pureScale Monday. IBM DB2 pureScale is a kind of shared-disk scale-out parallel OLTP DBMS, with some interesting twists. IBM’s scalability claims for pureScale, on a 90% read/10% write workload, include:
- 95% scalability up to 64 machines
- 90% scalability up to 88 machines
- 89% scalability up to 112 machines
- 84% scalability up to 128 machines
More precisely, those are counts of cluster “members,” but the recommended configuration is one member per operating system instance (i.e., one member per machine) for reasons of availability. In an 80% read/20% write workload, scalability is lower, perhaps 90% scalability over 16 members.
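To put those percentages in perspective, here's a quick back-of-envelope sketch. I'm reading "X% scalability at N members" as meaning aggregate throughput is X% of N times what a single member would do standalone; that reading is my assumption, not IBM's stated definition.

```python
# Back-of-envelope arithmetic on IBM's claims, assuming "X% scalability
# at N members" means aggregate throughput = X% * N * one member's
# standalone throughput (my interpretation, not IBM's definition).
claims = [(64, 0.95), (88, 0.90), (112, 0.89), (128, 0.84)]

for members, efficiency in claims:
    effective = members * efficiency
    print(f"{members:>3} members at {efficiency:.0%} -> "
          f"~{effective:.0f}x a single member's throughput")

# Output:
#  64 members at 95% -> ~61x a single member's throughput
#  88 members at 90% -> ~79x a single member's throughput
# 112 members at 89% -> ~100x a single member's throughput
# 128 members at 84% -> ~108x a single member's throughput
```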
Several elements of IBM’s DB2 pureScale architecture are pretty straightforward:
- There are multiple pureScale members (machines), each with its own instance of DB2.
- There’s an RDMA (Remote Direct Memory Access) interconnect, perhaps InfiniBand. (The point of InfiniBand and other RDMA interconnects is that moving data doesn’t require interrupts, and hence doesn’t cost many CPU cycles.)
- The DB2 pureScale members share access to the database on a disk array.
- Each DB2 pureScale member has its own log, also on the disk array.
Something called GPFS (General Parallel File System), which comes bundled with DB2, sits underneath all this. It’s all based on the mainframe technology IBM Parallel Sysplex.
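To keep the moving parts straight, here's a minimal sketch of that topology. The class and field names are mine, not anything in DB2; the point is just that each member runs its own DB2 instance and writes its own log, while the database itself sits on storage every member shares.

```python
# Minimal sketch of the pureScale topology described above. All names
# here are illustrative, not IBM's.
from dataclasses import dataclass, field

@dataclass
class SharedStorage:
    """The GPFS-backed disk array every member can reach."""
    database_pages: dict = field(default_factory=dict)
    member_logs: dict = field(default_factory=dict)  # one log per member

@dataclass
class PureScaleMember:
    """One member per machine / operating system instance."""
    member_id: int
    storage: SharedStorage

    def append_log(self, record: str) -> None:
        # each member appends to its *own* log, but on the shared array
        self.storage.member_logs.setdefault(self.member_id, []).append(record)

storage = SharedStorage()
members = [PureScaleMember(i, storage) for i in range(4)]  # four machines
members[0].append_log("commit txn 1")
```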
The weirdest part (to me) of DB2 pureScale is something called the Global Cluster Facility, which runs on its own set of boxes. (Edit: Actually, see Tim Vincent’s comment below.) These might have 20% or so of the cores of the member boxes, with perhaps a somewhat higher percentage of RAM (especially in the case of write-heavy workloads). Specifically:
- The DB2 pureScale Global Cluster Facility maintains a buffer pool (cache) shared by all the DB2 pureScale members.
- Even so, the DB2 pureScale members themselves are in charge of disk access.
So what’s going on here is not an Exadata-like split between database server and storage processing tiers. The Global Cluster Facility also handles lock management, presumably because locking issues only arise when a page gets fetched into the buffer.
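Here's my mental model of that division of labor, sketched in code. This illustrates the description above, not IBM's internals; in particular, the idea that a page lock is granted when a page gets fetched into the shared buffer pool is my reading of Tim's explanation, and all the names are invented.

```python
# Sketch of the split described above: the Global Cluster Facility caches
# pages and hands out locks, but members do their own disk I/O.
class GlobalClusterFacility:
    def __init__(self):
        self.buffer_pool = {}   # page cache shared by all members
        self.page_locks = {}    # page_id -> member currently holding the lock

    def request_page(self, page_id, member_id):
        self.page_locks[page_id] = member_id     # lock granted on fetch
        return self.buffer_pool.get(page_id)     # None means cache miss

    def register_page(self, page_id, data):
        self.buffer_pool[page_id] = data         # member publishes what it read

class PureScaleMember:
    def __init__(self, member_id, cf, disk):
        self.member_id, self.cf, self.disk = member_id, cf, disk

    def read_page(self, page_id):
        data = self.cf.request_page(page_id, self.member_id)
        if data is None:
            data = self.disk[page_id]            # the member, not the CF, hits disk
            self.cf.register_page(page_id, data)
        return data

# usage: two members sharing one facility and one "disk"
disk = {"page-1": b"row data"}
cf = GlobalClusterFacility()
m1, m2 = PureScaleMember(1, cf, disk), PureScaleMember(2, cf, disk)
m1.read_page("page-1")   # cache miss: m1 reads disk, registers the page
m2.read_page("page-1")   # cache hit: served from the shared buffer pool
```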
The other surprise is that every client talks to every member, usually through a connection pool from an app server. Tim Vincent assures me that DB2 connections are so lightweight this isn’t a problem. Clients run load-balancing code on behalf of the members, routing transactions to whichever pureScale member is least busy.
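In app-server terms, the routing might look like the sketch below. The load metric and the connection interface are invented for illustration; the only point carried over from the description is "route to whichever member is least busy."

```python
# Sketch of client-side load balancing across pureScale members.
class MemberConnection:
    def __init__(self, member_id):
        self.member_id = member_id
        self.pending = 0                    # stand-in for a real load metric

    def current_load(self):
        return self.pending

    def execute(self, transaction):
        self.pending += 1
        try:
            return f"member {self.member_id} ran {transaction}"
        finally:
            self.pending -= 1

class ClientRouter:
    def __init__(self, connections):
        self.connections = connections      # one per member, e.g. pooled

    def route(self, transaction):
        # pick the least-busy member at the moment the transaction arrives
        target = min(self.connections, key=lambda c: c.current_load())
        return target.execute(transaction)

router = ClientRouter([MemberConnection(i) for i in range(4)])
print(router.route("SELECT 1"))
```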
DB2 pureScale is designed to be pretty robust against outages:
- In the case of planned maintenance, a pureScale member can be “quiesced.” I.e., it stops being given new work; it finishes up its existing work; then maintenance happens; then the member starts being given work again. (There’s a sketch of this flow after the list.)
- In the case of an unplanned outage, the redo log naturally comes into play. The pureScale twist on this is that a second small instance of DB2 is around (or is started up?) just to handle the redos.
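Here's the quiesce flow from the first bullet, sketched in code. The class and method names are mine, not DB2's administration API.

```python
# Sketch of the planned-maintenance ("quiesce") flow described above.
class Transaction:
    def commit(self):
        pass                                # placeholder for a real commit

class PureScaleMember:
    def __init__(self, member_id):
        self.member_id = member_id
        self.accepting_work = True
        self.in_flight = []                 # transactions currently running

    def quiesce(self):
        self.accepting_work = False         # stop being given new work
        while self.in_flight:               # finish up existing work
            self.in_flight.pop().commit()

def planned_maintenance(member, do_maintenance):
    member.quiesce()
    do_maintenance(member)                  # e.g. apply a fix pack
    member.accepting_work = True            # start being given work again

m = PureScaleMember(1)
m.in_flight.append(Transaction())
planned_maintenance(m, lambda member: None)  # drains, maintains, resumes
```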
Also, IBM believes that the DB2 pureScale locking strategy gives availability and performance advantages vs. the Oracle RAC (Real Application Clusters) approach. The distinction IBM draws is that any member can take over the lock on a buffer page from any other member, just by attempting to change the page, and the attempt will succeed; only row-level locks can ever block work. Thus, if a node fails, I/O can merrily proceed on other nodes, without waiting for any recovery effort. IBM’s target is <20 seconds for full row availability to be restored.
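To spell out the distinction in code: in the sketch below, a page lock transfers the moment another member asks, while a row lock (held, say, by a failed member) is the only thing that can block, and recovery just has to release the failed member’s row locks. This is my rendering of IBM’s claim, not their implementation.

```python
# Sketch of the locking semantics as I understand IBM's claim: any member
# can take a page lock from any other (the attempt always succeeds), so
# only row-level locks can ever block work. Illustrative, not IBM code.
class LockManager:
    def __init__(self):
        self.page_owner = {}   # page_id -> member_id
        self.row_locks = {}    # row_id  -> member_id

    def take_page_lock(self, page_id, member_id):
        self.page_owner[page_id] = member_id    # takeover always succeeds
        return True

    def take_row_lock(self, row_id, member_id):
        holder = self.row_locks.get(row_id)
        if holder is not None and holder != member_id:
            return False                        # the only thing that blocks
        self.row_locks[row_id] = member_id
        return True

    def recover_member(self, failed_member_id):
        # on node failure, only that member's row locks need releasing;
        # IBM's stated target: <20 seconds to restore full row availability
        self.row_locks = {row: m for row, m in self.row_locks.items()
                          if m != failed_member_id}
```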
Obviously, it’s crucial that the Global Cluster Facility machines be fully mirrored; a double failure there would take the cluster down. But so what? Modern computing systems have double points of failure all over the place.