The general Tokutek strategy has always been:
- Write indexes efficiently, which …
- … makes it reasonable to have more indexes, which …
- … lets more queries run fast.
But the details of “writes indexes efficiently” have been hard to nail down. For example, my post about Tokutek indexing last January, while not really mistaken, is drastically incomplete.
Adding further confusion is that Tokutek now has two product lines:
- TokuDB, a MySQL storage engine.
- TokuMX, in which the parts of MongoDB 2.2 that roughly equate to a storage engine are ripped out and replaced with Tokutek code.
TokuMX further adds language support for transactions and a rewrite of MongoDB’s replication code.
So let’s try again. I had a couple of conversations with Martin Farach-Colton, who:
- Is a Tokutek co-founder.
- Stayed in academia.
- Is a data structures guy, not a database expert per se.
The core ideas of Tokutek’s architecture start:
- There’s a tree of what serve as indexes, much as in a B-tree. The ultimate leaf nodes store actual data.
- Operations to alter the database — update, insert, schema change, etc. — sends messages to buffers at the appropriate nodes.
- The messages are resolved when buffers are flushed.
- The buffers are flushed just-in-time.
- The buffers of messages are themselves indexed. (Otherwise, determining which buffers contain information relevant to a particular query might require slow and tedious scans.)
A central concept is the interplay between the buffers and the write load.
- Except when buffers are flushed, writes go just to the buffers, and presumably are append-only.
- Buffers are flushed rarely — on average when they’re almost 25% full.
Early on Tokutek made the natural choice to flush buffers when they were touched by a query, but now buffers are just flushed when the total buffer pool runs out of space, fullest buffer first.
This all raises the question — what is a “message”? It turns out that there are a lot of possibilities. Four of the main ones are:
- Insert. The payload is the contents of the inserted row.
- Delete. The payload is the ID of the row being deleted. Since Tokutek is MVCC, a delete message is really an instruction to ignore a row that’s still there.
- Upsert. An upsert is an insert or update, to be determined after the system figures out if there’s a row already in place. So the payload to an upsert message is the payload to an insert, plus enough information to handle the update case. Martin stressed that Tokutek upserts do not require a query to check whether the row already exists, and hence can be 1-2 orders of magnitude faster than upserts in conventional RDBMS.
- Schema change. These are global, broadcast to every node. (And so schema changes can be done while the database is online.)
Since messages are a big part of what’s stored at a node, and they can have a variety of formats, columnar compression would be hard to implement. Instead, Tokutek offers a variety of standard block-level approaches.
A natural question to ask about OLTP (OnLine Transaction Processing) and other short-request DBMS is “When are there locks and latches?” Four cases came up:
- When a buffer is being flushed.
- When a node is being split. (As in the case of B-tree systems.)
- When a transaction requires row locks.
- When MySQL mandates table locks, for whatever arcane reasons it does so.
I forgot to ask whether the locks at buffer flushing time cause performance hiccups.
Other notes include:
- Martin believes that Tokutek read and write performance are both fairly optimal, but at the cost of some CPU cycles. By way of contrast, B-trees have optimal read performance, but can be slow to write.
- I gather Tokutek tried multiple strategies with similar characteristics, with the deciding factor being the difficulties in each approach in coding up database features such as ACID or MVCC (MultiVersion Concurrency Control), or in achieving concurrency.
- Tokutek also hacked special optimizations to be competitive in cases where B-trees are especially fast (the case mentioned was sequential insertions).
- Default node size is 4 megabytes.
- It seems that the branching factor is in line with a Bε-tree, rather than a B-tree, where ε is approximately the square root of the number of keys at a node. (I was confused by that part, but fortunately it seemed inessential.)
And finally — Tokutek has been slow to offer MySQL scale-out, but with the MongoDB version, scale-out is indeed happening. One would think that data could just be distributed among nodes in one of the usual ways, with all the indexes pertaining to that data stored at the same node as the data itself. So far as I can tell, that’s pretty close to being exactly what happens.