I’ve been talking a fair bit with Cory Isaacson, CEO of my client CodeFutures, which makes dbShards. Business notes include:
- 7 production users, plus an 8th imminent.
- 12-14 signed contracts beyond that.
- ~160 servers in production.
- One customer who has almost 15 terabytes of data (in the cloud).
- Still <10 people, pretty much all engineers.
- Profitable, but looking to raise a bit of growth capital.
Apparently, the figure of 6 dbShards customers from July 2010 is more comparable to today’s 20ish contracts than to today’s 7-8 production users. About 4 of the original 6 are in production now.
NDA stuff aside, the main technical subject we talked about is something Cory calls “relational sharding”. The point is that dbShards’ transparent sharding can be done in such a way as to make many joins be single-server. Specifically:
- When a table is sufficiently small to be replicated in full at every node, you can join on it without moving data across the network.
- When two tables are sharded on the same key, you can join on that key without moving data across the network.
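To make the second rule concrete, here’s a small sketch of my own (not dbShards code; the hash-based placement is just an illustrative assumption): when two tables are sharded on the same key, every value of that key lands on the same shard for both tables, so a join on that key can run shard-by-shard with no network traffic.

```python
# Illustrative sketch: two tables sharded on the same key (customer_id)
# via a hypothetical hash-based placement function. Because placement
# depends only on the key, matching rows of both tables co-reside on
# one shard, and the join never crosses shards.

NUM_SHARDS = 4

def shard_of(key: int) -> int:
    """Hypothetical shard assignment; dbShards' real scheme may differ."""
    return hash(key) % NUM_SHARDS

orders = [(1, "order-A"), (2, "order-B"), (1, "order-C")]
payments = [(1, "pay-X"), (2, "pay-Y")]

shards = [{"orders": [], "payments": []} for _ in range(NUM_SHARDS)]
for cust_id, o in orders:
    shards[shard_of(cust_id)]["orders"].append((cust_id, o))
for cust_id, p in payments:
    shards[shard_of(cust_id)]["payments"].append((cust_id, p))

# The join on customer_id runs independently on each shard:
joined = [
    (c1, o, p)
    for shard in shards
    for c1, o in shard["orders"]
    for c2, p in shard["payments"]
    if c1 == c2
]
```

The same placement trick is why a fully replicated small table is also join-safe: a copy is already present wherever the other side of the join lives.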
dbShards can’t do cross-shard joins, but it can do distributed transactions comprising multiple updates. Cory argues persuasively that in almost all cases this is enough; but I see cross-shard joins as a feature that should someday be added to dbShards even so.
The real issue with dbShards’ transparent sharding is ensuring it’s really transparent. Cory regards as typical a customer with a couple thousand tables, who had to change a dozen or so SQL statements to implement dbShards. But there are near-term plans to drive the number of required SQL changes down to 0. The essence of that change is this:
- Today, dbShards requires that all tables have a column for the shard key if they are to be sharded together.
- In the future, there will be a “shard index” that relaxes that requirement.
Problems that the shard index will solve include:
- If you have a Customer-Order-LineItem hierarchy, dbShards wants the CustomerID to be explicit in the LineItem table. The ALTER TABLE part is easy. But then you also have to make sure that rows are correctly populated going forward, new column included.
- Suppose you shard on UserID, but logins are actually done via UserName. Right now dbShards wants a new two-column table relating UserID to UserName, sharded on UserName.
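The UserName workaround can be sketched as follows (my own illustration of the general lookup-table pattern, with hypothetical names; not dbShards’ implementation). The mapping table is itself sharded, on UserName, so resolving a login costs one extra single-shard lookup before routing to the user’s home shard:

```python
# Sketch: user rows sharded on user_id, but logins arrive with
# user_name. A small two-column mapping table, sharded on user_name,
# turns the login into two single-shard operations.

NUM_SHARDS = 4

def shard_of(key) -> int:
    """Hypothetical shard assignment function."""
    return hash(key) % NUM_SHARDS

# User table, sharded on user_id.
user_shards = [dict() for _ in range(NUM_SHARDS)]
# user_name -> user_id mapping table, sharded on user_name.
name_shards = [dict() for _ in range(NUM_SHARDS)]

def add_user(user_id, user_name):
    user_shards[shard_of(user_id)][user_id] = {"name": user_name}
    name_shards[shard_of(user_name)][user_name] = user_id

def login(user_name):
    # Hop 1: single-shard lookup in the mapping table.
    user_id = name_shards[shard_of(user_name)][user_name]
    # Hop 2: single-shard fetch of the user row by its shard key.
    return user_shards[shard_of(user_id)][user_id]

add_user(17, "alice")
```

Presumably the planned “shard index” would maintain something like that mapping automatically instead of making the customer create and populate the extra table.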
While the implementation details are surely in most cases different, philosophically this is a lot like the issues Akiban’s architecture also deals with.
Some specific feature points are:
- You start off with a short, manual process of putting tables into “shard groups”, each associated with a shard key.
- Tables that you don’t put into a shard group are replicated globally.
- Tables in each shard group are automatically sharded, based on foreign key relationships.
- If the foreign key relationships are already explicit — e.g. in support of referential integrity — you’re set.
- Otherwise, you need to enter them, helped along by a wizard that looks at common names and so on.
- Besides the handwork gotchas I wrote about above, one other gotcha is that you might need to give a few hints to the parser.
Cory also took the opportunity to call out an advantage of the dbShards replication scheme, whose basic story has always been “it’s fast while also being pretty reliable”. Specifically — and this has evidently been tested by a user with 16 nodes plus 16 hot spares/secondary copies — you can:
- Take your secondary nodes out of production.
- ALTER TABLE on them.
- Keep running replication to collect updates.
- Apply the updates despite the schema change.
- Promote the secondary copies to be primary.
- Do it all over again for the former primaries.
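The steps above amount to a classic rolling schema change. A toy simulation of the pattern (my own sketch of the general technique, not dbShards internals; rows are dicts and the “schema change” adds a defaulted column):

```python
# Toy rolling ALTER TABLE: the secondary leaves production, gets the
# schema change, replays the replication log it missed -- widening
# old-format rows on the way in -- and is then promoted.

primary = [{"id": 1, "name": "a"}]
secondary = [dict(r) for r in primary]
replication_log = []

# 1. Secondary is out of production; the primary keeps taking writes,
#    which queue up in the replication log.
def write(row):
    primary.append(row)
    replication_log.append(("insert", row))

write({"id": 2, "name": "b"})

# 2. ALTER TABLE on the secondary: add a column with a default value.
for row in secondary:
    row["status"] = "active"

# 3./4. Apply the queued updates despite the schema change; rows that
#       predate the new column are widened as they are applied.
for op, row in replication_log:
    if op == "insert":
        secondary.append({**row, "status": "active"})

# 5./6. Promote the secondary, then repeat the whole dance for the
#       former primary.
primary, secondary = secondary, primary
```

The reason this works at all is that the replication stream is applied logically rather than byte-for-byte, so the replica’s schema is free to drift ahead of the primary’s for the duration of the swap.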