In a nutshell, ScaleDB’s proposition is:
An innovative approach to relational DBMS indexing, providing performance advantages.
Shared-everything scale-up that ScaleDB believes will leapfrog the MySQL engine competition already in Release 1. (In my opinion, this is the least plausible part of the ScaleDB story.)
State-of-the-art me-too facilities for locking, logging, replication/fail-over, etc., also already in Release 1.
Like many software companies with non-US roots, ScaleDB seems to have started with a single custom project, using a Patricia trie indexing system. Then they decided Patricia tries might be really useful for relational OLTP as well. The ScaleDB team now features four developers, plus half-time or so “Chief Architect” involvement from Vern Watts. Watts seems to pretty much have been Mr. IMS for the past four decades, and thus surely knows a whole lot about pointer-based database management systems; presumably, he’s responsible for the generic DBMS design features that are being added to the innovative indexing scheme. On ScaleDB’s advisory board is PeopleSoft veteran Rick Berquist, about whom I’ve had fond thoughts ever since he talked me into focusing on consulting as the core of my business.*
*More precisely, Rick pretty much tricked me into doing a day of consulting for $15K, then revealed that’s what he’d done, expressing the thought that he’d very much gotten his money’s worth. But I digress …
ScaleDB has no customers to date, but hopes to be in beta by the end of this year. Angels and a small VC firm have provided bridge loans; otherwise, ScaleDB has no outside investment. ScaleDB’s business model thoughts include:
$1,000/server/year license fee, or something in that range.
Early focus on Web 2.0 kinds of customers (e.g., social networking companies may enjoy the join performance ScaleDB plans to offer).
Early focus on MySQL OLTP (but, like proud parents everywhere, they think the technology is so wonderful that it could eventually be pretty much all things to all people).
The company is based in Menlo Park, CA.
Probably I should explain what Patricia tries actually are, and how they can help relational DBMS. An ordinary trie* is a way of indexing data that looks a lot like – unsurprisingly – a tree. For example, suppose you need to index a lot of character strings, each consisting of lower-case Latin letters. From the root node you point to the 26 possibilities for starting letter. From those you point to the next possible letter, and so on. Combinatorial explosion is averted because you only have edges if there’s actually a string with that letter combination. Thus, when indexing a corpus of classic novels, there might be a path i-t-i-s-a-t-r-u-t-h-u-n-… and so on, but none that starts i-a-u-z-z-z.
*“Trie” is sometimes pronounced like “tree”, sometimes like “try.”
Patricia tries add a now-obvious compression technique. Namely, if there’s only one branch from a node, just collapse it. Thus, the example I gave above would become something more like i-t-i-s-a-truth-universally-acknowledged-…, or perhaps something even more compact.
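To make the collapsing concrete, here's a minimal sketch of a path-compressed (Patricia-style) trie over lower-case strings. Real Patricia tries compress at the bit level; this version compresses at the character level for readability, and all the names in it are my own illustration, not anything from ScaleDB's implementation.

```python
class Node:
    def __init__(self):
        self.children = {}   # edge label (a collapsed run of characters) -> Node
        self.terminal = False  # does a stored string end exactly here?

def insert(root, word):
    node = root
    while word:
        for label in list(node.children):
            # Find the longest common prefix between this edge and the word.
            common = 0
            while (common < len(label) and common < len(word)
                   and label[common] == word[common]):
                common += 1
            if common == 0:
                continue
            if common < len(label):
                # Split the edge: "truth" vs "try" yields a shared edge "tr".
                child = node.children.pop(label)
                mid = Node()
                mid.children[label[common:]] = child
                node.children[label[:common]] = mid
                node = mid
            else:
                node = node.children[label]
            word = word[common:]
            break
        else:
            # No edge shares a prefix: add one collapsed edge for the rest.
            leaf = Node()
            leaf.terminal = True
            node.children[word] = leaf
            return
    node.terminal = True

def contains(root, word):
    node = root
    while word:
        for label, child in node.children.items():
            if word.startswith(label):
                node, word = child, word[len(label):]
                break
        else:
            return False
    return node.terminal
```

Inserting “itisatruth”, “truth”, and “try” produces exactly the behavior described above: “truth” and “try” end up hanging off a shared collapsed edge “tr”, and a path that starts “iauzzz” simply doesn't exist.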
While these ideas were evidently invented with text documents in mind, there’s no reason they can’t be applied to other kinds of strings – specifically, to those stored in relational databases. (And numbers can just be treated as strings of bits.) As I wrote last year in discussing solidDB, which uses a similar approach:
The canonical index structure in a disk-centric OLTP RDBMS is a tree of blocks. The record sought is in a block somewhere. There are index blocks whose entries are pointers to the correct block based on values in the index column. There are index blocks of pointers to other index blocks. And so on. One can traverse these trees in very few steps, but each step is costly, because each step involves examining the whole block.
SolidDB, by way of contrast, uses a core index structure called the trie. The key value on which the record search is based is divided into chunks of bits. Each chunk leads to a tree node with a small number of choices for the next chunk. There are more steps, but each step is much cheaper.
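The chunked lookup in that quote can be sketched in a few lines. The chunk size and node width below are my own illustrative choices, not solidDB's or ScaleDB's actual parameters: a 32-bit integer key is split into 4-bit nibbles, and each nibble selects one of at most 16 children at the current (small, cheap-to-examine) node.

```python
def insert(root, key, value):
    node = root
    for shift in range(28, 0, -4):          # walk 7 of 8 nibbles, high bits first
        nibble = (key >> shift) & 0xF
        node = node.setdefault(nibble, {})  # small node: dict of up to 16 children
    node[key & 0xF] = value                 # last nibble maps straight to the record

def lookup(root, key):
    node = root
    for shift in range(28, 0, -4):
        node = node.get((key >> shift) & 0xF)
        if node is None:
            return None
    return node.get(key & 0xF)
```

Eight tiny steps instead of a few block-sized ones, which is the trade the quote describes.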
Benefits of this strategy include compression and in-memory performance. But a naive implementation would, as in other pointer-based systems, lead to unacceptable disk thrashing. ScaleDB’s answer is to layer the index, essentially creating a “trie of tries.” The company confidently claims that, in almost all cases, data can be found via a single disk read. Part of that story is the assertion that their indexing scheme achieves tremendous compression vs. conventional b-trees.
So far, that all sounds like a performance win, of unclear magnitude. (ScaleDB says it’s hoping for a 3X or better performance advantage versus traditional b-tree-based approaches.) But there’s another cool part as well. The ScaleDB trie doesn’t necessarily end with the first row it finds; it also reaches through to capture foreign-key relationships. E.g., if customer FOO123 places an order with OrderID BAR456, the BAR456 isn’t just found via the path B-A-R-4-5-6. It also can be found via FOO-1-2-3-BAR-456. Thus, referential integrity and updatable views are baked into the core database management architecture.
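The dual-path idea can be illustrated with key construction alone. In this hypothetical sketch (mine, not ScaleDB's), a plain dict stands in for the trie; the point is that each order row is reachable both under its own key and under a composite customer-key-plus-order-key path, so a customer's orders share a common prefix and a join-like lookup becomes a prefix scan.

```python
orders = {}

def add_order(customer_id, order_id, row):
    orders[order_id] = row                      # direct path: B-A-R-4-5-6
    orders[customer_id + "/" + order_id] = row  # join path: FOO123/.../BAR456

add_order("FOO123", "BAR456", {"total": 99})

# Fetching all of FOO123's orders is a prefix scan under the customer's key:
foo_orders = {k: v for k, v in orders.items() if k.startswith("FOO123/")}
```

In a real trie, the prefix scan is just a subtree traversal, which is where the claimed join performance would come from.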
I look forward to seeing how this all works out, in Release 1 and beyond.
Edit: One way to think of this is as the integration of the network and relational data models, à la IDMS/R, but with more compact linked lists. And I believe Predrag Dizdarevic when he tells me IDMS/R did wind up working pretty well, in a rare instance of a DBMS technology success post-acquisition by CA.