It feels like time to write about Clustrix, which I last covered in detail in May 2010, and which is releasing Clustrix 4.0 today. Clustrix and Clustrix 4.0 basics include:
- Clustrix makes a short-request processing appliance.
- As you might guess from the name, Clustrix is clustered — peer-to-peer, with no head node.
- The Clustrix appliance uses flash/solid-state storage.
- Traditionally, Clustrix has run a MySQL-compatible DBMS.
- Clustrix 4.0 introduces JSON support. More on that below.
- Clustrix 4.0 introduces a bunch of administrative features, and parallel backup.
- Also in today’s announcement is a Rackspace partnership to offer Clustrix remotely, at monthly pricing.
- Clustrix has been shipping product for about 4 years.
- Clustrix has 20 customers in production, running >125 Clustrix nodes total.
- Clustrix has 60 people.
- List price for the smallest Clustrix system, at 3 nodes, is $150K. Highest-end maintenance costs 15%.
- There’s also a $100K version meant for high availability/disaster recovery. Over half of Clustrix’s customers use off-site disaster recovery.
- Clustrix is raising a C round. Part of it has already been raised from insiders, as a kind of bridge.
The biggest Clustrix installation seems to be 20 nodes or so. Others seem to have 10+. I presume those disaster recovery customers have 6 or more nodes each. I’m not quite sure how all that arithmetic works out; perhaps the 125ish count of nodes is a bit low.
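One way to check the node arithmetic is to compute the smallest total consistent with the figures quoted here. The sketch below is only an illustration. It assumes exactly one 20-node installation, at least two 10-node ones (the `TEN_PLUS` value is a guess), 11 disaster recovery customers (the fewest that counts as "over half" of 20), the big installations being among those disaster recovery customers, and everybody else at the 3-node minimum.

```python
# Floor on total node count implied by the quoted figures.
# How customers split into tiers is an assumption for illustration.
CUSTOMERS = 20
MIN_NODES = 3        # smallest system sold
DR_MIN_NODES = 6     # presumed minimum for a disaster recovery customer
LARGEST = 20         # "20 nodes or so"
TEN_PLUS = 2         # assumption: "others" means at least two 10-node installs


def node_floor(customers=CUSTOMERS, ten_plus=TEN_PLUS):
    """Smallest node total consistent with the assumptions above."""
    dr_customers = customers // 2 + 1   # "over half" of 20 is at least 11
    big = 1 + ten_plus                  # the largest install plus the 10+ ones
    # Counting the big installs as DR customers minimizes the total.
    small_dr = dr_customers - big
    non_dr = customers - dr_customers
    return (LARGEST
            + ten_plus * 10
            + small_dr * DR_MIN_NODES
            + non_dr * MIN_NODES)


print(node_floor())            # 115
print(node_floor(ten_plus=5))  # 127
```

Under those assumptions the floor is 115 nodes, and it only passes 125 once there are five installations at 10 nodes. So a count a little above 125 is not impossible, but it leaves very little room for customers who have grown past the minimum sizes.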
Clustrix technical notes include:
- Clustrix is MVCC (Multi-Version Concurrency Control).
- Clustrix exploits MVCC to allow online, lockless schema changes. Clustrix says these changes are typically single-column, for example an add or a widening/datatype change.
- Clustrix indexes are a mix of b-trees and log-structured merge files.
- Clustrix sounds like it has paid attention to multi-core execution. For example, DR replication is via parallel, multi-core log streaming, going single-core only when transactions have the potential to influence each other.
- MySQL features Clustrix lacks include triggers and XML support.
- Clustrix uses MLC flash.
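For readers who want the MVCC point made concrete, here is a generic sketch of the idea: writers append new versions rather than overwriting, and each reader sees the newest version at or before its snapshot. This is not Clustrix's storage format, and `VersionedStore` and its methods are hypothetical names. Whether Clustrix's online schema changes work exactly this way is not something I was told.

```python
import itertools


class VersionedStore:
    """Toy multi-version store: no locks, readers pick versions by snapshot."""

    def __init__(self):
        self._clock = itertools.count(1)  # increasing commit timestamps
        self._versions = {}               # key -> [(commit_ts, value), ...], oldest first
        self._last_commit = 0

    def write(self, key, value):
        """Append a new version instead of overwriting the old one."""
        ts = next(self._clock)
        self._versions.setdefault(key, []).append((ts, value))
        self._last_commit = ts
        return ts

    def snapshot(self):
        """A reader's snapshot is the latest commit timestamp it may see."""
        return self._last_commit

    def read(self, key, snapshot_ts):
        """Return the newest version committed at or before the snapshot."""
        visible = [v for ts, v in self._versions.get(key, []) if ts <= snapshot_ts]
        return visible[-1] if visible else None


store = VersionedStore()
store.write("row1", {"name": "a"})
old_snapshot = store.snapshot()
# A "widened" row is written as a new version; the old one stays readable.
store.write("row1", {"name": "a", "email": None})
print(store.read("row1", old_snapshot))      # {'name': 'a'}
print(store.read("row1", store.snapshot()))  # {'name': 'a', 'email': None}
```

The point for schema changes is that a reader holding an older snapshot keeps seeing the old-shape row while the new-shape version is being written, so nobody has to wait on a lock.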
Clustrix doesn’t have compression, with the usual excuse of excessive CPU cost. When I pointed out that dictionary/token compression is cheap, Clustrix cofounder/CTO Sergei Tsarev suggested that it doesn’t make sense now due to high cardinalities in OLTP workloads, but could become more important as more analytic use cases emerge.
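The cardinality argument is easy to see with a toy dictionary encoder. The sketch below is an illustration only: the 4-byte fixed-width codes and the sample columns are assumptions, and nothing here reflects how Clustrix stores data.

```python
def dictionary_encode(values):
    """Replace each distinct value with a small integer code."""
    dictionary = {}
    codes = []
    for v in values:
        # A new value gets the next unused code; a repeat reuses its code.
        codes.append(dictionary.setdefault(v, len(dictionary)))
    return dictionary, codes


def raw_size(values):
    """Bytes needed to store the strings as-is."""
    return sum(len(v.encode("utf-8")) for v in values)


def encoded_size(values, code_bytes=4):
    """Bytes for the dictionary entries plus one fixed-width code per row."""
    dictionary, codes = dictionary_encode(values)
    dict_bytes = sum(len(v.encode("utf-8")) for v in dictionary)
    return dict_bytes + code_bytes * len(codes)


# Low cardinality: 3 distinct values over 3,000 rows.
statuses = ["pending", "shipped", "delivered"] * 1000
# High cardinality: every one of 3,000 rows is distinct.
order_ids = [f"order-{i:08d}" for i in range(3000)]

print(raw_size(statuses), encoded_size(statuses))    # 23000 12023
print(raw_size(order_ids), encoded_size(order_ids))  # 42000 54000
```

With only three distinct values the encoded column is roughly half the raw size. When every value is unique, the dictionary is as big as the original column and the codes are pure overhead, which is the situation Sergei was describing for OLTP data.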
Clustrix’s JSON story seems to be:
- The JSON goes into a relational column.
- Fields inside a JSON document can be indexed.
- One can then reference those fields in SQL just as if they were relational columns, including in joins.
- If you’re reckless when joining on multi-valued fields, trouble could in theory ensue.
That sounds a lot like other schemes for sticking documents into relational BLOBs/CLOBs (Binary/Character Large OBjects), although it happens to be the first time I’ve heard it in connection with JSON.
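To make the JSON points concrete without guessing at Clustrix's SQL syntax, which I don't have, here is a plain Python sketch of the same ideas: documents stored as text in a column, a secondary index built on a field inside them, and a join through that index. The table contents and function names are all made up for illustration.

```python
import json
from collections import defaultdict

# Hypothetical table: a relational id plus a JSON document column.
events = [
    (1, '{"user_id": 10, "tags": ["a", "b"]}'),
    (2, '{"user_id": 11, "tags": ["b"]}'),
    (3, '{"user_id": 10, "tags": []}'),
]
users = {10: "alice", 11: "bob"}             # ordinary table: user_id -> name
tag_owners = {"a": "team-x", "b": "team-y"}  # ordinary table: tag -> owner


def build_field_index(rows, field):
    """Secondary index on a JSON field: value -> row ids.

    A multi-valued (array) field yields one index entry per element.
    """
    index = defaultdict(list)
    for row_id, doc_text in rows:
        value = json.loads(doc_text).get(field)
        if value is None:
            continue  # field absent: nothing to index
        values = value if isinstance(value, list) else [value]
        for v in values:
            index[v].append(row_id)
    return index


def join_on_field(index, other):
    """Join indexed rows to another table keyed by the same value."""
    return [(row_id, other[key])
            for key in sorted(index) if key in other
            for row_id in index[key]]


print(join_on_field(build_field_index(events, "user_id"), users))
# [(1, 'alice'), (3, 'alice'), (2, 'bob')]
print(join_on_field(build_field_index(events, "tags"), tag_owners))
# [(1, 'team-x'), (1, 'team-y'), (2, 'team-y')]
```

The second join is the multi-valued case: event 1 shows up twice because it carries two tags, and event 3 drops out because it has none. Join two tables on multi-valued fields without thinking about that and the row counts can grow in ways an aggregate won't expect.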
Clustrix has one cool idea I haven’t heard from anybody else, which I’m calling index distribution. The idea is that each index can be distributed differently across the cluster (this includes the JSON secondary indexes), i.e. on different distribution keys. Clustrix thinks that paying special attention to index distribution and movement is helpful to the performance of distributed joins.
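A small sketch of what distributing each index on its own key could look like follows. Hash placement, the 4-node cluster, and every name in the code are assumptions for illustration; all I actually know is that each index can have its own distribution key.

```python
import hashlib

NODES = 4  # hypothetical cluster size


def node_for(key):
    """Map a distribution key to a node by hashing (assumed placement scheme)."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NODES


def place(rows, key_column):
    """Spread entries across nodes by the chosen distribution key.

    Each entry is recorded by its order_id, standing in for a row pointer.
    """
    placement = {n: [] for n in range(NODES)}
    for row in rows:
        placement[node_for(row[key_column])].append(row["order_id"])
    return placement


orders = [{"order_id": i, "customer_id": i % 3} for i in range(12)]

base_table = place(orders, "order_id")         # rows spread by order_id
customer_index = place(orders, "customer_id")  # index spread by customer_id

# Every index entry for one customer lands on a single node, even though
# that customer's base rows may be scattered across the cluster.
for customer in range(3):
    print(customer, "-> node", node_for(customer))
```

The design point is that a lookup or join on `customer_id` can go straight to the one node holding that customer's index entries, rather than asking every node, while the base rows stay evenly spread by their own key.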
I still wish Clustrix were available on a software-only/bring your own hardware/bring your own cloud basis. Absent that, pricing and lock-in are concerns. True, I didn’t immediately see any flaws in Clustrix’s claims that its Rackspace offering was at once cheaper and more performant than MySQL on Amazon; but then, Amazon isn’t always that cost-effective an option. Price aside, Clustrix does sound as if it’s one of a number of appealing NewSQL options, and probably even one of the (relatively speaking) more proven ones.