Crate.io and CrateDB basics include:
- Crate.io makes CrateDB.
- CrateDB is a quasi-RDBMS designed to receive sensor data and similar IoT (Internet of Things) inputs.
- CrateDB’s creators were perhaps a little slow to realize that the “R” part was needed, but are playing catch-up in that regard.
- Crate.io is an outfit founded by Austrian guys, headquartered in Berlin, that is turning into a San Francisco company.
- Crate.io says it has 22 employees and 5 paying customers.
- Crate.io cites bigger numbers than that for confirmed production users, clearly active clusters, and overall product downloads.
In essence, CrateDB is an open source and less mature alternative to MemSQL. The opportunity for MemSQL and CrateDB alike exists in part because analytic RDBMS vendors didn’t close it off.
CrateDB’s not-just-relational story starts:
- A column can contain ordinary values (of usual-suspect datatypes) or “objects”, …
- … where “objects” presumably are the kind of nested/hierarchical structures that are common in the NoSQL/internet-backend world, …
- … except when they’re just BLOBs (Binary Large OBjects).
- There’s a way to manually define “strict schemas” on the structured objects, and a syntax for navigating their structure in WHERE clauses.
- There’s also a way to automagically infer “dynamic schemas”, but it’s simplistic enough to be more suitable for development/prototyping than for serious production.
Crate gave an example of data from >800 kinds of sensors being stored together in a single table. This leads to significant complexity in the FROM clauses. But querying the same data in a relational schema would be at least as complicated, and probably worse.
One key to understanding Crate’s architectural choices is to note that they’re willing to have different latency/consistency standards for:
- Writes and single-row look-ups.
- Aggregates and joins.
And so it makes sense that:
- Data is banged into CrateDB in a NoSQL-ish kind of way as it arrives, with RYW consistency.
- The indexes needed for SQL functionality are updated in microbatches as soon as possible thereafter. (Think 100 milliseconds as a base case.) Crate.io characterizes the consistency for this part as “eventual”.
CrateDB will never have real multi-statement transactions, but it has simpler levels of isolation that may be called “transactions” in some marketing contexts.
CrateDB technical highlights include:
- CrateDB records are stored as JSON documents. (Actually, I didn’t ask whether this was true JSON or rather something “JSON-like”.)
- In the purely relational case, the documents may be regarded as glorified text strings.
- I got the impression that BLOB storage was somewhat separate from the rest.
- CrateDB’s sharding story starts with consistent hashing.
- Shards are physical-only. CrateDB lacks the elasticity-friendly feature of there being many logical shards for each physical shard.
- However, you can change your shard count, and any future inserts will go into the new set of shards.
- In line with its two consistency models, CrateDB also has two indexing strategies.
- Single-row/primary-key lookups have a “forward lookup” index, whatever that is.
- Tables also have a columnar index.
- More complex queries and aggregations are commonly done straight against the columnar index, rather than the underlying data.
- CrateDB’s principal columnar indexing strategy sounds a lot like inverted-list, which in turn is a lot like standard text indexing.
- Specific datatypes — e.g. geospatial — can be indexed in different ways.
- The columnar index is shard-specific, and located at the same node as the shard.
- At least the hotter parts of the columnar index will commonly reside in memory. (I didn’t ask whether this was via straightforward caching or some more careful strategy.)
- While I didn’t ask about CrateDB’s replication model in detail, I gathered that:
- Data is written synchronously to all nodes. (That’s sort of implicit in RYW consistency anyway.)
- Common replication factors are either 1 or 3, depending on considerations such as the value of the data. But as is usual, some tables can be replicated across all nodes.
- Data can be read from all replicas, for obvious reasons of performance.
- Where relevant — e.g. the wire protocol or various SQL syntax specifics — CrateDB tends to emulate Postgres.
- The CrateDB stack includes Elasticsearch and Lucene, both of which make sense in connection with Crate’s text/document orientation.
Crate.io is proud of its distributed/parallel story.
- Any CrateDB node can plan a query. Necessary metadata for that is replicated across the cluster.
- Execution starts on a shard-by-shard basis. Data is sorted at each shard before being sent onward.
- Crate.io encourages you to run Spark and CrateDB on the same nodes.
- This is supported by parallel Spark-CrateDB integration of the obvious kind.
- Crate.io notes a happy synergy to this plan, in that Spark stresses CPU while CrateDB is commonly I/O-bound.
The CrateDB-Spark integration was the only support I could find for various marketing claims about combining analytics with data management.
Given how small and young Crate.io is, there are of course many missing features in CrateDB. In particular:
- A query can only reshuffle data once. Hence, CrateDB isn’t currently well-designed for queries that join more than 2 tables together.
- The only join strategy currently implemented is nested loop. Others are in the future.
- CrateDB has most of ANSI SQL 92, but little or nothing specific to SQL 99. In particular, SQL windowing is under development.
- Geo-distribution is still under development (even though most CrateDB data isn’t actually about people).
- I imagine CrateDB administrative tools are still rather primitive.
In any case, creating a robust DBMS is an expensive and time-consuming process. Crate has a long road ahead of it.
Edit: For some clarification and even correction, please see the first comment below.