December 18, 2016

Introduction to Crate.io and CrateDB

Crate.io and CrateDB basics include:

In essence, CrateDB is an open source and less mature alternative to MemSQL. The opportunity for MemSQL and CrateDB alike exists in part because analytic RDBMS vendors didn’t close it off.

CrateDB’s not-just-relational story starts:

Crate gave an example of data from >800 kinds of sensors being stored together in a single table. This leads to significant complexity in the FROM clauses. But querying the same data in a relational schema would be at least as complicated, and probably worse.

One key to understanding Crate’s architectural choices is to note that they’re willing to have different latency/consistency standards for:

And so it makes sense that:

CrateDB will never have real multi-statement transactions, but it has simpler levels of isolation that may be called “transactions” in some marketing contexts.

CrateDB technical highlights include:

Crate.io is proud of its distributed/parallel story.

The CrateDB-Spark integration was the only support I could find for various marketing claims about combining analytics with data management.

Given how small and young Crate.io is, there are of course many missing features in CrateDB. In particular:

In any case, creating a robust DBMS is an expensive and time-consuming process. Crate has a long road ahead of it.

Edit: For some clarification and even correction, please see the first comment below. 🙂

Comments

3 Responses to “Introduction to Crate.io and CrateDB”

  1. Jodok on December 20th, 2016 8:27 am

    Hi Curt. I enjoyed our discussion last week. Thanks for the write up!

    The idea behind CrateDB is to build a distributed SQL database that you can deploy and scale like many of the popular NoSQL databases.

    As you point out, early generation analytic databases leave room for improvement in a number of areas: handling more complex queries, joins, aggregations, ad-hoc queries, non-tabular data types, and so on.

    Underlying the CrateDB SQL query engine are some ideas and code from projects like Elasticsearch and Lucene that start to solve a lot of this.

    To clarify a few things:

    CrateDB stores records as real JSON documents.

    The CrateDB data distribution engine borrows from Elasticsearch. So when discussing sharding, it’s not correct to say we use consistent hashing. Like Elasticsearch, CrateDB uses modulo-based routing of data to a fixed number of shards per physical table (the number of shards is user defined at table creation time).

    We create more elasticity friendliness via partition tables, which are logical views. For example, a new partition can be created for every new day in a log entry date column. You can vary the number of shards per partition as needed.

    We leverage Lucene Field-level Caches (memory based) and Lucene Index Doc Values (can be memory based or persistent on disk) (http://blog.trifork.com/2011/10/27/introducing-lucene-index-doc-values/) to speed up data retrieval; both for single-record look up and query processing.

    -Jodok

  2. Manoj on January 3rd, 2017 9:22 pm

    Nice Article ! Useful and informative

  3. David Gruzman on January 26th, 2017 6:07 am

    As far as I understood CrateDB is working with arbitrary JSON. Giving good support for nested data from JSON is a challenge. So I would like to ask:
    1. what are SQL language extensions CrateDB have to work with JSON?
    2. what are capabilities to “discover” actual schema using queries?

Leave a Reply




Feed: DBMS (database management system), DW (data warehousing), BI (business intelligence), and analytics technology Subscribe to the Monash Research feed via RSS or email:

Login

Search our blogs and white papers

Monash Research blogs

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.