May 2, 2014

Introduction to CitusDB

One of my lesser-known clients is Citus Data, a largely Turkish company that is however headquartered in San Francisco. They make CitusDB, which puts a scale-out layer over a collection of fully-functional PostgreSQL nodes, much like Greenplum and Aster Data before it. However, in contrast to those and other Postgres-based analytic MPP (Massively Parallel Processing) DBMS:

*One benefit to this strategy, besides the usual elasticity and recovery stuff, is that while PostgreSQL may be single-core for any given query, a CitusDB query can use multiple cores by virtue of hitting multiple PostgreSQL tables on each node.

Citus has thrown a few things against the wall; for example, there are two versions of its product, one which involves HDFS (Hadoop Distributed File System) and one of which doesn’t. But I think Citus’ focus will be scale-out PostgreSQL for at least the medium-term future. Citus does have actual customers, and they weren’t all PostgreSQL users previously. Still, the main hope — at least until the product is more built-out — is that existing PostgreSQL users will find CitusDB easy to adopt, in technology and price alike.

Notwithstanding what I said about “fat heads”, CitusDB does have a concept of Master nodes. These:

CitusDB is definitely in its early days. For example:

Still, the Citus Data folks seem to have good ideas, including some — as yet undisclosed — plans going forward. So if it sounds as if CitusDB might fit your needs better than more established scale-out RDBMS do, I’d encourage you to take a look at what Citus offers.

Comments

6 Responses to “Introduction to CitusDB”

  1. Umur Cubukcu on May 2nd, 2014 5:16 pm

    A few comments / clarifications:
    (1) We have fully parallelized COUNT DISTINCT and have customers relying on it in production. For fastest results, we also compute accurate approximations (via hyper-log-log cardinality estimation).

    (2) We use a “modular block” architecture that improves upon the approach of using multiple virtual nodes / database instances per physical node. The conceptual difference is that instead of having 1000 virtual databases on each node, CitusDB has 1 database with 1000 tables on a node. This is similar to HDFS in spirit, and has important implications on improved resource utilization, fault tolerance, and elastic scalability.

  2. Umur Cubukcu on May 2nd, 2014 5:19 pm

    CitusDB query optimization works in two stages: (i) The global optimizer first optimizes for network I/O, using the metadata it has about each modular block. (2) The local optimizer then optimizes for disk I/O, leveraging all the Postgres statistics collection. All put together, significant information is taken into account during query optimization — we are making further improvements, of course.

  3. Umur Cubukcu on May 2nd, 2014 6:16 pm

    On the count distincts, more precisely:
    - If the database is already partitioned on the distinct column, we give exact results by pushing down the distinct.
    - If not, we use the hyper-log-log approximation to push the parallelization. (We are using the popular postgres-hll extension for it.)

  4. Kelly Stirman on May 2nd, 2014 6:30 pm

    “CitusDB follows the modern best-practice of having many virtual nodes on each physical node.”

    why is this a best-practice? do you mean for PG, or for all software?

  5. Curt Monash on May 4th, 2014 4:15 am

    Kelly,

    I mean for a large fraction of all scale-out software. It makes things go better when you need to redistribute data for some reason — node outage, planned cluster expansion, whatever.

  6. Hans on May 5th, 2014 6:43 am

    Sounds promising. It would be interesting how CitusDB differs from Postgres-XC regarding architecture and features (sounds as if both are very similar in the approach they took).

Leave a Reply




Feed: DBMS (database management system), DW (data warehousing), BI (business intelligence), and analytics technology Subscribe to the Monash Research feed via RSS or email:

Login

Search our blogs and white papers

Monash Research blogs

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.