July 19, 2012

A data distribution idea at Vertica and Clustrix

Yesterday I wrote:

Clustrix has one cool idea I haven’t heard from anybody else, which I’m calling index distribution. The idea is that each index can be distributed differently across the cluster …  i.e. on different distribution keys. Clustrix thinks that paying special attention to index distribution and movement is helpful to the performance of distributed joins.

While that’s true, I thought I’d heard something similar from Vertica; so I checked, and indeed I had. Vertica famously lets you store columns in different sort orders, in both reasonable senses:

It turns out those columns can also be distributed on different keys as well.

Related link

Comments

5 Responses to “A data distribution idea at Vertica and Clustrix”

  1. John Hugg on July 19th, 2012 6:26 pm

    Back in the H-Store days, we considered this feature for what became VoltDB as well (some of us were Vertica folks after all).

    We decided this was difficult to leverage in always-available high-throughput OLTP workloads. If a machine that has the index you need for a particular query fails, then that query goes from an index scan to a seqential scan. The difference in performance can be as many as 5 orders of magnitude in not unrealistic cases.

    To make this feature really useful, you’d have to use non-replicated indexes only for a non-critical part of your application, and you’d have to be able to turn these queries off or throttle them in the face of node failure.

    Of course, if you’re replicating just for durability and not for availability, then I suppose you gain some storage efficiency.

  2. Curt Monash on July 19th, 2012 8:55 pm

    Hi John,

    Clustrix indexes have replication factors too. Sergei told me reasonable defaults were replication factors of 3 for data, 2 for indexes. But a particular crucial index could have a higher factor.

  3. Sergei Tsarev on July 20th, 2012 10:44 am

    A typical Clustrix customer has 100s of tables, and an average table will have several indexes (usually more than 2 but less than 10 – but there are outliers of course).

    Most of these apps have machine generated queries, with a typical app generating 1,000s of distinct query signatures. That’s a lot of query – index interactions to track.

    It would be impractical for most of our customers to identify which sets of indexes are less important than others. And if you get it wrong, some portion of your app becomes broken in a simple failure case. It’s just not worth it to try and save a few writes.

  4. clieu on May 9th, 2013 5:55 pm

    Clustrix has some new technical documentation that describes how data distribution works:
    http://docs.clustrix.com/display/CLXDOC/Data+Distribution

  5. Some Vertica 6 features | DBMS 2 : DataBase Management System Services on January 31st, 2014 9:15 am

    [...] stand for now on my previous coverage of Vertica’s database [...]

Leave a Reply




Feed: DBMS (database management system), DW (data warehousing), BI (business intelligence), and analytics technology Subscribe to the Monash Research feed via RSS or email:

Login

Search our blogs and white papers

Monash Research blogs

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.