September 2, 2008

Introduction to Aster Data and nCluster

I’ve been writing a lot about Greenplum since a recent visit. But on the same trip I met with Aster Data, and have talked with them further since. Let me now redress the balance and outline some highlights of the Aster Data story.

The basics include:

Aster’s approach to parallel query is interesting; it focuses heavily on reducing the amount of data that needs to be moved around.
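One standard way MPP databases reduce data movement is two-phase aggregation: each node aggregates its own shard first, so only one partial row per group key crosses the network instead of every input row. Here is a minimal Python sketch of that general idea; it illustrates the technique, not Aster's actual implementation.

```python
from collections import defaultdict

def local_group_by(rows):
    """Phase 1 (per node): aggregate locally so only one row per
    group key is shipped, instead of every input row."""
    partial = defaultdict(int)
    for key, value in rows:
        partial[key] += value
    return dict(partial)

def merge_partials(partials):
    """Phase 2 (coordinator): combine the per-node partial results."""
    total = defaultdict(int)
    for partial in partials:
        for key, value in partial.items():
            total[key] += value
    return dict(total)

# Two hypothetical nodes, each holding a shard of (key, value) rows.
node_a = [("x", 1), ("y", 2), ("x", 3)]
node_b = [("y", 4), ("z", 5)]

result = merge_partials([local_group_by(node_a), local_group_by(node_b)])
print(result)  # {'x': 4, 'y': 6, 'z': 5}
```

Here only three partial rows move over the network rather than five input rows; on real workloads, where group counts are far smaller than row counts, the savings are much larger.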

And to answer a question that really should be asked of all MPP DBMS vendors: when the query executor breaks queries into multiple parts, some of those parts can be primitives rather than just more SQL.

Aster Data also has a pretty interesting story about MPP manageability, based on what seems to be a fair amount of autonomic computing. In particular, you can plug in bare metal – without even an operating system – and the system will install and incorporate it. All this happens in 30 minutes. Even if a node goes down, failover is handled so automagically that queries don’t fail. (Of course, there’s a performance blip.) Backup and bulk data transfer/loading are both parallel and incremental. The system does not use any empty hot standbys. (That said, if Aster’s evolution parallels other vendors’, hot spare disks may eventually show up in the architecture.)

There are more parts of the Aster Data story I want to write about, namely node heterogeneity and MapReduce syntax, but for now I’ll stop here and post this. I’d also like to point you at the Aster Data blog, which is remarkable in its level of architectural detail.


4 Responses to “Introduction to Aster Data and nCluster”

  1. Steve Wooledge on September 3rd, 2008 2:21 pm

    Hi Curt,

    Thanks for the post. Just a couple points for clarification:

    [1] At MySpace, every piece of data has 2 copies on distinct nodes. More specifically, MySpace, like our other customers, uses RAID 0 on the Aster Worker nodes and RAID 10 on the Aster Queen nodes. [More on our 3-tiered architecture here.] Our recommendation is to always use RAID 0 on the workers, because it gives you better performance when a disk fails: with RAID 10, if a disk fails, the node stays available, but the performance of that node drops by 50% (and thus, in effect, the performance of the cluster). Because we have full replication and transparent failover, if a disk fails in Aster nCluster the entire node goes down, but nCluster’s performance only drops by 1/n (where n is the number of nodes).

    [2] Re: “parallel query” – local GROUP BYs are one example; our query optimization algorithms cover the relational algebra, not just one case.

  2. Roger on September 3rd, 2008 5:53 pm

    What’s their pricing model? Per terabyte? And how much does it cost?

  3. Web analytics — clickstream and network event data | DBMS2 -- DataBase Management System Services on September 22nd, 2008 6:11 am

    […] Data’s largest disclosed database, by almost two orders of magnitude, is at […]

  4. Confluence: Client: Telefonica I+D on February 5th, 2010 11:23 am

    personalization server old code and architecture review…

    (from SVN) (see part list at…
