October 10, 2013

Aster 6, graph analytics, and BSP

Teradata Aster 6 has been preannounced (beta in Q4, general release in Q1 2014). The general architectural idea is:

There’s much more, of course, but those are the essential pieces.

Just to be clear: Teradata Aster 6, aka the Teradata Aster Discovery Platform, includes HDFS compatibility, native MapReduce and ways of invoking Hadoop MapReduce on non-Aster nodes or clusters — but even so, you can’t run Hadoop MapReduce within Aster over Aster’s version of HDFS.

The most dramatic immediate additions are in the graph analytics area.* The new SQL-Graph is supported by something called BSP (Bulk Synchronous Parallel). I’ll start by observing (and some of this is confusing):

Use cases suggested are a lot of marketing, plus anti-fraud.

*Pay no attention to Aster’s previous claims to do a good job on graph — and not only via nPath — in SQL-MR.

So far as I can infer from examples I’ve seen, the semantics of Teradata Aster SQL-Graph start:

Within those functions, the core idea is: 

Specific prebuilt functions — Aster is big on prebuilt functions — include but surely aren’t limited to:

Truth be told, these prebuilt functions sound pretty interesting.

As for underpinnings — the idea behind BSP is:

Hopefully, various problems with message latency and unreliability that arise in other models of parallel computing are obviated by BSP.

So why use BSP for graph analytics? Well, it’s pretty obvious why BSP would be a decent model; the real question is why something that relies on classical data partitioning isn’t even better. And of course the answer to that one is that data partitioning doesn’t work for most graphs; whatever you do, there are going to be a whole lot of edges crossing partition boundaries.

Real-world graphs have short average path lengths — Six Degrees of Separation and all that. While that isn’t in itself a proof that partitioning can’t work, it should at least serve as a strong plausibility argument.

Since this is a first release of a graph-processing capability, it’s safe to assume there’s a lot missing. For example, every SQL-GR graph operation starts by retrieving data and building a graph; there’s no reuse. I presume that some analytic operations aren’t explicitly supported yet, or are of questionable performance. (Subgraph pattern matching was mentioned as an area that was not yet optimized for.) But with all those caveats, this still feels like a pretty interesting entry into the relationship analytics market.

Comments

4 Responses to “Aster 6, graph analytics, and BSP”

  1. Ofir on October 10th, 2013 10:09 am

    Just wanted to add that Apache Giraph also provides BSP-based graph processing (for Hadoop). Giraph recently got a nice backing when facebook detailed the reasons they picked it last year and the improvements they implemented and contributed since then, to support processing a billion edges. Interesting read:
    https://www.facebook.com/notes/facebook-engineering/scaling-apache-giraph-to-a-trillion-edges/10151617006153920
    I find it likely that Cloudera and Hortonworks will add support for this (or alternative graph libraries) rather soon, so it would allow Hadoop to be an alternative end-to-end batch / SQL / Graph ( / NoSQL / streaming) platform with commercial support. But propriety alternatives have their place, of course.

  2. DB2 Hub | Aster 6, graph analytics, and BSP on October 16th, 2013 9:37 am

    [...] Aster 6, graph analytics, and BSP Tags: arguments, aster, calpont, eai, eii, etl, elt, etlt, facebook, ibm, news, predictive modeling, rte, services, teradata, typeYou might also like ✎ ✎ ✎ ✎ ✎ ✎ ✎ ✎ /* [...]

  3. Rob Klopp on October 21st, 2013 12:22 pm

    Hi Curt,

    I apologize if this is a silly question… I am behind on my study of graph processing… but I you have to define a table as a vertex (or an edge) then why wouldn’t data partitioning work? Doesn’t the scheme suggest that each row in a vertex table is a vertex and edges get sent as messages to “join” to the vertex just like rows are sent to join today?

    Rob

  4. Curt Monash on October 21st, 2013 3:45 pm

    Hi Rob,

    Let’s consider the very simple case of a directed graph with unweighted edges. That can be represented in a relational database as a two-column table:

    StartingVertex
    EndingVertex

    A row in the table just memorializes the existence of an edge.

    If you want to ask the question “Which nodes are connected to others by exactly three hops?”, you auto-join the table to itself three times; a three-hop path will just be a row in the joined table.

    It is an empirical fact/practical matter that — in most cases — to execute those joins you have to do a huge amount of data reshuffling; there will be no data distribution/partitioning scheme that can bail you out. Or, if there is such a scheme, it hasn’t been discovered and proven yet.

Leave a Reply




Feed: DBMS (database management system), DW (data warehousing), BI (business intelligence), and analytics technology Subscribe to the Monash Research feed via RSS or email:

Login

Search our blogs and white papers

Monash Research blogs

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.