October 10, 2013

Aster 6, graph analytics, and BSP

Teradata Aster 6 has been preannounced (beta in Q4, general release in Q1 2014). The general architectural idea is:

There are multiple data stores, the first two of which are:
- The classic Aster relational data store.
- A file system that emulates HDFS (Hadoop Distributed File System).
There are multiple processing “engines”, where an engine is what occupies and controls a processing thread. These start with:
- Generic analytic SQL, as Aster has had all along.
- SQL-MR, the MapReduce Aster has also had all along.
- SQL-Graph aka SQL-GR, a graph analytics system.
The Aster parser and optimizer accept glorified SQL, and work across all the engines combined.

There’s much more, of course, but those are the essential pieces.

Just to be clear: Teradata Aster 6, aka the Teradata Aster Discovery Platform, includes HDFS compatibility, native MapReduce and ways of invoking Hadoop MapReduce on non-Aster nodes or clusters — but even so, you can’t run Hadoop MapReduce within Aster over Aster’s version of HDFS.

The most dramatic immediate additions are in the graph analytics area.* The new SQL-Graph is supported by something called BSP (Bulk Synchronous Parallel). I’ll start by observing (and some of this is confusing):

BSP was thought of a long time ago, as a general-purpose computing model, but recently has come to the fore specifically for graph analytics. (Think Pregel and Giraph, along with Teradata Aster.)
BSP has a kind of execution-graph metaphor, which is different from the graph data it helps analyze.
BSP is described as being a combination hardware/software technology, but Teradata Aster and everybody else I know of implements it in software only.
Aster long ago talked of adding a graph data store, but has given up that plan; rather, it wants you to do graph analytics on data stored in tables (or accessed through views) in the usual way.

Use cases suggested are a lot of marketing, plus anti-fraud.

*Pay no attention to Aster’s previous claims to do a good job on graph — and not only via nPath — in SQL-MR.

So far as I can infer from examples I’ve seen, the semantics of Teradata Aster SQL-Graph start:

Ordinary SQL except in the FROM clause.
Functions/operators that are the arguments for FROM; of course, they output tables. You can write these yourself, or use Teradata Aster’s prebuilt ones.

Within those functions, the core idea is:

Various tables are explicitly given the roles of “Vertices”, “Edges”, and so on. (It can get reasonably complicated; e.g., “Vertices_1” and “Vertices_2” for a bipartite graph.)
Those “tables” can actually instead be views, subqueries or whatever.

Specific prebuilt functions — Aster is big on prebuilt functions — include but surely aren’t limited to:

PageRank (which of course generally is a way to estimate individual vertices’ relative influence).
Various things that seem to focus on measuring which relationships are or aren’t significant. (I’m not sure whether they’re NDA or not, so to stay on the safe side I won’t spell them out.)

Truth be told, these prebuilt functions sound pretty interesting.

As for underpinnings — the idea behind BSP is:

You have a computing job that is both iterative and parallel.
You parallelize it among a bunch of logical vertices, which may or may not correspond to physical computing servers.
The job is broken up into “supersteps”, wherein local processing happens at each vertex.
At the end of a superstep, each vertex can send messages to other vertices. The next superstep can’t start until all the messages have arrived.

Hopefully, various problems with message latency and unreliability that arise in other models of parallel computing are obviated by BSP.

So why use BSP for graph analytics? Well, it’s pretty obvious why BSP would be a decent model; the real question is why something that relies on classical data partitioning isn’t even better. And of course the answer to that one is that data partitioning doesn’t work for most graphs; whatever you do, there are going to be a whole lot of edges crossing partition boundaries.

Real-world graphs have short average path lengths — Six Degrees of Separation and all that. While that isn’t in itself a proof that partitioning can’t work, it should at least serve as a strong plausibility argument.

Since this is a first release of a graph-processing capability, it’s safe to assume there’s a lot missing. For example, every SQL-GR graph operation starts by retrieving data and building a graph; there’s no reuse. I presume that some analytic operations aren’t explicitly supported yet, or are of questionable performance. (Subgraph pattern matching was mentioned as an area that was not yet optimized for.) But with all those caveats, this still feels like a pretty interesting entry into the relationship analytics market.

Categories: Application areas, Aster Data, Business intelligence, Data models and architecture, Data warehousing, Hadoop, Parallelization, Predictive modeling and advanced analytics, RDF and graphs, Teradata

Subscribe to our complete feed!

Comments

4 Responses to “Aster 6, graph analytics, and BSP”

Ofir on October 10th, 2013 10:09 am

Just wanted to add that Apache Giraph also provides BSP-based graph processing (for Hadoop). Giraph recently got a nice backing when facebook detailed the reasons they picked it last year and the improvements they implemented and contributed since then, to support processing a billion edges. Interesting read:
https://www.facebook.com/notes/facebook-engineering/scaling-apache-giraph-to-a-trillion-edges/10151617006153920
I find it likely that Cloudera and Hortonworks will add support for this (or alternative graph libraries) rather soon, so it would allow Hadoop to be an alternative end-to-end batch / SQL / Graph ( / NoSQL / streaming) platform with commercial support. But propriety alternatives have their place, of course.
DB2 Hub | Aster 6, graph analytics, and BSP on October 16th, 2013 9:37 am

[…] Aster 6, graph analytics, and BSP Tags: arguments, aster, calpont, eai, eii, etl, elt, etlt, facebook, ibm, news, predictive modeling, rte, services, teradata, typeYou might also like ✎ ✎ ✎ ✎ ✎ ✎ ✎ ✎ /* […]
Rob Klopp on October 21st, 2013 12:22 pm

Hi Curt,

I apologize if this is a silly question… I am behind on my study of graph processing… but I you have to define a table as a vertex (or an edge) then why wouldn’t data partitioning work? Doesn’t the scheme suggest that each row in a vertex table is a vertex and edges get sent as messages to “join” to the vertex just like rows are sent to join today?

Rob
Curt Monash on October 21st, 2013 3:45 pm

Hi Rob,

Let’s consider the very simple case of a directed graph with unweighted edges. That can be represented in a relational database as a two-column table:

StartingVertex
EndingVertex

A row in the table just memorializes the existence of an edge.

If you want to ask the question “Which nodes are connected to others by exactly three hops?”, you auto-join the table to itself three times; a three-hop path will just be a row in the joined table.

It is an empirical fact/practical matter that — in most cases — to execute those joins you have to do a huge amount of data reshuffling; there will be no data distribution/partitioning scheme that can bail you out. Or, if there is such a scheme, it hasn’t been discovered and proven yet.

Leave a Reply

Search our blogs and white papers

Monash Research blogs

DBMS 2 covers database management, analytics, and related technologies.
Text Technologies covers text mining, search, and social software.
Strategic Messaging analyzes marketing and messaging strategy.
The Monash Report examines technology and public policy issues.
Software Memories recounts the history of the software industry.

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.

Links
- Monash Research
- White Papers
Admin
- Log in

Aster 6, graph analytics, and BSP

Comments

Search our blogs and white papers

Monash Research blogs

User consulting

Vendor advisory

Monash Research highlights

Recent posts

Categories

Date archives

Admin