Aster 6, graph analytics, and BSP
Teradata Aster 6 has been preannounced (beta in Q4, general release in Q1 2014). The general architectural idea is:
- There are multiple data stores, the first two of which are:
- The classic Aster relational data store.
- A file system that emulates HDFS (Hadoop Distributed File System).
- There are multiple processing “engines”, where an engine is what occupies and controls a processing thread. These start with:
- Generic analytic SQL, as Aster has had all along.
- SQL-MR, the MapReduce Aster has also had all along.
- SQL-Graph aka SQL-GR, a graph analytics system.
- The Aster parser and optimizer accept glorified SQL, and work across all the engines combined.
There’s much more, of course, but those are the essential pieces.
Just to be clear: Teradata Aster 6, aka the Teradata Aster Discovery Platform, includes HDFS compatibility, native MapReduce and ways of invoking Hadoop MapReduce on non-Aster nodes or clusters — but even so, you can’t run Hadoop MapReduce within Aster over Aster’s version of HDFS.
The most dramatic immediate additions are in the graph analytics area.* The new SQL-Graph is supported by something called BSP (Bulk Synchronous Parallel). I’ll start by observing (and some of this is confusing):
- BSP was thought of a long time ago, as a general-purpose computing model, but recently has come to the fore specifically for graph analytics. (Think Pregel and Giraph, along with Teradata Aster.)
- BSP has a kind of execution-graph metaphor, which is different from the graph data it helps analyze.
- BSP is described as being a combination hardware/software technology, but Teradata Aster and everybody else I know of implements it in software only.
- Aster long ago talked of adding a graph data store, but has given up that plan; rather, it wants you to do graph analytics on data stored in tables (or accessed through views) in the usual way.
Use cases suggested are a lot of marketing, plus anti-fraud.
*Pay no attention to Aster’s previous claims to do a good job on graph — and not only via nPath — in SQL-MR.
So far as I can infer from examples I’ve seen, the semantics of Teradata Aster SQL-Graph start:
- Ordinary SQL except in the FROM clause.
- Functions/operators that are the arguments for FROM; of course, they output tables. You can write these yourself, or use Teradata Aster’s prebuilt ones.
Within those functions, the core idea is:
- Various tables are explicitly given the roles of “Vertices”, “Edges”, and so on. (It can get reasonably complicated; e.g., “Vertices_1” and “Vertices_2” for a bipartite graph.)
- Those “tables” can actually instead be views, subqueries or whatever.
Specific prebuilt functions — Aster is big on prebuilt functions — include but surely aren’t limited to:
- PageRank (which of course generally is a way to estimate individual vertices’ relative influence).
- Various things that seem to focus on measuring which relationships are or aren’t significant. (I’m not sure whether they’re NDA or not, so to stay on the safe side I won’t spell them out.)
Truth be told, these prebuilt functions sound pretty interesting.
As for underpinnings — the idea behind BSP is:
- You have a computing job that is both iterative and parallel.
- You parallelize it among a bunch of logical vertices, which may or may not correspond to physical computing servers.
- The job is broken up into “supersteps”, wherein local processing happens at each vertex.
- At the end of a superstep, each vertex can send messages to other vertices. The next superstep can’t start until all the messages have arrived.
Hopefully, various problems with message latency and unreliability that arise in other models of parallel computing are obviated by BSP.
So why use BSP for graph analytics? Well, it’s pretty obvious why BSP would be a decent model; the real question is why something that relies on classical data partitioning isn’t even better. And of course the answer to that one is that data partitioning doesn’t work for most graphs; whatever you do, there are going to be a whole lot of edges crossing partition boundaries.
Real-world graphs have short average path lengths — Six Degrees of Separation and all that. While that isn’t in itself a proof that partitioning can’t work, it should at least serve as a strong plausibility argument.
Since this is a first release of a graph-processing capability, it’s safe to assume there’s a lot missing. For example, every SQL-GR graph operation starts by retrieving data and building a graph; there’s no reuse. I presume that some analytic operations aren’t explicitly supported yet, or are of questionable performance. (Subgraph pattern matching was mentioned as an area that was not yet optimized for.) But with all those caveats, this still feels like a pretty interesting entry into the relationship analytics market.
Comments
4 Responses to “Aster 6, graph analytics, and BSP”
Leave a Reply
Just wanted to add that Apache Giraph also provides BSP-based graph processing (for Hadoop). Giraph recently got a nice backing when facebook detailed the reasons they picked it last year and the improvements they implemented and contributed since then, to support processing a billion edges. Interesting read:
https://www.facebook.com/notes/facebook-engineering/scaling-apache-giraph-to-a-trillion-edges/10151617006153920
I find it likely that Cloudera and Hortonworks will add support for this (or alternative graph libraries) rather soon, so it would allow Hadoop to be an alternative end-to-end batch / SQL / Graph ( / NoSQL / streaming) platform with commercial support. But propriety alternatives have their place, of course.
[…] Aster 6, graph analytics, and BSP Tags: arguments, aster, calpont, eai, eii, etl, elt, etlt, facebook, ibm, news, predictive modeling, rte, services, teradata, typeYou might also like ✎ ✎ ✎ ✎ ✎ ✎ ✎ ✎ /* […]
Hi Curt,
I apologize if this is a silly question… I am behind on my study of graph processing… but I you have to define a table as a vertex (or an edge) then why wouldn’t data partitioning work? Doesn’t the scheme suggest that each row in a vertex table is a vertex and edges get sent as messages to “join” to the vertex just like rows are sent to join today?
Rob
Hi Rob,
Let’s consider the very simple case of a directed graph with unweighted edges. That can be represented in a relational database as a two-column table:
StartingVertex
EndingVertex
A row in the table just memorializes the existence of an edge.
If you want to ask the question “Which nodes are connected to others by exactly three hops?”, you auto-join the table to itself three times; a three-hop path will just be a row in the joined table.
It is an empirical fact/practical matter that — in most cases — to execute those joins you have to do a huge amount of data reshuffling; there will be no data distribution/partitioning scheme that can bail you out. Or, if there is such a scheme, it hasn’t been discovered and proven yet.