Teradata Aster 6 has been preannounced (beta in Q4, general release in Q1 2014). The general architectural idea is:
- There are multiple data stores, the first two of which are:
- The classic Aster relational data store.
- A file system that emulates HDFS (Hadoop Distributed File System).
- There are multiple processing “engines”, where an engine is what occupies and controls a processing thread. These start with:
- Generic analytic SQL, as Aster has had all along.
- SQL-MR, the MapReduce Aster has also had all along.
- SQL-Graph aka SQL-GR, a graph analytics system.
- The Aster parser and optimizer accept glorified SQL, and work across all the engines combined.
There’s much more, of course, but those are the essential pieces.
Just to be clear: Teradata Aster 6, aka the Teradata Aster Discovery Platform, includes HDFS compatibility, native MapReduce and ways of invoking Hadoop MapReduce on non-Aster nodes or clusters — but even so, you can’t run Hadoop MapReduce within Aster over Aster’s version of HDFS.
The most dramatic immediate additions are in the graph analytics area.* The new SQL-Graph is supported by something called BSP (Bulk Synchronous Parallel). I’ll start by observing (and some of this is confusing):
- BSP was thought of a long time ago, as a general-purpose computing model, but recently has come to the fore specifically for graph analytics. (Think Pregel and Giraph, along with Teradata Aster.)
- BSP has a kind of execution-graph metaphor, which is different from the graph data it helps analyze.
- BSP is described as being a combination hardware/software technology, but Teradata Aster and everybody else I know of implements it in software only.
- Aster long ago talked of adding a graph data store, but has given up that plan; rather, it wants you to do graph analytics on data stored in tables (or accessed through views) in the usual way.
Use cases suggested are a lot of marketing, plus anti-fraud.
*Pay no attention to Aster’s previous claims to do a good job on graph — and not only via nPath — in SQL-MR.
So far as I can infer from examples I’ve seen, the semantics of Teradata Aster SQL-Graph start:
- Ordinary SQL except in the FROM clause.
- Functions/operators that are the arguments for FROM; of course, they output tables. You can write these yourself, or use Teradata Aster’s prebuilt ones.
Within those functions, the core idea is:
- Various tables are explicitly given the roles of “Vertices”, “Edges”, and so on. (It can get reasonably complicated; e.g., “Vertices_1″ and “Vertices_2″ for a bipartite graph.)
- Those “tables” can actually instead be views, subqueries or whatever.
Specific prebuilt functions — Aster is big on prebuilt functions — include but surely aren’t limited to:
- PageRank (which of course generally is a way to estimate individual vertices’ relative influence).
- Various things that seem to focus on measuring which relationships are or aren’t significant. (I’m not sure whether they’re NDA or not, so to stay on the safe side I won’t spell them out.)
Truth be told, these prebuilt functions sound pretty interesting.
As for underpinnings — the idea behind BSP is:
- You have a computing job that is both iterative and parallel.
- You parallelize it among a bunch of logical vertices, which may or may not correspond to physical computing servers.
- The job is broken up into “supersteps”, wherein local processing happens at each vertex.
- At the end of a superstep, each vertex can send messages to other vertices. The next superstep can’t start until all the messages have arrived.
Hopefully, various problems with message latency and unreliability that arise in other models of parallel computing are obviated by BSP.
So why use BSP for graph analytics? Well, it’s pretty obvious why BSP would be a decent model; the real question is why something that relies on classical data partitioning isn’t even better. And of course the answer to that one is that data partitioning doesn’t work for most graphs; whatever you do, there are going to be a whole lot of edges crossing partition boundaries.
Real-world graphs have short average path lengths — Six Degrees of Separation and all that. While that isn’t in itself a proof that partitioning can’t work, it should at least serve as a strong plausibility argument.
Since this is a first release of a graph-processing capability, it’s safe to assume there’s a lot missing. For example, every SQL-GR graph operation starts by retrieving data and building a graph; there’s no reuse. I presume that some analytic operations aren’t explicitly supported yet, or are of questionable performance. (Subgraph pattern matching was mentioned as an area that was not yet optimized for.) But with all those caveats, this still feels like a pretty interesting entry into the relationship analytics market.