I talked with SAS about its new approach to parallel modeling. The two key points are:
- SAS no longer plans to go as far with in-database modeling as it previously intended.
- Rather, SAS plans to run in RAM on MPP DBMS appliances, exploiting MPI (Message Passing Interface).
The whole thing is called SAS HPA (High-Performance Analytics), in an obvious reference to HPC (High-Performance Computing). It will run initially on RAM-heavy appliances from Teradata and EMC Greenplum.
A lot of what’s going on here is that SAS found it annoyingly difficult to parallelize modeling within the framework of a massively parallel DBMS such as Teradata. Notes on that aspect include:
- SAS wasn’t exploiting the capabilities of individual DBMS to their fullest; rather, it was looking for an approach that would work across multiple brands of DBMS. Thus, for example, the fact that Aster’s analytic platform architecture is more flexible or powerful than Teradata’s didn’t help much with making SAS run within the Aster nCluster database.
- Notwithstanding everything else, SAS did make a certain set of modeling procedures run in-database.
- SAS’ previous plans to run in-database modeling in Aster and/or Netezza DBMS may never come to fruition.
SAS’ problems developing in-database modeling stem from, in essence, the limitations of UDFs (User Defined Functions). So why weren’t, for example, Teradata’s 2009 enhancements to its UDF capabilities enough? The clearest example SAS gave me is that, while database tables are commonly limited to something on the order of 1000 columns (their figure as well as mine), SAS might need 50-100,000 columns. One reason seems to be interactions between variables; SAS used the word “multiplied” a few times, but even so was coy about whether this could simply be regarded as quadratic terms in a regression. Another reason seems to be that in some cases, every value in a column spawns a new column in an intermediate table/array; indeed, this seems to be going on in the previously discussed case of logistic regression.
SAS code will be launched by the DBMS/data warehouse appliances, so potentially it can run under their native workload management. Teradata presumably has enough workload management richness to exploit that; EMC Greenplum, as of my August 2010 notes, probably did not.
SAS was gracious enough to let me post its slide deck, in both shorter and longer versions. Due to a technical glitch during the call, I neither looked at the slides nor took notes. I think the biggest loss from those difficulties is that I didn’t learn what the futures at the end of the longer deck were all about.
- Application areas for SAS HPA (April, 2011)
- SAS’ MPP story as of May, 2010
- SAS’ plans to run in-database on Teradata (October, 2007)