In-memory, parallel, not-in-database SAS HPA does make sense after all
I talked with SAS about its new approach to parallel modeling. The two key points are:
- SAS no longer plans to go as far with in-database modeling as it previously intended.
- Rather, SAS plans to run in RAM on MPP DBMS appliances, exploiting MPI (Message Passing Interface).
The whole thing is called SAS HPA (High-Performance Analytics), in an obvious reference to HPC (High-Performance Computing). It will run initially on RAM-heavy appliances from Teradata and EMC Greenplum.
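To make that concrete, here's a minimal sketch of the general pattern in Python with mpi4py. It is my own illustration, not anything SAS ships; the data and node counts are hypothetical. The idea is simply that each node keeps its slice of the data in RAM, computes partial statistics locally, and combines them with MPI collectives rather than pushing the work through the SQL layer.

```python
# Minimal mpi4py sketch of the general pattern: each node keeps its data
# partition in RAM, computes local statistics, and combines them with MPI
# collectives instead of going through the DBMS's SQL layer.
# (Illustrative only -- not SAS code.)
# Run with e.g.:  mpiexec -n 4 python mean_via_mpi.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Stand-in for the node-local slice of a table already loaded into memory.
local_partition = np.random.default_rng(seed=rank).normal(size=1_000_000)

# Each node reduces its own partition...
local_sum = local_partition.sum()
local_count = local_partition.size

# ...and a pair of allreduce calls gives every node the global statistics.
global_sum = comm.allreduce(local_sum, op=MPI.SUM)
global_count = comm.allreduce(local_count, op=MPI.SUM)

if rank == 0:
    print("global mean:", global_sum / global_count)
```

The point of the pattern is that only a handful of partial results ever cross the interconnect, not rows of data.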
A lot of what’s going on here is that SAS found it annoyingly difficult to parallelize modeling within the framework of a massively parallel DBMS such as Teradata. Notes on that aspect include:
- SAS wasn’t exploiting the capabilities of individual DBMS to their fullest; rather, it was looking for an approach that would work across multiple brands of DBMS. Thus, for example, the fact that Aster’s analytic platform architecture is more flexible or powerful than Teradata’s didn’t help much with making SAS run within the Aster nCluster database.
- Notwithstanding everything else, SAS did make a certain set of modeling procedures run in-database.
- SAS’ previous plans to run in-database modeling in Aster and/or Netezza DBMS may never come to fruition.
SAS’ problems developing in-database modeling stem from, in essence, the limitations of UDFs (User Defined Functions). So why weren’t, for example, Teradata’s 2009 enhancements to its UDF capabilities enough? The clearest example SAS gave me is that, while database tables are commonly limited to something on the order of 1000 columns (their figure as well as mine), SAS might need 50-100,000 columns. One reason seems to be interactions between variables; SAS used the word “multiplied” a few times, but even so was coy about whether this could simply be regarded as quadratic terms in a regression. Another reason seems to be that in some cases, every value in a column spawns a new column in an intermediate table/array; indeed, this seems to be going on in the previously discussed case of logistic regression.
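Since SAS was coy about the specifics, here's some back-of-the-envelope arithmetic with purely hypothetical numbers of my own choosing; it just shows how quickly interaction terms and per-value expansion blow past a ~1,000-column table limit.

```python
# Purely illustrative arithmetic (the specific numbers are assumptions, not
# SAS's): how a table of a few hundred variables can turn into a
# 50-100,000-column intermediate structure.
from math import comb

base_numeric = 300                              # hypothetical numeric predictors
categorical_levels = [50, 200, 2_000, 10_000]   # hypothetical distinct values per categorical column

pairwise_interactions = comb(base_numeric, 2)   # the "multiplied" variables
dummy_columns = sum(categorical_levels)         # one new column per value

total = base_numeric + pairwise_interactions + dummy_columns
print(pairwise_interactions, dummy_columns, total)
# 300 numeric columns alone yield 44,850 pairwise interaction terms --
# already far past the ~1,000-column ceiling of a typical database table.
```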
SAS code will be launched by the DBMS/data warehouse appliances, so potentially it can run under their native workload management. Teradata presumably has enough workload management richness to exploit that; EMC Greenplum, as of my August 2010 notes, probably did not.
SAS was gracious enough to let me post its slide deck, in both shorter and longer versions. Due to a technical glitch during the call, I neither looked at the slides nor took notes. I think the biggest loss from those difficulties is that I didn’t learn what the futures at the end of the longer deck were all about.
Related links
- Application areas for SAS HPA (April, 2011)
- SAS’ MPP story as of May, 2010
- SAS’ plans to run in-database on Teradata (October, 2007)
Comments
[…] I talked with SAS about its forthcoming in-memory parallel SAS HPA offering, we talked briefly about application areas. The three SAS cited […]
I think it is more than just the column limit; the data access and internode communication algorithms that work well for MPP SQL are not suited to statistical analysis. Stats is much less set-based when you come down to it.
The key bits I got from the deck are:
- Multi-pass methods: only the first pass should hit disk; keep the data memory-resident.
- Even ostensibly simple problems might require more than one pass.
- Chatty node-to-node communication.
These are similar to the problems the graph analysis folks are trying to solve with Pregel and Hama; it's more of a BSP-style compute model
(http://en.wikipedia.org/wiki/Bulk_Synchronous_Parallel)
You might want to take a look at this Google tech talk; I think it's relevant:
http://www.youtube.com/watch?v=PBLgUBGWcz8
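For what it's worth, here's a minimal sketch of that pattern, again in Python/mpi4py and again my own illustration rather than anything from the SAS deck (the model, data sizes, and learning rate are all made up): a logistic-regression-style gradient descent where the data stays memory-resident after the first pass, and each iteration is a BSP-style superstep of local computation followed by a synchronizing allreduce.

```python
# Sketch of multi-pass, memory-resident, BSP-style computation (illustrative
# only): every iteration does local work on the node's in-RAM partition, then
# a synchronizing allreduce of the partial gradients.
# Run with e.g.:  mpiexec -n 4 python logit_gd_mpi.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

rng = np.random.default_rng(seed=rank)
X = rng.normal(size=(100_000, 20))             # node-local features, kept in RAM
y = (rng.random(100_000) < 0.5).astype(float)  # node-local labels
w = np.zeros(20)                               # model replicated on every node

n_total = comm.allreduce(X.shape[0], op=MPI.SUM)

for step in range(50):                         # each pass touches memory only
    p = 1.0 / (1.0 + np.exp(-X @ w))           # local predictions
    local_grad = X.T @ (p - y)                 # local gradient contribution
    grad = comm.allreduce(local_grad, op=MPI.SUM) / n_total  # superstep sync
    w -= 0.5 * grad                            # identical update on all nodes

if rank == 0:
    print("first few coefficients:", w[:5])
```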
Camuel Gilyadov has been propagandizing me about Pregel, and I haven’t been seeing the point. Thanks!
[…] SAS HPA makes the argument that even “big data analytics” should sometimes be done in RAM. […]
[…] find that credible because of the Greenplum/SAS/MPI […]