April 21, 2011

In-memory, parallel, not-in-database SAS HPA does make sense after all

I talked with SAS about its new approach to parallel modeling. The two key points are:

SAS no longer plans to go as far with in-database modeling as it previously intended.
Rather, SAS plans to run in RAM on MPP DBMS appliances, exploiting MPI (Message Passing Interface).

The whole thing is called SAS HPA (High-Performance Analytics), in an obvious reference to HPC (High-Performance Computing). It will run initially on RAM-heavy appliances from Teradata and EMC Greenplum.
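SAS did not describe its implementation to me, so the following is only a generic sketch of the pattern the MPI approach suggests: each node keeps its partition of the data in RAM, every pass computes a local result, and the nodes combine those results with a sum reduction. Everything here is an assumption for illustration; the function names are hypothetical, the model is a plain logistic regression fitted by gradient descent, and `allreduce_sum` merely simulates in one process what an MPI allreduce would do across nodes.

```python
import math

def local_gradient(rows, weights):
    """Gradient of the logistic log-loss over one node's in-memory partition."""
    grad = [0.0] * len(weights)
    for features, label in rows:
        z = sum(w * x for w, x in zip(weights, features))
        p = 1.0 / (1.0 + math.exp(-z))
        for j, x in enumerate(features):
            grad[j] += (p - label) * x
    return grad

def allreduce_sum(vectors):
    """Single-process stand-in for an MPI allreduce: element-wise sum of all nodes' vectors."""
    return [sum(column) for column in zip(*vectors)]

def fit(partitions, n_features, passes=200, step=0.1):
    """Fit weights by gradient descent; each pass touches every partition once."""
    weights = [0.0] * n_features
    n_rows = sum(len(p) for p in partitions)
    for _ in range(passes):
        # Every pass re-reads the partitions, which is why they should stay in RAM.
        grads = [local_gradient(p, weights) for p in partitions]
        total = allreduce_sum(grads)  # one round of node-to-node communication per pass
        weights = [w - step * g / n_rows for w, g in zip(weights, total)]
    return weights

# Toy data: (features with a leading 1.0 for the intercept, label).
rows = [([1.0, -2.0], 0), ([1.0, -1.0], 0), ([1.0, 1.0], 1),
        ([1.0, 2.0], 1), ([1.0, -0.5], 1), ([1.0, 0.5], 0)]
weights_two_nodes = fit([rows[:3], rows[3:]], n_features=2)
weights_one_node = fit([rows], n_features=2)
```

Splitting the rows across two simulated nodes gives the same weights as a single node, up to floating-point rounding, because the summed gradient is identical. The cost is that every pass needs a communication step, which is awkward to express inside a SQL-oriented UDF framework.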

A lot of what’s going on here is that SAS found it annoyingly difficult to parallelize modeling within the framework of a massively parallel DBMS such as Teradata. Notes on that aspect include:

SAS wasn’t exploiting the capabilities of individual DBMS to their fullest; rather, it was looking for an approach that would work across multiple brands of DBMS. Thus, for example, the fact that Aster’s analytic platform architecture is more flexible or powerful than Teradata’s didn’t help much with making SAS run within the Aster nCluster database.
Notwithstanding everything else, SAS did make a certain set of modeling procedures run in-database.
SAS’ previous plans to run in-database modeling in Aster and/or Netezza DBMS may never come to fruition.

SAS’ problems developing in-database modeling stem from, in essence, the limitations of UDFs (User Defined Functions). So why weren’t, for example, Teradata’s 2009 enhancements to its UDF capabilities enough? The clearest example SAS gave me is that, while database tables are commonly limited to something on the order of 1000 columns (their figure as well as mine), SAS might need 50,000-100,000 columns. One reason seems to be interactions between variables; SAS used the word “multiplied” a few times, but even so was coy about whether this could simply be regarded as quadratic terms in a regression. Another reason seems to be that in some cases, every distinct value in a column spawns a new column in an intermediate table/array; indeed, this seems to be going on in the previously discussed case of logistic regression.
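To see how quickly the column count can grow, here is a back-of-the-envelope sketch. It is my own illustrative arithmetic, not anything SAS showed me, and it assumes two things SAS did not confirm: that every distinct value of a categorical variable becomes its own indicator column, and that "multiplied" means all pairwise products between columns that come from different source variables. The function name and the example figures are hypothetical.

```python
from itertools import combinations

def expanded_column_count(cardinalities, n_numeric, pairwise=True):
    """Count model columns after indicator coding and optional pairwise interactions.

    cardinalities: number of distinct values for each categorical variable
    n_numeric:     number of numeric variables (one column each)
    """
    # One indicator column per distinct value of each categorical variable.
    main_effects = n_numeric + sum(cardinalities)
    if not pairwise:
        return main_effects
    # Width of each source variable once expanded: 1 for numeric, k for a k-level categorical.
    widths = [1] * n_numeric + list(cardinalities)
    # Products are taken only between columns from two different source variables.
    interactions = sum(a * b for a, b in combinations(widths, 2))
    return main_effects + interactions

# 30 source columns: 20 numeric plus 10 categoricals with 30 distinct values each.
without_interactions = expanded_column_count([30] * 10, 20, pairwise=False)  # 320
with_interactions = expanded_column_count([30] * 10, 20)                     # 47010
```

Under those assumptions, a 30-column source table, well within any DBMS limit, turns into roughly 47,000 model columns, which is in the neighborhood of the figures SAS cited.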

SAS code will be launched by the DBMS/data warehouse appliances, so potentially it can run under their native workload management. Teradata presumably has enough workload management richness to exploit that; EMC Greenplum, as of my August 2010 notes, probably did not.

SAS was gracious enough to let me post its slide deck, in both shorter and longer versions. Due to a technical glitch during the call, I neither looked at the slides nor took notes. I think the biggest loss from those difficulties is that I didn’t learn what the futures at the end of the longer deck were all about.

Related links

Application areas for SAS HPA (April, 2011)
SAS’ MPP story as of May, 2010
SAS’ plans to run in-database on Teradata (October, 2007)

Categories: Aster Data, Data warehouse appliances, Data warehousing, EMC, Greenplum, Memory-centric data management, Netezza, Parallelization, Predictive modeling and advanced analytics, SAS Institute, Teradata, Workload management

Comments

7 Responses to “In-memory, parallel, not-in-database SAS HPA does make sense after all”

Application areas for SAS HPA | DBMS 2 : DataBase Management System Services on April 21st, 2011 3:24 am

[…] I talked with SAS about its forthcoming in-memory parallel SAS HPA offering, we talked briefly about application areas. The three SAS cited […]

unholyguy on April 21st, 2011 11:29 am

I think it is more than just the column limit: the data access and internode communication algorithms that work well for MPP SQL are not suited for statistical analysis. Stat is much less set-based when you come down to it.

The key bits I got from the deck are

– Multi-pass methods: only the first pass should hit disk; keep data memory-resident
– Even ostensibly simple problems might require more than one pass
– Chatty node-to-node communication

These are similar problems to what the graph analysis people are trying to solve through Pregel and Hama, more of a BSP-style compute

(http://en.wikipedia.org/wiki/Bulk_Synchronous_Parallel)

unholyguy on April 21st, 2011 11:30 am

You might want to take a look at this Google tech talk; I think it’s relevant

http://www.youtube.com/watch?v=PBLgUBGWcz8

Curt Monash on April 21st, 2011 3:07 pm

Camuel Gilyadov has been propagandizing me about Pregel, and I haven’t been seeing the point. Thanks!

High Performance Analytics « DECISION STATS on April 22nd, 2011 4:53 am

[…] http://www.dbms2.com/2011/04/21/sas-hpa-does-make-sense-after-all/ […]

Traditional databases will eventually wind up in RAM | DBMS 2 : DataBase Management System Services on May 23rd, 2011 11:06 am

[…] SAS HPA makes the argument that even “big data analytics” should sometimes be done in RAM. […]

Hadoop YARN — beyond MapReduce | DBMS 2 : DataBase Management System Services on July 23rd, 2012 4:26 am

[…] find that credible because of the Greenplum/SAS/MPI […]
