May 7, 2010

Clarifying the state of MPP in-database SAS

I am routinely briefed well in advance of product introductions. For that reason and others, it can be hard for me to keep straight what’s been officially announced, introduced for test, introduced for general availability, vaguely planned for the indefinite future, and so on. Perhaps nothing has confused me more in that regard than the SAS Institute’s multi-year effort to get SAS integrated into various MPP DBMS, specifically Teradata, Netezza Twinfin(i), and Aster Data nCluster.

However, I chatted briefly Thursday with Michelle Wilkie, who is the SAS product manager overseeing all this (and also some other stuff, like SAS running on grids without being integrated into a DBMS). As best I understood, the story is:

I also took the opportunity to ask Michelle a question I’ve had a heck of a time getting answered: What’s the big deal about in-database data mining scoring anyway? After all, the most common form of in-database data mining scoring is just to take a weighted sum of specific fields in a row, where the weights are the regression coefficients. You can do that in generic SQL, with performance that superficially should be at least as good as that for any alternative strategy. Michelle’s answers seemed to be twofold:

Edit: In response to this post, SAS wrote in with further clarification about in-database and/or MPP SAS.
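To make the "weighted sum in generic SQL" point concrete, here is a minimal sketch in Python, using the standard-library sqlite3 module as a stand-in for an MPP DBMS. The table, column names, intercept, and coefficients are all invented for illustration; nothing here reflects how SAS or any of the vendors mentioned actually implements in-database scoring.

```python
import sqlite3

# Hypothetical model: an intercept plus one regression coefficient per column.
INTERCEPT = 1.0
WEIGHTS = {"age": 0.5, "income": 0.001, "tenure": -2.0}


def build_scoring_query(table, weights):
    """Return SQL that computes intercept + sum(weight * column) for each row."""
    # Column names cannot be bound as parameters, so they are interpolated.
    # They come from a trusted dict here, not from user input.
    terms = " + ".join(f"? * {column}" for column in weights)
    return f"SELECT id, ? + {terms} AS score FROM {table} ORDER BY id"


# Hypothetical customer table with two rows of made-up data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, age REAL, income REAL, tenure REAL)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [(1, 40.0, 50000.0, 5.0), (2, 25.0, 30000.0, 1.0)],
)

# The weights are bound as parameters, in the same order as the columns.
query = build_scoring_query("customers", WEIGHTS)
params = [INTERCEPT, *WEIGHTS.values()]
scores = dict(conn.execute(query, params).fetchall())

for row_id, score in scores.items():
    print(row_id, score)
```

The whole model reduces to a single SELECT expression that the database evaluates row by row, which is why one might expect plain SQL to perform well for this simple case.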

Comments

10 Responses to “Clarifying the state of MPP in-database SAS”

  1. Paul Kent on May 7th, 2010 11:04 am

    Hey Curt; Keith is SAS’s CTO. That doesn’t stop us from telling Justin what we think he should do at NZ :-)

  2. Xiaobo.Gu on May 7th, 2010 11:48 am

    Have you heard anything about the SAS Greenplum in-database integration?

  3. Steven Hillion on May 7th, 2010 1:29 pm

    Xiaobo:

    Many if not most of Greenplum’s customers already access Greenplum using the SAS/ACCESS interface. Moreover, we are working directly with SAS to enable in-database scoring. It’s on our product roadmap for this year, and we are actively discussing this with them right now.

    All this is obviously critical for us – there’s a huge number of SAS users out there who have a rich library of SAS code, and we need to make sure they can work seamlessly within our ‘data cloud’.

    It’s important to point out that our philosophy is that analysts need access to a wide variety of different tools and applications for doing deep analytics. This includes SAS, of course, and BI tools, as well as R, MapReduce and so on. More power to the modelers! And that’s not to forget SQL of course. In many cases, complex models or MapReduce jobs can be done with surprisingly simple SQL statements. (I’m happy to share some with you if you like!) In other cases, not. You need different tools for different jobs. And that includes SAS.

  4. Curt Monash on May 7th, 2010 4:46 pm

    Embarrassing typo re Keith Collins’ job fixed!

  5. Michael Wexler on May 8th, 2010 11:45 am

    In db scoring is great. What about in db modeling? To build a model on all data still requires that I extract and use an external box or system to model, sometimes with all data in RAM (if using R, for example). If I have to model on smaller samples because of resources, then I still have the ETL and constrained view I have with current systems.

    So, just scoring is a good first step… but I won’t consider the job done til I can build my models using the same parallel scalability.

  6. Steven Hillion on May 8th, 2010 12:34 pm

    Amen to that, Michael! That’s going to be a central focus for me and my team at Greenplum. We’ve already made pretty good progress (I can send you some samples if you drop me an email) and we’re working hard with engineering and with partners to develop more.

  7. The SAS Dummy on May 10th, 2010 8:58 am

    This is your database…on SAS…

    Curt Monash posted a nice summary of the current and planned offerings that help to make SAS analytics more available “in the database” — allowing you to analyze your data quickly without having to move it around so much.

    If you use SAS with Tera…

  8. Further clarifying in-database MPP SAS | DBMS2 -- DataBase Management System Services on May 15th, 2010 12:14 am

    […] My recent post about SAS’ MPP/in-database efforts was based on a discussion in a shared ride to the airport, and was correspondingly rough. SAS’ Shannon Heath was kind enough to write in with clarifications, and to allow me to post same. With permission, I’ve also made trivial grammar edits. […]

  9. Seth Grimes on May 17th, 2010 6:24 am

    Michelle Wilkie from SAS said at the May 6 Aster Big Data Summit in Washington, DC, that Aster runs parallel instances of a SAS Data step on its nodes. I don’t recall her saying the following, but it would make sense: each instance would touch a subset of the overall data that the Data step would be manipulating, with the results then recombined as needed or left in place in the database. I believe she said the capability is shipping.

    The SAS Data step is very roughly similar to one SQL statement or a sequence of SQL statements wrapped in a procedural language. It joins tables and subselects rows and columns based on some criteria and allows various mathematical operations on them.

  10. Seth Grimes on May 19th, 2010 3:20 pm

    I’ve been told that the ability to run the SAS Data step within Aster nCluster is currently under development, not shipping.
