After a hurried discussion with SAS CTO Keith Collins and a follow-up with Teradata CTO Todd Walter, I think I’ve figured out the essence of the SAS port to Teradata. (Subtle nuances, however, will have to await further research.) Here’s what I think is going on:
1. SAS is porting or creating two different products or modules, with two different names (and I don’t know exactly what those names are). The two different things they are porting amount to modeling (i.e., analysis) and scoring (i.e., using the results of the model for automated decision-making).
2. Both products are slated for delivery at or near the time of SAS 9.2, which is slated for GA at or near the middle of next year. (Maybe somebody from SAS could send me the official word, as well as product names and so on?)
3. The essence of the modeling port is a library of static UDFs (User Defined Functions).
4. The essence of the SAS scoring port is the ability to easily generate a single “dynamic” UDF to score according to a particular model. This would seem to leverage Teradata’s scoring-related enhancements much more than it would compete or conflict with them.
5. There are two different kinds of benefits SAS gets from integrating with an MPP (Massively Parallel Processing) DBMS. One is actual parallel processing of operations, shortening absolute calculation time dramatically, and also leveraging Moore’s Law without painful SMP (Symmetric MultiProcessing) overhead. The other is a radical reduction in data movement costs for the handoff between the database and the SAS software. Interestingly, SAS reports huge performance gains even from putting its software on a single node inside the Teradata grid. That is, changing how data movement is done is already a huge win, even when there’s no reduction in the overall amount moved. But of course, in the complete implementation, where database and SAS processing are done on the same nodes, there’s also a huge reduction in actual data movement effort required.
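The “dynamic” scoring UDF in Point 4 amounts to generating one small row-level function from a fitted model. Here is a minimal Python sketch of the idea, not the actual SAS/Teradata mechanism; the logistic model, its coefficient names, and the helper `make_scoring_udf` are all made up for illustration.

```python
import math

def make_scoring_udf(coefficients, intercept):
    """Return a row-level scoring function for a fitted logistic model.

    In the real integration this would be compiled into a database UDF
    that runs on each node; here it is just an ordinary Python closure.
    """
    def score(row):
        # Linear combination of the row's features with the model weights.
        z = intercept + sum(w * row[name] for name, w in coefficients.items())
        # Logistic link: probability of the positive outcome.
        return 1.0 / (1.0 + math.exp(-z))
    return score

# Hypothetical churn model; the coefficients are invented for the example.
churn_udf = make_scoring_udf(
    {"tenure_months": -0.04, "support_calls": 0.3}, intercept=-1.0
)
print(round(churn_udf({"tenure_months": 24, "support_calls": 2}), 3))  # 0.204
```

The point is that the expensive part (fitting the model) happens once, while the cheap generated function scores millions of rows wherever the rows already live.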
One obvious question would be: How hard would it be for SAS to replicate this work on other MPP DBMS? Well, at its core this work involves implementing a variety of elementary arithmetic and data manipulation functions. So a first-best guess is that a fairly efficient port would be easy (given that this one has already been performed), but that the last 20% or whatever of the performance optimizations require a lot more work. As to whether or not this is more than a theoretical question — well, both SAS and SPSS are disclosed members of the Netezza Developers Network. As for SMP DBMS — well, some of the work certainly could be replicated, but other important parts don’t even make sense on Oracle or Microsoft the way they do on Teradata, Netezza, DATAllegro, et al.
So what goes into that library of modeling UDFs? I have nothing like a complete list, but here are a couple of examples. First of all, a lot of modeling is about binning. You take a continuous variable, and establish a set of bins to cluster into, with the result being a nominal variable. E.g., you might group ages as <18, 18-25, 25-35, and so on. Then you may quickly do another test with ranges including 18-30, 30-35, 45-60, and so on, and see how the results differ. This kind of guessing and testing of bin parameters is a huge part of the modeling process, especially in consumer/clustering types of applications.
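Binning is easy to picture as one of those library UDFs. A hypothetical Python stand-in, using the age ranges from the example above (the helper name `make_binner` and the edge-case handling are my assumptions, not anything SAS has disclosed):

```python
import bisect

def make_binner(cutpoints, labels):
    """Map a continuous value to a nominal bin label.

    cutpoints must be sorted, with len(labels) == len(cutpoints) + 1.
    A database version of this would be one of the static modeling UDFs.
    """
    def bin_value(x):
        # bisect_right finds which of the len(cutpoints)+1 bins x falls in.
        return labels[bisect.bisect_right(cutpoints, x)]
    return bin_value

# The age ranges from the text: <18, 18-25, 25-35, and 35 up.
age_bin = make_binner([18, 25, 35], ["<18", "18-25", "25-35", "35+"])
print(age_bin(22))  # 18-25

# Re-running the test with a different guess at the cutpoints is just
# building a new binner, which is why the guess-and-test loop is fast.
age_bin2 = make_binner([18, 30, 45], ["<18", "18-30", "30-45", "45+"])
```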
As another, more mathematically-oriented example area — sometimes before you even start an analysis, you want to throw out certain kinds of outlier data. Or you might want to replace it with calculated values (and the same goes for missing/null data). Basically, if you’re doing statistical analysis, various kinds of crude approximations can be helpful. And you can always salve your conscience about these inaccuracies by testing the final-result model against actual data and seeing how it performs.
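That kind of crude cleanup can be sketched in a few lines. The function `clean_column` and its three-standard-deviation threshold are assumptions for illustration, not a disclosed SAS algorithm; it replaces both missing values and extreme outliers with the column mean, one of the calculated-value strategies mentioned above:

```python
import statistics

def clean_column(values, k=3.0):
    """Crude pre-modeling cleanup: replace missing values and outliers
    beyond k standard deviations with the column mean.

    The threshold k is a made-up default; real tools offer many such knobs.
    """
    present = [v for v in values if v is not None]
    mean = statistics.mean(present)
    sd = statistics.pstdev(present)

    def replace(v):
        if v is None or (sd > 0 and abs(v - mean) > k * sd):
            return mean
        return v

    return [replace(v) for v in values]

print(clean_column([1, 2, 3, None]))  # the None becomes the mean, 2
```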
One note on parallelizing those modeling UDFs — there was some actual refactoring involved. For the most obvious example, if you’re distributing a COUNT among a bunch of nodes, what do you do? Uh, you count on each node, and then in a separate step you add up the results. And of course the same general idea would apply to all sorts of aggregates, linear combinations, polynomial sums, and so on.
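The count-then-add refactoring can be sketched directly. In this toy version, Python lists stand in for the rows on each node, and the names `parallel_count` and `parallel_sum` are illustrative, not SAS's:

```python
def parallel_count(partitions):
    """Two-phase aggregation: each 'node' counts its own rows,
    then a separate step adds up the per-node results."""
    local_counts = [len(part) for part in partitions]  # phase 1, one per node
    return sum(local_counts)                           # phase 2, combine

def parallel_sum(partitions):
    """The same refactoring applied to a SUM-style aggregate."""
    local_sums = [sum(part) for part in partitions]
    return sum(local_sums)

# Three hypothetical nodes, each holding a slice of the data.
nodes = [[1, 2, 3], [4, 5], [6]]
print(parallel_count(nodes))  # 6
print(parallel_sum(nodes))    # 21
```

Linear combinations and polynomial sums decompose the same way, since addition is associative; aggregates like medians need cleverer refactoring.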