After a hurried discussion with SAS CTO Keith Collins and a follow-up with Teradata CTO Todd Walter, I think I’ve figured out the essence of the SAS port to Teradata. (Subtle nuances, however, will have to await further research.) Here’s what I think is going on:
1. SAS is porting or creating two different products or modules, with two different names (and I don’t know exactly what those names are). The two different things they are porting amount to modeling (i.e., analysis) and scoring (i.e., using the results of the model for automated decision-making).
2. Both products are slated for delivery at or near the time of SAS 9.2, which is slated for GA at or near the middle of next year. (Maybe somebody from SAS could send me the official word, as well as product names and so on?)
3. The essence of the modeling port is a library of static UDFs (User Defined Functions).
4. The essence of the SAS scoring port is the ability to easily generate a single “dynamic” UDF to score according to a particular model. This would seem to leverage Teradata’s scoring-related enhancements much more than it would compete or conflict with them.
5. There are two different kinds of benefits SAS gets from integrating with an MPP (Massively Parallel Processing) DBMS. One is actual parallel processing of operations, shortening absolute calculation time dramatically, and also leveraging Moore’s Law without painful SMP (Symmetric MultiProcessing) overhead. The other is a radical reduction in data movement costs for the handoff between the database and the SAS software. Interestingly, SAS reports huge performance gains even from putting its software on a single node inside the Teradata grid. That is, changing how data movement is done is already a huge win, even when there’s no reduction in the overall amount moved. But of course, in the complete implementation, where database and SAS processing are done on the same nodes, there’s also a huge reduction in actual data movement effort required.
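The “dynamic” scoring UDF in Point 4 amounts to generating one small row-level function from a fitted model. Here is a minimal Python sketch of the idea, not the actual SAS/Teradata mechanism; the logistic model, its coefficient names, and the helper `make_scoring_udf` are all made up for illustration.

```python
import math

def make_scoring_udf(coefficients, intercept):
    """Return a row-level scoring function for a fitted logistic model.

    In the real integration this would be compiled into a database UDF
    that runs on each node; here it is just an ordinary Python closure.
    """
    def score(row):
        # Linear combination of the row's features with the model weights.
        z = intercept + sum(w * row[name] for name, w in coefficients.items())
        # Logistic link: probability of the positive outcome.
        return 1.0 / (1.0 + math.exp(-z))
    return score

# Hypothetical churn model; the coefficients are invented for the example.
churn_udf = make_scoring_udf(
    {"tenure_months": -0.04, "support_calls": 0.3}, intercept=-1.0
)
print(round(churn_udf({"tenure_months": 24, "support_calls": 2}), 3))  # 0.204
```

The point is that the expensive part (fitting the model) happens once, while the cheap generated function scores millions of rows wherever the rows already live.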
One obvious question would be: How hard would it be for SAS to replicate this work on other MPP DBMS? Well, at its core this work involves implementing a variety of elementary arithmetic and data manipulation functions. So a first-best guess is that a fairly efficient port would be easy (given that this one has already been performed), but that the last 20% or whatever of the performance optimizations require a lot more work. As to whether or not this is more than a theoretical question — well, both SAS and SPSS are disclosed members of the Netezza Developers Network. As for SMP DBMS — well, some of the work certainly could be replicated, but other important parts don’t even make sense on Oracle or Microsoft the way they do on Teradata, Netezza, DATAllegro, et al.
So what goes into that library of modeling UDFs? I have nothing like a complete list, but here are a couple of examples. First of all, a lot of modeling is about binning. You take a continuous variable, and establish a set of bins to cluster into, with the result being a nominal variable. E.g., you might group ages as <18, 18-25, 25-35, and so on. Then you may quickly do another test with ranges including 18-30, 30-35, 45-60, and so on, and see how the results differ. This kind of guessing and testing of bin parameters is a huge part of the modeling process, especially in consumer/clustering types of applications.
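Binning is easy to picture as one of those library UDFs. A hypothetical Python stand-in, using the age ranges from the example above (the helper name `make_binner` and the edge-case handling are my assumptions, not anything SAS has disclosed):

```python
import bisect

def make_binner(cutpoints, labels):
    """Map a continuous value to a nominal bin label.

    cutpoints must be sorted, with len(labels) == len(cutpoints) + 1.
    A database version of this would be one of the static modeling UDFs.
    """
    def bin_value(x):
        # bisect_right finds which of the len(cutpoints)+1 bins x falls in.
        return labels[bisect.bisect_right(cutpoints, x)]
    return bin_value

# The age ranges from the text: <18, 18-25, 25-35, and 35 up.
age_bin = make_binner([18, 25, 35], ["<18", "18-25", "25-35", "35+"])
print(age_bin(22))  # 18-25

# Re-running the test with a different guess at the cutpoints is just
# building a new binner, which is why the guess-and-test loop is fast.
age_bin2 = make_binner([18, 30, 45], ["<18", "18-30", "30-45", "45+"])
```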
As another, more mathematically-oriented example area — sometimes before you even start an analysis, you want to throw out certain kinds of outlier data. Or you might want to replace it with calculated values (and the same goes for missing/null data). Basically, if you’re doing statistical analysis, various kinds of crude approximations can be helpful. And you can always salve your conscience about these inaccuracies by testing the final-result model against actual data and seeing how it performs.
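That kind of crude cleanup can be sketched in a few lines. The function `clean_column` and its three-standard-deviation threshold are assumptions for illustration, not a disclosed SAS algorithm; it replaces both missing values and extreme outliers with the column mean, one of the calculated-value strategies mentioned above:

```python
import statistics

def clean_column(values, k=3.0):
    """Crude pre-modeling cleanup: replace missing values and outliers
    beyond k standard deviations with the column mean.

    The threshold k is a made-up default; real tools offer many such knobs.
    """
    present = [v for v in values if v is not None]
    mean = statistics.mean(present)
    sd = statistics.pstdev(present)

    def replace(v):
        if v is None or (sd > 0 and abs(v - mean) > k * sd):
            return mean
        return v

    return [replace(v) for v in values]

print(clean_column([1, 2, 3, None]))  # the None becomes the mean, 2
```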
One note on parallelizing those modeling UDFs — there was some actual refactoring involved. For the most obvious example, if you’re distributing a COUNT among a bunch of nodes, what do you do? Uh, you count on each node, and then in a separate step you add up the results. And of course the same general idea would apply to all sorts of aggregates, linear combinations, polynomial sums, and so on.
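The count-then-add refactoring can be sketched directly. In this toy version, Python lists stand in for the rows on each node, and the names `parallel_count` and `parallel_sum` are illustrative, not SAS's:

```python
def parallel_count(partitions):
    """Two-phase aggregation: each 'node' counts its own rows,
    then a separate step adds up the per-node results."""
    local_counts = [len(part) for part in partitions]  # phase 1, one per node
    return sum(local_counts)                           # phase 2, combine

def parallel_sum(partitions):
    """The same refactoring applied to a SUM-style aggregate."""
    local_sums = [sum(part) for part in partitions]
    return sum(local_sums)

# Three hypothetical nodes, each holding a slice of the data.
nodes = [[1, 2, 3], [4, 5], [6]]
print(parallel_count(nodes))  # 6
print(parallel_sum(nodes))    # 21
```

Linear combinations and polynomial sums decompose the same way, since addition is associative; aggregates like medians need cleverer refactoring.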