I followed up with Keith Collins of SAS today about SAS-in-the-database, expanding on what I learned or thought I did when we talked last month. Here’s the scoop:
SAS users do a lot of data filtering, aka data preparation, in SAS. These have WHERE clauses, just like SQL. However, only some of them map to actual SQL WHERE clauses. SAS is now implementing many of the rest as UDFs (User-Defined Functions), one DBMS at a time, starting with Teradata. In addition, SAS users can write custom filters that get registered as UDFs. This capability will be released with SAS 9.2. (The timing on SAS 9.2 is in line with the comment thread to my prior post on SAS-in-the-DBMS.)
As a “wouldn’t it be nice” future, Keith talked about a generic SAS format for these extensions. Frankly, I’m not holding my breath. Getting SQL extensions adopted and robustly used is hard, even when they’re worthy (think of MODEL BY clauses or most of SQL/MM).
SAS users in many cases also wind up scoring data, using the same constructs that data filters do. Again, SAS is putting this in the DBMS, via generated UDFs, in the SAS 9.2 time frame. Benefits include performance and the saving of a recoding step. As with the filtering, Keith looks forward to a more generic format for this than UDFs, but my caveats above again apply.
Note: This all applies only to some SAS algorithms, not all. For example, an algorithm that involves a lot of iteration probably won’t run best within the confines of a DBMS.
So far we haven’t covered anything that is MPP-specific. But some further ideas are. First, while it’s not going to happen in 2008, SAS plans to refactor some of its algorithms – namely those that it is going to run in the database anyway – for massively parallel processing. (Counts and aggregations are obvious examples.) Second – and this is still just a research project — SAS would like to run a complete SAS server on a node of an MPP DBMS, at which point all those UDFs could be stored procedures instead.
While I had Keith on the phone anyway, I raised some general issues of SAS performance. Keith suggested that data mining/predictive analytics and forecasting rely on high volumes of data, which is consistent with the large fraction of data warehouse complex queries that are devoted to SAS extracts.
But then he went on to confuse me a bit On the one hand, he pointed out that optimization was classically much less about high volumes of data,and more lots of computations on the data. On the other, he mentioned the Kohl’s optimization project . Specifically, he cited it as an example of processing thousands of variables, and hence more columns than most DBMS today do a good job of supporting. From this I gathered that Kohl’s is using SAS’s own data management tool, SAS Intelligence Storage.
Finally, Keith expanded on that point, and opined that some major differences between SAS processing and OLTP lay behind a strong SAS/Intel relationship. Processors designed for OLTP assume short instruction pipelines, and replicate cache from process to process. But he prefers his chips to (also) be optimized for long instruction pipelines, in which data is passed sequentially through a long series of steps.
Keep getting great research about data warehousing, analytics, and related technologies. Get a FREE subscription by RSS or email!