My recent post about SAS’ MPP/in-database efforts was based on a discussion in a shared ride to the airport, and was correspondingly rough. SAS’ Shannon Heath was kind enough to write in with clarifications, and to allow me to post same. With permission, I’ve also made trivial grammar edits.
- Regarding Netezza, SAS Scoring Accelerator for Netezza currently supports Netezza Performance Data Server 4.6. Support for Netezza TwinFin is slated for July 2010.
- Regarding Aster Data nCluster, I can understand your confusion on the distinction between Limited Availability (LA) and General Availability (GA). To help clarify, Limited Availability means that the technology is available for pre-qualified customers who are also active SAS Enterprise Miner users. In those cases, the product has been through QA and is available for purchase by those limited pre-qualified customers, and support for a limited number of customer sites is available. When the product becomes Generally Available, all qualifying customers are able to take advantage.
- Regarding your question of general parallelism/in-database capability, SAS currently is taking advantage of two technologies to provide scalability and performance. The first, SAS In-Database, takes the processing to where the data resides and parallelizes the analytical computations by leveraging the MPP architecture. The second technology, SAS Grid Manager, parallelizes computational steps or subparts of a process across different nodes and brings the results back together.
- To further expand upon Michelle’s answer to your question “What’s the big deal about in-database data mining scoring anyway?”, hopefully these additional bullets will help:
  - Manually converting the model scoring logic into SQL is difficult and time-consuming in several instances. It would result in long hours and higher costs and would require testing and revalidation of the code. It will also restrict modelers to basic linear regression models. Automated model scoring (using SAS Scoring Accelerator) will allow modelers to use complex models like decision trees, neural networks, etc. (i.e., leverage the full capabilities of SAS Enterprise Miner). (Note: Since I was comparing specialized in-database scoring to the in-database alternative of writing it all in SQL, this is actually the only one of the three bullets that addresses my question.)
  - Scoring is typically done on a periodic basis (daily, weekly, or monthly) or on an event-driven basis, and many of our customers are scoring large numbers of models against enormous tables. Conventional scoring of SAS models requires connecting to the database server to extract rows to SAS for scoring. The scores are commonly bulk loaded back to the database. As the number of rows in the table grows over time, network latency grows because the amount of data that is fetched from the database to the SAS scoring process increases. SAS Scoring Accelerator reduces the unnecessary data movement and replication for analytic (i.e., model deployment) processing.
  - SAS Scoring Accelerator allows the scoring process to be linearly scalable, completely leveraging the power of the parallel shared-nothing architecture of the database or data warehouse in question.
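
To make the two contrasts in those bullets concrete (hand-written SQL scoring versus generated scoring, and extract-and-score versus scoring where the data lives), here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the tiny decision tree, the `customers` table, and the helpers `tree_to_sql` and `score_in_python` are hypothetical, and SQLite stands in for the MPP database. It is not how SAS Scoring Accelerator works internally; it only shows the general idea of turning a model into an expression the database can evaluate.

```python
import sqlite3

# Hypothetical decision tree. Internal nodes are
# (column, threshold, subtree_if_less, subtree_otherwise); leaves are scores.
TREE = ("income", 50000,
        ("age", 30, 0.10, 0.25),
        ("age", 45, 0.40, 0.70))


def tree_to_sql(node):
    """Recursively translate the tree into a nested SQL CASE expression."""
    if not isinstance(node, tuple):
        return repr(float(node))  # leaf: emit the score as a SQL literal
    column, threshold, left, right = node
    return (f"CASE WHEN {column} < {threshold} "
            f"THEN {tree_to_sql(left)} ELSE {tree_to_sql(right)} END")


def score_in_python(node, row):
    """Walk the same tree on the client, one extracted row at a time."""
    while isinstance(node, tuple):
        column, threshold, left, right = node
        node = left if row[column] < threshold else right
    return float(node)


# SQLite is only a stand-in here for the database holding the data.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, income REAL, age REAL)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, 30000, 25), (2, 30000, 40), (3, 80000, 35), (4, 80000, 60)],
)

# In-database style: only ids and scores leave the database.
scoring_sql = (
    f"SELECT id, {tree_to_sql(TREE)} AS score FROM customers ORDER BY id"
)
in_db_scores = [(r["id"], r["score"]) for r in conn.execute(scoring_sql)]

# Conventional style: every input column of every row is pulled to the client.
rows = conn.execute(
    "SELECT id, income, age FROM customers ORDER BY id"
).fetchall()
extracted_scores = [(r["id"], score_in_python(TREE, r)) for r in rows]

print(scoring_sql)
print(in_db_scores == extracted_scores)
```

Even for this four-leaf tree the generated `CASE` expression is awkward to write and verify by hand, which is the point of the first bullet; a neural network or a large tree would be far worse. The second half of the sketch shows why the data-movement argument matters: the extract-and-score path has to fetch every input column for every row, while the in-database path returns only the scores.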