Until I did all this recent research on data warehousing, I didn’t realize just how big a role data mining plays in driving the whole thing. Basically, there are three things you can do with a data warehouse – classical BI, “operational” BI, and data mining. If we’re talking about long-running queries, that rules out operational BI, and it covers only part of classical BI; much of the rest is data mining. Indeed, if you think back to what you know of the customer bases at data warehouse appliance vendors Netezza and DATAllegro, there are a lot of credit-reporting-data types of users – i.e., data miners. And it’s hard to talk about uses for those appliances for very long without SAS extracts and the like coming up.
That was just the analysis. There’s also data mining scoring. In data mining scoring you substitute numbers for the values in a table, and then compute a row-by-row weighted sum of the results. Or else you do this in real time, for single rows, if that’s your preferred way of deploying things. Just about everybody agrees this is better done “in the DBMS” than in an extract file. Indeed, since the batch version of this is table-scan-to-the-max, scoring turns out to be ideally suited for data warehouse appliances and other MPP/shared-nothing products. (That doesn’t – and shouldn’t – stop Oracle from making scoring integration part of its data mining value-added pitch.)
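To make that concrete, here’s a minimal sketch of what batch scoring amounts to: a model maps each column’s values to numbers, and each row’s score is a weighted sum of those numbers. All the names here (`MODEL`, `COEFFS`, `score_table`) are illustrative, not from any vendor’s actual scoring engine:

```python
# Hypothetical sketch of batch data mining scoring.
# MODEL substitutes a number for each categorical value;
# COEFFS holds the per-column weights. Both are made-up examples.
MODEL = {
    "region": {"NE": 0.4, "SW": -0.1},
    "segment": {"retail": 0.7, "wholesale": 0.2},
}
COEFFS = {"region": 1.5, "segment": 2.0}

def score_row(row):
    # Substitute a number for each value, then take the weighted sum.
    total = 0.0
    for col, coeff in COEFFS.items():
        total += coeff * MODEL[col].get(row[col], 0.0)
    return total

def score_table(rows):
    # Batch version: one full pass over the table, scoring every row --
    # which is why this workload is essentially a giant table scan.
    return [score_row(r) for r in rows]
```

The real-time deployment is just `score_row` called on a single incoming record instead of the full scan.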
I further think scoring could be particularly well suited for FPGA-based pattern matching. But I’m not aware of Netezza doing anything in this direction. On the other hand, they may think that the huge projections inherent in data mining scoring – a byproduct of variable reduction – mean the FPGA is already more than pulling its weight anyway.
The wild card here is the attempt by companies like KXEN and Verix to change the rules of data mining. (Verix currently runs on Oracle, by the way.) KXEN, in particular, would like data mining to be done in a lot more, but probably a lot smaller, processing runs than it is today.