My July 2 comments on predictive modeling were far from my best work. Let’s try again.
1. Predictive analytics has two very different aspects.
Developing models, aka “modeling”:
- Is a big part of investigative analytics.
- May or may not be difficult to parallelize and/or integrate into an analytic RDBMS.
- May or may not require use of your whole database.
- Generally is done by humans.
- Often is done by people with special skills, e.g. “statisticians” or “data scientists”.
More precisely, some modeling algorithms are straightforward to parallelize and/or integrate into an RDBMS, but many are not.
Using models, most commonly:
- Is done by machines …
- … that “score” data according to the models.
- May be done in batch or at run-time.
- Is embarrassingly parallel, and is much more commonly integrated into analytic RDBMS than modeling is.
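To make the "embarrassingly parallel" point concrete, here's a minimal Python sketch of batch scoring. The coefficients and rows are made up, standing in for an already-trained logistic model; the point is that each row is scored with no reference to any other row.

```python
import math

# Hypothetical coefficients from an already-trained logistic model;
# scoring just applies them to new rows -- no learning happens here.
COEFS = [0.8, -0.5, 1.2]
INTERCEPT = -0.3

def score(row):
    # Each row is scored independently of every other row. That independence
    # is what makes scoring embarrassingly parallel, and easy to push into
    # an analytic RDBMS as a per-row function.
    z = INTERCEPT + sum(c * x for c, x in zip(COEFS, row))
    return 1.0 / (1.0 + math.exp(-z))  # logistic link

rows = [[1.0, 2.0, 0.5], [0.2, 1.1, 3.0], [2.5, 0.0, 1.0]]
scores = [score(r) for r in rows]  # in practice: one partition per node
```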
2. Some people think that all a modeler needs are a few basic algorithms. (That’s why, for example, analytic RDBMS vendors are proud of integrating a few specific modeling routines.) Other people think that’s ridiculous. Depending on use case, either group can be right.
3. If adoption of DBMS-integrated modeling is high, I haven’t noticed.
4. The term predictive analytics was invented or at least popularized by SPSS, some years before IBM bought the company. The industry eventually adopted the term. I prefer predictive modeling. It is fair to say that predictive modeling subsumes both statistical modeling and machine learning.
Nobody really knows exactly what data mining does or doesn’t include — the term is a poster child for Monash’s Third Law — but whatever it is, it seems central to the SAS and SPSS product lines. Simply using “data mining” as a synonym for “predictive modeling” won’t lead you too far astray.
5. In that July 2 post I wrote:
I think the predictive modeling state of the art has become:
- Cluster in some way.
- Model separately on each cluster.
“Cluster in some way” can actually mean several things, for example:
- K-means or whatever.
- Ayasdi’s exotic approach to (quasi-)clustering.
- Decision trees.
The one thing it doesn’t mean is “scale out”; I apologize for the ambiguity to anybody who read it that way.
This is often called ensemble modeling, except that I think — what a shock! — different people use the term somewhat differently.
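As a toy illustration of “cluster in some way, then model separately on each cluster,” here’s a pure-Python sketch: naive 1-D k-means for the clustering step, then ordinary least squares fit separately per cluster. The data and the choice of k-means are mine for illustration, not from any particular vendor.

```python
import random

random.seed(0)

# Toy data: two regimes with different linear relationships between x and y.
data = [(x, 2 * x + 1) for x in range(10)] + \
       [(x, -0.5 * x + 30) for x in range(20, 30)]

# --- Step 1: "cluster in some way" (here, naive 1-D k-means on x;
# decision-tree leaves or fancier clustering would serve the same role).
def kmeans_1d(xs, k=2, iters=20):
    centers = random.sample(xs, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[min(range(k), key=lambda i: abs(x - centers[i]))].append(x)
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers

xs = [x for x, _ in data]
centers = kmeans_1d(xs)

# --- Step 2: model separately on each cluster (ordinary least squares).
def ols(points):
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    slope = (sum((x - mx) * (y - my) for x, y in points)
             / sum((x - mx) ** 2 for x, _ in points))
    return slope, my - slope * mx

models = {}
for i in range(len(centers)):
    cluster = [p for p in data
               if min(range(len(centers)),
                      key=lambda j: abs(p[0] - centers[j])) == i]
    models[i] = ols(cluster)  # one (slope, intercept) per cluster
```

Each cluster recovers its own regime’s line, which a single global regression would blur together.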
6. Much of the difficulty and delay-to-value in predictive modeling comes from data preparation/feature selection — not so much the scripting of the ETL (Extract/Transform/Load), but rather choices about which variables to model on and, often, how to describe them. So it’s unsurprising that vendors sometimes tell me “Our tool is great because the data preparation is automagically handled”; I’ve heard that from companies as big as KXEN and as small as Simularity.
Typically, what’s going on is that they’ve come up with a particular approach to modeling that, among other virtues, has the short-time-to-value benefit. Well:
- Users may not want to replace their current modeling tools and associated business processes with a new one-trick pony.
- On the other hand, the new tools could accelerate the use of the old ones, just because of what they provide in feature selection and data prep. I.e., you do the best you can with the new tool, and that tells you what data to put into your old one.
I think some KXEN users follow just such an approach.
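To illustrate what even crude automated variable screening looks like (a toy of my own, far simpler than anything KXEN or anybody else actually ships): rank candidate variables by absolute correlation with the target, and shortlist the top few for whatever modeling tool you already use.

```python
# Toy "automagic" variable screening: rank candidate variables by absolute
# Pearson correlation with the target and keep the top few. The table,
# variable names, and target values below are all invented.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical prepared table: candidate variables and a target.
table = {
    "age":    [25, 32, 47, 51, 62],
    "visits": [1, 3, 2, 5, 4],
    "noise":  [9, 1, 7, 3, 5],
}
target = [100, 130, 180, 210, 240]

ranked = sorted(table, key=lambda v: -abs(pearson(table[v], target)))
shortlist = ranked[:2]  # feed these into your existing modeling tool
```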
7. I’ve spent a few hours talking with Ayasdi, and I’m still confused. But here are a few notes as best I understand things.
Company basics include:
- Innovative approach to predictive modeling, based on some advanced mathematics.
- ~50 people.
- ~20 paying customers.
- Verticals of financial services, bio/pharma, oil/gas (!), and government.
- Downtown Palo Alto. (I parked next to ClearStory and walked over to Ayasdi for my meeting.)
Buzz says Ayasdi has a heavy component of professional service in what it does. Ayasdi disputes this. Buzz also says Ayasdi is hot. I doubt Ayasdi disputes that part.
There’s some serious math involved in Ayasdi, but I’m skeptical about that aspect, for several reasons:
- I haven’t understood it yet.
- Ayasdi occasionally says things that are mathematically incorrect. (No, a topology does NOT assume an underlying metric space.)
- Advanced math and software rarely mix well. Even when the company does OK, the original advanced math claim tends to fade into the background. (E.g., support vector machines at KXEN, rough sets at Infobright.)
So I’ll just summarize Ayasdi’s math, as best I understand it, this way:
- Ayasdi uses a variety of different scoring functions to group your data into buckets.
- Ayasdi looks at which data points wind up in the same bucket several times, or in nearby ones.
- Users are encouraged to model separately — in most cases to date using tools and techniques from outside Ayasdi — on the most interesting of those sets of especially friendly data points.
- The whole thing could be viewed as inducing a covering of, say, the real line. Pretty pictures ensue based on the nerve of that covering and/or a kind of Reeb graph of any one of the scoring functions. (One of the many things I don’t understand about the math of Ayasdi is how those two possibilities dovetail together.)
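The bucketing idea, as best I understand it, can be caricatured in a few lines of Python: apply a scoring (“filter”) function, cover its range with overlapping intervals, and group the points landing in each interval. To be clear, this is my sketch of the general approach, not Ayasdi’s actual algorithm; the overlap is what lets a point sit in “nearby” buckets.

```python
# A bare-bones caricature of the bucketing step described above (my reading,
# not Ayasdi's code): score every point, cover the range of scores with
# overlapping intervals, and bucket points by interval membership.
def overlapping_buckets(points, score, n_intervals=4, overlap=0.25):
    scores = [score(p) for p in points]
    lo, hi = min(scores), max(scores)
    width = (hi - lo) / n_intervals
    pad = width * overlap  # the overlap puts some points in two buckets
    buckets = []
    for i in range(n_intervals):
        a = lo + i * width - pad
        b = lo + (i + 1) * width + pad
        buckets.append([p for p, s in zip(points, scores) if a <= s <= b])
    return buckets

points = list(range(20))
# Identity as the scoring function, purely for illustration.
buckets = overlapping_buckets(points, score=lambda x: x)
# Points that land in more than one bucket are the "especially friendly"
# ones you might then model on separately, with your usual tools.
```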
8. I’m hearing a few more mentions of Mahout than I used to.
9. Skytree is accumulating some resources (money, people), but I haven’t talked with them.