May 28, 2012

Quick-turnaround predictive modeling

Last November, I wrote two posts on agile predictive analytics. It’s time to return to the subject. I’m used to KXEN talking about the ability to do predictive modeling very quickly, perhaps without professional statisticians; that’s the core of what KXEN does. But I was surprised when Revolution Analytics told me a similar story, based on a different approach, because ordinarily that’s not how R is used at all.

Ultimately, there seem to be three reasons why you’d want quick turnaround on your predictive modeling:

A KXEN story along these lines might go:

The point here is that KXEN automates some modeling steps that are manual with most other tools, allowing each individual model to be built more quickly.

One production Revolution use case goes:

R (in Revolution’s version or any other that I know of) doesn’t have KXEN’s general quick-modeling features, and perhaps not even those of SAS or SPSS. But building a specific parameterized app is obviously a workaround for that lack.
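To make that workaround concrete, here is a minimal sketch of what such a parameterized modeling app might look like. It is written in Python purely for illustration, and the function name, column names, and segmentation scheme are my assumptions, not Revolution’s actual design: the analyst supplies a segment, a target variable, and a training window, and the app refits a model from scratch.

```python
# Minimal sketch of a parameterized modeling app (illustrative assumptions:
# an ISO-formatted "event_date" column, a "segment" column, and logistic
# regression as the model type).
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def fit_segment_model(df, segment, target, feature_cols, train_start, train_end):
    """Refit a model from scratch for one segment over one training window."""
    window = df[(df["event_date"] >= train_start) & (df["event_date"] <= train_end)]
    subset = window[window["segment"] == segment]
    X_train, X_test, y_train, y_test = train_test_split(
        subset[feature_cols], subset[target], test_size=0.3, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    return model, auc

# Each re-run is just a new parameterization, not a new modeling project, e.g.:
# model, auc = fit_segment_model(df, segment="prepaid", target="churned",
#                                feature_cols=["tenure", "monthly_spend"],
#                                train_start="2012-01-01", train_end="2012-04-30")
```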

That said, there are indeed a lot of cases where you need to re-run your models from scratch, whether via convenient technology or by throwing lots of bodies at the problem. Suppose, for example, you’re doing some kind of marketing campaign management for a telecom service provider. Potential changes to your data, or to its interpretation, include:

and also:

Any of these changes (and that’s hardly a complete list) could invalidate your existing models, or otherwise make it advantageous for you to run new ones.

Of course, “from scratch” is not necessarily entirely from scratch; while each model may be new, the underlying database is likely to change more slowly. It’s hard to do quick-turnaround predictive modeling unless you start out with a database that’s in good shape, even when one of the reasons you need quick turnaround is that you keep adding new kinds of data.

One last note — little of this is in the vein of “BI has told us something interesting; now let’s start modeling.” The step from operational/monitoring business intelligence to drilldown/investigative BI happens all the time, but I’m not aware of many cases (yet) where there’s a follow-on step of quick-turnaround predictive modeling. Even when modeling is done quickly, it seems to be proactive much more than reactive — or if it is reactive, it’s reactive to big news (stock market crash, natural disaster, whatever) rather than to, say, a few surprising sales results.

The time may (and should) come when iterative investigative BI and iterative predictive analytics go hand in hand, but — presumably with a few exceptions I’m overlooking — that natural-seeming synergy doesn’t seem to be exploited much today.

Comments

10 Responses to “Quick-turnaround predictive modeling”

  1. J. Andrew Rogers on May 28th, 2012 1:25 pm

    An issue for some types of quick-turnaround predictive modeling is that many databases capture interpretations of data rather than the “physics” of the data. The database representation tacitly assumes an interpretation that may be poor for some analytics.

    The distinction between storing the “physics” of data and an interpretation is a big deal when working with sensing data, and is generally useful when fusing many different data sources. Geospatial geometries do not exist in a cartographic projection even if that is how they are presented to a user. A photograph is a collection of spectral samples from which countless features can be derived. Enumerations and number ratings are often a dimensional reduction that summarizes a more complex set of factors that could be captured in text fields. Data models often store what the user expects to see rather than what that presentation actually represents for analytical purposes.

    Hence the term “physics”, since it nominally captures the objective fact of reality underlying the interpretation and presentation and makes the assumption that this is a unifying context across diverse data sources. It is not efficient for fixed, narrow use cases but it is analytically flexible.

  2. Curt Monash on May 28th, 2012 7:36 pm

    Hiya!

    I write about derived or cooked data a lot.

    But you’re right — it’s a relevant point here as well. 🙂

  3. J. O'Brien on May 29th, 2012 3:48 am

    Whether or not agile predictive analytics is preferred, I’m interested in your thoughts on the significance of model version management. Updating models based on changes in the real world makes sense, but does maintaining a model’s version and relevant time frame provide additional value or lower risks in any way?

  4. Curt Monash on May 29th, 2012 5:15 am

    Data provenance is generally a good thing, especially for derived data. That’s still in its infancy. I guess model version management could be one way of addressing that need.

    Also, all code should be version managed, and models are code.

    If the outputs of models were used much in BI, we’d have yet another reason for model version management, but to this point they aren’t used very often.

    All that said — we have data, which includes the actual responses to our actual campaigns. So model version management feels more like a nice-to-have than a must-have.

  5. Thomas W Dinsmore on May 29th, 2012 8:50 am

    Couple of comments:

    (1) I’ve yet to see anyone question the value of reducing cycle time for predictive analytics. All other things equal, a dollar today is better than a dollar tomorrow.

    (2) The bottlenecks in predictive modeling are on the front end and back end, in data marshalling and deployment. KXEN’s claims to automation don’t fix either of those problems.

    (3) Concerning deployment, it’s worth noting that since Revolution R Enterprise can run inside an IBM Netezza appliance, model scoring is a straightforward process.

    (4) A single model with a thousand subsegments is complex. Managing a thousand models is also complex.

    (5) It’s possible to produce individual-level predictions without building individual models. In the real world of predictive analytics, practitioners understand that additional complexity does not necessarily add to predictive power; segmenting a population only adds to overall predictive power if the relationship between the dependent variable and the independent variables is significantly different across the segments (a rough way to check this is sketched after this comment). More often than not, observed differences are simply noise.

    (6) For deployed models, the “versioning” problem is a matter of tracking how well each model performs by matching predicted and actual measures. As the number of deployed models increases, the time and cost to do this becomes prohibitive; organizations with rapid deployment processes in place simply rebuild and publish the models on a regular cycle.
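As a rough illustration of point (5), here is a minimal sketch comparing one pooled model against per-segment models on the same holdout data. It is written in Python for illustration only; the column names, the choice of logistic regression, and the AUC comparison are my assumptions, not anything Dinsmore prescribes.

```python
# Sketch: does segmenting actually add predictive power, or is it noise?
# Compare a single pooled model against per-segment models on one holdout set.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def pooled_vs_segmented_auc(df, feature_cols, target="responded", segment_col="segment"):
    # Hold out the same rows for both approaches so the comparison is fair.
    train, test = train_test_split(df, test_size=0.3, random_state=0)

    # Pooled: one model for the whole population.
    pooled = LogisticRegression(max_iter=1000).fit(train[feature_cols], train[target])
    pooled_auc = roc_auc_score(test[target],
                               pooled.predict_proba(test[feature_cols])[:, 1])

    # Segmented: one model per segment, each scoring its own holdout rows.
    # (Assumes every segment in the holdout also appears in training.)
    segment_scores = []
    for seg, seg_train in train.groupby(segment_col):
        model = LogisticRegression(max_iter=1000).fit(seg_train[feature_cols],
                                                      seg_train[target])
        seg_test = test[test[segment_col] == seg]
        segment_scores.append(pd.Series(model.predict_proba(seg_test[feature_cols])[:, 1],
                                        index=seg_test.index))
    segmented_auc = roc_auc_score(test[target],
                                  pd.concat(segment_scores).reindex(test.index))

    # If segmented_auc barely exceeds pooled_auc, the extra models are mostly noise.
    return pooled_auc, segmented_auc
```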

  6. John Ball on May 29th, 2012 10:52 pm

    Thomas,

    We agree that many of the bottlenecks are in data prep and manipulation (I assume that’s what you call marshaling) and in deployment. That’s why for data prep and manipulation we have a dedicated product to automate many (not all) of the steps (http://www.kxen.com/Products/Explorer), and for deployment we’ve been doing in-database scoring for years (as well as outputting C++, Java, yes even SAS, etc.). Flexibility in both is key, because the plumbing and horsepower required for massive scoring on 50 million subscribers are different from what’s needed when calculating the “next best thing” in a call center. We also realize that there are many folks who use SAS, SPSS, or homegrown SQL to prepare their analytical data sets, so we can directly read in many formats like SAS, SPSS, etc.

    That said, we still encounter plenty of companies where analysts spend weeks (sometimes months) on building a model (I’m counting from the end of data marshaling to the start of deployment). Most surveys I have seen from industry analysts estimate it at 20 to 35% of total time spent. So it is incorrect to ignore the time spent on modeling in the equation, just as it is overly simplistic to think that “modeling automation” solves all the problems for agility.

    Regarding the number of models, I think there is some confusion. We do not always advocate for lots of models. We say that you should have the flexibility to build the number that matches your business needs. And sometimes, you simply have to build many models or you are not accurately describing the problem. The simplest example is where a company has different data available by line of business and product line (corporate, enterprise, consumer, pre-paid, post-paid, etc.). That is not the analyst’s fault; it is just the reality of their business systems.

    At the end of the day, predictive power can be measured by the results obtained. And if you get better business results (more churners identified, more products sold, more email campaigns opened, etc.) with fewer models, then great. Our experience is that we often produce incremental gains over previous results by taking a more granular and agile approach. Not always, but very regularly.

    I will contact you offline so we can have a longer discussion.

    John

  7. Thomas W Dinsmore on May 30th, 2012 11:45 pm

    In the contemporary world, a modeler who spends months building a model will soon be an ex-modeler.

  8. John M. Wildenthal on May 31st, 2012 12:53 pm

    Re:

    “(6) For deployed models, the “versioning” problem is a matter of tracking how well each model performs by matching predicted and actual measures. As the number of deployed models increases, the time and cost to do this becomes prohibitive; organizations with rapid deployment processes in place simply rebuild and publish the models on a regular cycle.”

    This makes me uncomfortable from an “unclosed virtuous cycle” aspect.

    I guess one might respond to questions about unmonitored model performance by claiming that since the model is [always] recent, it is probably performing about as well as it could. But that leaves open the question of quantifying the benefit, which in turn leaves open the value of the model – and the modeler – to the firm. If you never go back and check, how do you prove the model is doing enough better than chance to justify the costs of creating and using it in the first place? Maybe the firm would be better off firing the modeler and sending random mailings, given a low actual lift. Wouldn’t we rather be showing hard numbers to our managers during our performance reviews?

    As for the cost, if you are creating and using models in a regulated industry you will do whatever your regulator tells you to do, including whatever performance monitoring they require. Perhaps the cost of monitoring performance should be included in the cost of deployment? And isn’t performance measurement pretty straightforward BI after adding just a few more variables to your DW, the same DW that supports quick-turnaround modeling in the first place?

    I don’t think the cost of monitoring need be high, particularly if the DW is set up well for creating the models in the first place. And it is a good business practice to get a reasonable measure of costs and benefits for post hoc evaluation. If the perceived performance benefits of a model are so small that they are dwarfed by the cost of measuring the performance, maybe the firm doesn’t really benefit from creating/having/using that model.

  9. Thomas W Dinsmore on June 1st, 2012 5:42 pm

    The conventional way to monitor a model is to run a K-S test (or similar statistic) to measure model drift; a minimal sketch of such a check follows this comment. That’s easy to do when you have one deployed model, but not so easy to do when you have a thousand deployed models.

    As the number of deployed models increases and the time needed to re-estimate and re-deploy models declines, it becomes more cost effective to simply re-estimate models on a regular cycle than to evaluate each model separately. You are never worse off if you do this. It’s like washing the car every Saturday whether it needs it or not.
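For concreteness, here is a minimal sketch of the drift check Dinsmore describes, assuming the scores from the deployment-time baseline and the current scored population are available as arrays. scipy’s two-sample K-S test stands in for whatever statistic a given shop actually uses, and the function and variable names are illustrative, not any vendor’s API.

```python
# Sketch: flag score drift for a deployed model with a two-sample K-S test,
# comparing the baseline (deployment-time) score distribution to current scores.
from scipy.stats import ks_2samp

def score_drift(baseline_scores, current_scores, alpha=0.05):
    """Return the K-S statistic and whether the drift looks significant."""
    stat, p_value = ks_2samp(baseline_scores, current_scores)
    return stat, p_value < alpha

# With one model this is trivial; with a thousand deployed models you must keep
# a baseline sample per model and repeat the check for each one, which is the
# bookkeeping burden Dinsmore points at, e.g.:
# drift_by_model = {model_id: score_drift(baselines[model_id], current[model_id])
#                   for model_id in deployed_models}
```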

  10. SAP is buying KXEN | DBMS 2 : DataBase Management System Services on September 11th, 2013 12:30 pm

    […] that was already old news back in 2006, and KXEN had pivoted to a simpler and more automated modeling approach. Presumably, this ease of modeling was part of the reason for KXEN’s […]
