November 28, 2011

Agile predictive analytics — the “easy” parts

I’m hearing a lot these days about agile predictive analytics, albeit rarely in those exact terms. The general idea is unassailable, in that it boils down to using data as quickly as reasonably possible. But discussing particulars is hard, for several reasons:

At least three of the generic arguments for agility apply to predictive analytics:

But the reasons to want agile predictive analytics don’t stop there.

Not only is it hard to get stuff right on the first try for a given information set, but the available information can also change quickly. For example:

What’s more, often you deliberately don’t want to test, model, or tune all your variables at once. First you determine whether the ad text should be “Would you be so kind as to allow us to supply you with our wares?” or “Buy it, dude!”; only afterwards do you decide whether the color scheme should rely on red or green.
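A minimal sketch of that one-variable-at-a-time approach might look like the following. The impression and conversion counts are invented purely for illustration, the `Result` and `pick_winner` names are hypothetical, and a real test would also check statistical significance before declaring a winner.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Result:
    """Outcome of showing one variant (hypothetical structure)."""
    variant: str
    impressions: int
    conversions: int

    @property
    def rate(self) -> float:
        # Conversion rate; 0.0 if the variant was never shown.
        return self.conversions / self.impressions if self.impressions else 0.0


def pick_winner(results: List[Result]) -> str:
    """Return the variant with the highest observed conversion rate."""
    return max(results, key=lambda r: r.rate).variant


# Stage 1: vary only the ad text. Counts are made up for illustration.
text_results = [
    Result("Would you be so kind as to allow us to supply you with our wares?", 1000, 18),
    Result("Buy it, dude!", 1000, 31),
]
best_text = pick_winner(text_results)

# Stage 2: hold the winning text fixed and vary only the color scheme.
color_results = [
    Result("red", 1000, 33),
    Result("green", 1000, 29),
]
best_color = pick_winner(color_results)

print(best_text, "/", best_color)
```

Each stage is a small, quick experiment, which is the point: the second test cannot even be specified until the first one has produced an answer.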

With that as backdrop, how can you make your predictive analytics more agile? Let’s start by breaking predictive analytics into four pieces:

Only the second of those has much excuse for being an agility bottleneck; the other three are well addressed by technology you can buy (or straightforwardly build) today.

The deployment part of the story can be pretty simple, at least technically — spit out some PMML (Predictive Modeling Markup Language), and if you’re deploying to a DBMS with good enough PMML support, you’re good to go. Any vendor who doesn’t offer that degree of simplicity had better be working toward it fast. That said, your applications that are infused with predictive analytics need to be modular enough to accommodate model changes; if not, some refactoring lies ahead. And the same can be said for the work processes that surround them.
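To illustrate what "modular enough to accommodate model changes" could mean in practice, here is a hedged sketch in plain Python. It does not use PMML or any real scoring engine; `ModelSlot`, `model_v1`, `model_v2`, and the feature names are all hypothetical, and the coefficients are invented. The only idea it shows is that application logic calls a stable interface, so a revised model can be swapped in without touching that logic.

```python
from typing import Callable, Dict

# A scorer maps a dictionary of named features to a single score.
Scorer = Callable[[Dict[str, float]], float]


class ModelSlot:
    """Holds the currently deployed scorer behind a stable interface."""

    def __init__(self, scorer: Scorer, version: str) -> None:
        self._scorer = scorer
        self.version = version

    def swap(self, scorer: Scorer, version: str) -> None:
        # Replace the model; callers of score() are unaffected.
        self._scorer = scorer
        self.version = version

    def score(self, features: Dict[str, float]) -> float:
        return self._scorer(features)


def model_v1(f: Dict[str, float]) -> float:
    # First attempt: uses a single feature (invented coefficient).
    return 0.5 * f["recency"]


def model_v2(f: Dict[str, float]) -> float:
    # Revised model: adds a second feature (invented coefficients).
    return 0.3 * f["recency"] + 0.2 * f["frequency"]


def should_show_offer(slot: ModelSlot, features: Dict[str, float],
                      threshold: float = 1.0) -> bool:
    """Application logic: depends only on slot.score(), not on any model."""
    return slot.score(features) >= threshold


slot = ModelSlot(model_v1, "v1")
customer = {"recency": 1.0, "frequency": 4.0}

before = should_show_offer(slot, customer)
slot.swap(model_v2, "v2")
after = should_show_offer(slot, customer)

print(before, after, slot.version)
```

If `should_show_offer` had hard-coded the v1 formula instead, every model revision would have meant an application change, which is the refactoring burden described above.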

The data mustering parts should be pretty straightforward too. Setting up a relational data mart tuned for investigative analytics isn’t all that hard or costly (unless, perhaps, your database is enormous), and the same actually goes for a Hadoop cluster. Beyond that, if you can model and deploy from the same database, that’s great; if not, you have an ETL (Extract/Transform/Load) need. I guess you could have data quality/MDM (Master Data Management) issues as well, but offhand I’m not seeing why you wouldn’t push their solutions back to analysis time. And any decent analytic technology stack can give sub-hour latency; while that may not suffice from all standpoints, it’s plenty fast enough for analysis-time agility.

With those preliminaries out of the way, now let’s turn to the heart of the agile predictive analytics challenge.

Comments

15 Responses to “Agile predictive analytics — the “easy” parts”

  1. Agile predictive analytics – the heart of the matter : DBMS 2 : DataBase Management System Services on November 28th, 2011 2:41 pm

    [...] already suggested that several apparent issues in predictive analytic agility can be dismissed by straightforwardly applying best-of-breed technology, for example in analytic [...]

  2. Hari on November 28th, 2011 10:09 pm

    Curt,

    I’m not sure if the ETL/MDM parts are easy. At my last job, we were gathering data from many source systems that were constantly changing. The source systems introduced new features, and we wanted the new data asap into our analytic mart. Accepting the data wasn’t a big deal, but presenting the varying input in a familiar/constant format to the data-miners was hard.

    I feel it’s getting easier for app-developers to add features and get more types of data.

    So how is an analytic-mart project-manager to keep up? Should he stop value-add/dimensional-modeling-etl-stuff and make the raw data available in some schema-less/schema-agnostic/as-is format to the data-miners?

    ALSO: I feel like this discussion is related to Joe Hellerstein’s MAD concept: http://databeta.wordpress.com/2009/03/20/mad-skills/

    Hari

  3. Thomas W Dinsmore on November 29th, 2011 7:58 am

    Excellent post. While PMML may be the solution for deployment in the future, at present it is at best a partial solution:

    (1) The standard is incomplete, and does not include newer or highly specialized algorithms
    (2) Vendor support for PMML import and export is incomplete

    The most widely used commercial data mining platform currently supports PMML export for about half of its algorithms. C or Java export is a better choice at present.

  4. Curt Monash on November 29th, 2011 8:19 am

    Hari,

    I keep trying to forget the MAD skills paper, and people keep bringing it up. :(

  5. Curt Monash on November 29th, 2011 8:21 am

    Hari,

    Good point about the rapidly changing schemas. That said, it’s not too hard to get all the data into one place even in such cases. The problem is that the contents of the place keep changing, and the modelers have to be willing to roll with the changes.

    There’s no way to get around the fact that if you have new and better information now, then what you did before probably isn’t optimal.

  6. Thomas W Dinsmore on November 29th, 2011 9:53 am

    Two observations concerning the data mustering problem:

    (1) Organizations with extensive data management operations run by the analytics team tend to have a history of poor communication between the enterprise data management organization and the businesses that consume analytics. Analysts in these organizations are doing what enterprise IT won’t do or can’t do.

    (2) There are actually two types of data transformation done by analysts: (a) cleaning up messy, inconsistent or incomplete data; and (b) enhancing the data with mathematical transforms that improve model performance. In an ideal world, analysts should never have to do (a); refer to point (1), above. On the other hand, analysts almost always have to do (b). This is significant, because if the predictive model depends on type (b) transformations, the same transformation must be replicated in the deployment environment. For that reason alone, deployment is more complicated than slapping a model into PMML, since PMML does not capture the transformations.

    You can argue that analysts should avoid type (b) transformations and accept lower model lift, but it’s a tough sell. Like telling a customer who asks for a Lexus that a Toyota will work just as well.

  7. Curt Monash on November 29th, 2011 11:19 am

    Thomas,

    If getting analytics done involves storing derived data along the way, I’m all for it.

    I’m also all for having specialists who get you the best possible results, most efficiently, in terms of model accuracy and computing performance alike.

    I’m just saying that in most cases departments need the ability to do quick-and-dirty modeling on their own, rather than leaving all conceptual choices to siloed statistical specialists.

    Most of the exceptions I envision occur when there’s really only one department relying on predictive analytics anyway or, at the other extreme, when a particular department has a predictive analytics staff so large it almost is like an IT shadow silo.

  8. Thomas W Dinsmore on November 29th, 2011 11:26 am

    Dinsmore’s First Law of Analytics — the degree to which organizations are willing to trust anyone to do predictive modeling is inversely proportional to the business impact of the predictions.

    You can bet that banks aren’t going to trust anyone to build models for Basel II compliance, risk or fraud.

    I’m not clear on the use case for “quick and dirty” predictions. There is definitely a need for businesspeople to get their hands on data for crosstabulation, frequency distributions and so forth. But predictive analytics are a different animal.

  9. Curt Monash on November 29th, 2011 12:47 pm

    Example: Red clothing sells in Omaha and orange clothing sells in Syracuse, for obvious reasons of football fandom. Wal-Mart figured this out in the 1970s, running weekly reports in the technology of the day.

    But can we build on that to come up with a “local sports fan” profile? What can we do with that profile if it starts to work?

    Perhaps better example: I go get some social media data, telling me who said what. I mine it to develop hypotheses about what I could then sell them, and how, and test those. It works. Bad news: Not enough individuals in the social-media-talkative sample to matter. Good news: I nonetheless developed insights that will help me sell more effectively to the silent majority.

  10. Thomas W Dinsmore on November 29th, 2011 1:03 pm

    The other bad news — about 90% of the data produced by social media scrapers is garbage, so before you develop that hypothesis you will spend three months removing spam and bogus material. That’s why people who actually make money from social media analytics — like Digitas — still deploy human analysts to sort through the garbage by hand.

  11. John Ball on December 1st, 2011 10:48 pm

    Curt, great post and many good comments from your readers. I like the term “data mustering” although I’d break it into 3 components:

    a/ cleaning up messy, inconsistent, or incomplete data;
    b/ augmenting data with derived attributes coming from the business domain expertise that improve model performance;
    c/ enhancing the data with mathematical transforms that improve model performance

    And for those 3 components:
    a/ is not just about ETL, it requires 2 types of actions:
    a1/ structural cleanup (dedup, cleanup of IDs, associating cookies to customer IDs from CRM, etc.). You’re doomed if you have the wrong data to perform joins, the wrong IDs etc.
    a2/ other types of cleanup (such as missing values, out-of-range values, outliers). This should be handled by robust modeling techniques.

    b/ augmenting the data with derived data from business domain knowledge: e.g. knowing that the number of calls for last week during work hours is a churn predictor for a Telco should be obtained through a collaboration between business users and data broker, but the business should be driving. And you should probably have a framework that allows reuse and broad definition of the derived data so that it becomes an IP asset of the company and not a bespoke project every time you have a new business question.
    c/ enhancing through mathematical transforms: This can be managed directly by the modeling technology (automation): e.g. knowing that you have to encode a variable in such a way to improve the model lift.

    Data miners spend an inordinate amount of time on “data mustering” (for both modeling and deployment). So I am not so sure I would agree with the statement that it isn’t a bottleneck to agility in today’s market. Of course, we have a pretty strong view on the right approach, but I’ll spare you a direct plug for our products here ☺

  12. Curt Monash on December 1st, 2011 11:38 pm

    John,

    I did NOT mean for “data mustering” to equate to “data preparation”. Rather, I meant for it to equate more to data warehousing.

  13. Comments on SAS : DBMS 2 : DataBase Management System Services on February 8th, 2012 5:51 pm

    [...] SAS has not been strong in helping its users do agile predictive analytics. [...]

  14. Agile Analytics: Overview « Port Fortune on November 20th, 2012 3:51 pm

    [...] Monash recently published an excellent two-part blog on the subject (here and [...]

  15. What matters in investigative analytics? | DBMS 2 : DataBase Management System Services on October 10th, 2013 7:58 am

    [...] an obvious demand for agile predictive analytics. But if agility were all that mattered, KXEN — which excels in agility — would probably [...]
