Predictive modeling and advanced analytics
Discussion of technologies and vendors in the overlapping areas of predictive analytics, predictive modeling, data mining, machine learning, Monte Carlo analysis, and other “advanced” analytics.
I’m taking a few weeks defocused from work, as a kind of grandpaternity leave. That said, the venue for my Dances of Infant Calming is a small-but-nice apartment in San Francisco, so a certain amount of thinking about tech industries is inevitable. I even found time last Tuesday to meet or speak with my clients at WibiData, MemSQL, Cloudera, Citus Data, and MongoDB. And thus:
1. I’ve been sloppy in my terminology around “geo-distribution”, in that I don’t always make it easy to distinguish between:
- Storing different parts of a database in different geographies, often for reasons of data privacy regulatory compliance.
- Replicating an entire database into different geographies, often for reasons of latency and/or availability/ disaster recovery,
The latter case can be subdivided further depending on whether multiple copies of the data can accept first writes (aka active-active, multi-master, or multi-active), or whether there’s a clear single master for each part of the database.
What made me think of this was a phone call with MongoDB in which I learned that the limit on number of replicas had been raised from 12 to 50, to support the full-replication/latency-reduction use case.
2. Three years ago I posted about agile (predictive) analytics. One of the points was:
… if you change your offers, prices, ad placement, ad text, ad appearance, call center scripts, or anything else, you immediately gain new information that isn’t well-reflected in your previous models.
Subsequently I’ve been hearing more about predictive experimentation such as bandit testing. WibiData, whose views are influenced by a couple of Very Famous Department Store clients (one of which is Macy’s), thinks experimentation is quite important. And it could be argued that experimentation is one of the simplest and most direct ways to increase the value of your data.
3. I’d further say that a number of developments, trends or possibilities I’m seeing are or could be connected. These include agile and experimental predictive analytics in general, as noted in the previous point, along with: Read more
Following up on my notes on predictive modeling post from three weeks ago, I’d like to tackle some areas of recurring confusion.
Why are we modeling?
Ultimately, there are two reasons to model some aspect of your business:
- You generally want insight and understanding.
- This is analogous to why you might want to do business intelligence.
- It commonly includes a search for causality, whether or not “root cause analysis” is exactly the right phrase to describe the process.
- You want to do calculations from the model to drive wholly or partially automated decisions.
- A big set of examples can be found in website recommenders and personalizers.
- Another big set of examples can be found in marketing campaigns.
- For an example of partial automation, consider a tool that advises call center workers.
How precise do models need to be?
Use cases vary greatly with respect to the importance of modeling precision. If you’re doing an expensive mass mailing, 1% additional accuracy is a big deal. But if you’re doing root cause analysis, a 10% error may be immaterial.
Who is doing the work?
It is traditional to have a modeling department, of “data scientists” or SAS programmers as the case may be. While it seems cool to put predictive modeling straight in the hands of business users — some business users, at least — it’s rare for them to use predictive modeling tools more sophisticated than Excel. For example, KXEN never did all that well.
That said, I support the idea of putting more modeling in the hands of business users. Just be aware that doing so is still a small business at this time.
“Operationalizing” predictive models
The topic of “operationalizing” models arises often, and it turns out to be rather complex. Usually, to operationalize a model, you need: Read more
A common marketing theme in the 2010s decade has been to claim that you make analytics available to many business users, as opposed to your competition, who only make analytics available to (pick one):
- Specialists (with “PhD”s).
- Fewer business users (a thinner part of the horizontally segmented pyramid — perhaps inverted — on your marketing slide, not to be confused with the horizontally segmented pyramids — perhaps inverted — on your competition’s marketing slides).
Versions of this claim were also common in the 1970s, 1980s, 1990s and 2000s.
Some of that is real. In particular:
- Early adoption of analytic technology is often in line-of-business departments.
- Business users on average really do get more numerate over time, my three favorite examples of that being:
- Statistics is taught much more in business schools than it used to be.
- Statistics is taught much more in high schools than it used to be.
- Many people use Excel.
Even so, for most analytic tools, power users tend to be:
- People with titles or roles like “business analyst”.
- More junior folks pulling things together for their bosses.
- A hardcore minority who fall into neither of the first two categories.
Asserting otherwise is rarely more than marketing hype.
- “Freeing business analysts from IT” (August, 2014)
Datameer checked in, having recently announced general availability of Datameer 5.0. So far as I understood, Datameer is still clearly in the investigative analytics business, in that:
- Datameer does business intelligence, but not at human real-time speeds. Datameer query durations are sometimes sub-minute, but surely not sub-second.
- Datameer also does lightweight predictive analytics/machine learning — k-means clustering, decision trees, and so on.
Key aspects include:
- Datameer runs straight against Hadoop.
- Like many other analytic offerings, Datameer is meant to be “self-service”, for line-of-business business analysts, and includes some “data preparation”. Datameer also has had some data profiling since Datameer 4.0.
- The main way of interacting with Datameer seems to be visual analytic programming. However, I Datameer has evolved somewhat away from its original spreadsheet metaphor.
- Datameer’s primitives resemble those you’d find in SQL (e.g. JOINs, GROUPBYs). More precisely, that would be SQL with a sessionization extension; e.g., there’s a function called GROUPBYGAP.
- Datameer lets you write derived data back into Hadoop.
|Categories: Business intelligence, Databricks, Spark and BDAS, Datameer, Hadoop, Log analysis, Market share and customer counts, Predictive modeling and advanced analytics, Web analytics||5 Comments|
As planned, I’m getting more active in predictive modeling. Anyhow …
1. I still believe most of what I said in a July, 2013 predictive modeling catch-all post. However, I haven’t heard as much subsequently about Ayasdi as I had expected to.
2. The most controversial part of that post was probably the claim:
I think the predictive modeling state of the art has become:
- Cluster in some way.
- Model separately on each cluster.
- It is always possible to instead go with a single model formally.
- A lot of people think accuracy, ease-of-use, or both are better served by a true single-model approach.
- Conversely, if you have a single model that’s pretty good, it’s natural to look at the subset of the data for which it works poorly and examine that first. Voila! You’ve just done a kind of clustering.
3. Nutonian is now a client. I just had my first meeting with them this week. To a first approximation, they’re somewhat like KXEN (sophisticated math, non-linear models, ease of modeling, quasi-automagic feature selection), but with differences that start: Read more
|Categories: Ayasdi, Databricks, Spark and BDAS, Log analysis, Nutonian, Predictive modeling and advanced analytics, Revolution Analytics, Scientific research, Web analytics||5 Comments|
I’m on record as noting and agreeing with an industry near-consensus that Spark, rather than Tez, will be the replacement for Hadoop MapReduce. I presumed that Hortonworks, which is pushing Tez, disagreed. But Shaun Connolly of Hortonworks suggested a more nuanced view. Specifically, Shaun tweeted thoughts including:
Tez vs Spark = Apples vs Oranges.
Spark is general-purpose engine with elegant APIs for app devs creating modern data-driven apps, analytics, and ML algos.
Tez is a framework for expressing purpose-built YARN-based DAGs; its APIs are for ISVs & engine/tool builders who embed it
[For example], Hive embeds Tez to convert its SQL needs into purpose-built DAGs expressed optimally and leveraging YARN
That said, I haven’t yet had a chance to understand what advantages Tez might have over Spark in the use cases that Shaun relegates it to.
- The Twitter discussion with Shaun was a spin-out from my research around streaming for Hadoop.
|Categories: Data warehousing, Databricks, Spark and BDAS, Hadoop, Hortonworks, Predictive modeling and advanced analytics||6 Comments|
1. I wish I had some good, practical ideas about how to make a political difference around privacy and surveillance. Nothing else we discuss here is remotely as important. I presumably can contribute an opinion piece to, more or less, the technology publication(s) of my choice; that can have a small bit of impact. But I’d love to do better than that. Ideas, anybody?
2. A few thoughts on cloud, colocation, etc.:
- The economies of scale of colocation-or-cloud over operating your own data center are compelling. Most of the reasons you outsource hardware manufacture to Asia also apply to outsourcing data center operation within the United States. (The one exception I can think of is supply chain.)
- The arguments for cloud specifically over colocation are less persuasive. Colo providers can even match cloud deployments in rapid provisioning and elastic pricing, if they so choose.
- Surely not coincidentally, I am told that Rackspace is deemphasizing cloud, reemphasizing colocation, and making a big deal out of Open Compute. In connection with that, Rackspace has pulled back from its leadership role in OpenStack.
- I’m hearing much more mention of Amazon Redshift than I used to. It seems to have a lot of traction as a simple and low-cost option.
- I’m hearing less about Elastic MapReduce than I used to, although I imagine usage is still large and growing.
- In general, I get the impression that progress is being made in overcoming the inherent difficulties in cloud (and even colo) parallel analytic processing. But it all still seems pretty vague, except for the specific claims being made for traction of Redshift, EMR, and so on.
- Teradata recently told me that in colocation pricing, it is common for floor space to be everything, with power not separately metered. But I don’t think that trend is a big deal, as it is not necessarily permanent.
- Cloud hype is of course still with us.
- Other than the above, I stand by my previous thoughts on appliances, clusters and clouds.
3. As for the analytic DBMS industry: Read more
Everybody is confused about privacy and surveillance. So I’m renewing my efforts to consciousness-raise within the tech community. For if we don’t figure out and explain the issues clearly enough, there isn’t a snowball’s chance in Hades our lawmakers will get it right without us.
How bad is the confusion? Well, even Edward Snowden is getting it wrong. A Wired interview with Snowden says:
“If somebody’s really watching me, they’ve got a team of guys whose job is just to hack me,” he says. “I don’t think they’ve geolocated me, but they almost certainly monitor who I’m talking to online. Even if they don’t know what you’re saying, because it’s encrypted, they can still get a lot from who you’re talking to and when you’re talking to them.”
That is surely correct. But the same article also says:
“We have the means and we have the technology to end mass surveillance without any legislative action at all, without any policy changes.” The answer, he says, is robust encryption. “By basically adopting changes like making encryption a universal standard—where all communications are encrypted by default—we can end mass surveillance not just in the United States but around the world.”
That is false, for a myriad of reasons, and indeed is contradicted by the first excerpt I cited.
What privacy/surveillance commentators evidently keep forgetting is:
- There are many kinds of privacy-destroying information. I think people frequently overlook just how many kinds there are.
- Many kinds of organization capture that information, can share it with each other, and gain benefits from eroding or destroying privacy. Similarly, I think people overlook just how pervasive the incentive is to snoop.
- Privacy is invaded through a variety of analytic techniques applied to that information.
So closing down a few vectors of privacy attack doesn’t solve the underlying problem at all.
Worst of all, commentators forget that the correct metric for danger is not just harmful information use, but chilling effects on the exercise of ordinary liberties. But in the interest of space, I won’t reiterate that argument in this post.
Perhaps I can refresh your memory why each of those bulleted claims is correct. Major categories of privacy-destroying information (raw or derived) include:
- The actual content of your communications – phone calls, email, social media posts and more.
- The metadata of your communications — who you communicate with, when, how long, etc.
- What you read, watch, surf to or otherwise pay attention to.
- Your purchases, sales and other transactions.
- Video images, via stationary cameras, license plate readers in police cars, drones or just ordinary consumer photography.
- Monitoring via the devices you carry, such as phones or medical monitors.
- Your health and physical state, via those devices, but also inferred from, for example, your transactions or search engine entries.
- Your state of mind, which can be inferred to various extents from almost any of the other information areas.
- Your location and movements, ditto. Insurance companies also want to put monitors in cars to track your driving behavior in detail.
|Categories: Health care, Predictive modeling and advanced analytics, Surveillance and privacy, Telecommunications||2 Comments|
I’ve talked with many companies recently that believe they are:
- Focused on building a great data management and analytic stack for log management …
- … unlike all the other companies that might be saying the same thing …
- … and certainly unlike expensive, poorly-scalable Splunk …
- … and also unlike less-focused vendors of analytic RDBMS (which are also expensive) and/or Hadoop distributions.
At best, I think such competitive claims are overwrought. Still, it’s a genuinely important subject and opportunity, so let’s consider what a great log management and analysis system might look like.
Much of this discussion could apply to machine-generated data in general. But right now I think more players are doing product management with an explicit conception either of log management or event-series analytics, so for this post I’ll share that focus too.
A short answer might be “Splunk, but with more analytic functionality and more scalable performance, at lower cost, plus numerous coupons for free pizza.” A more constructive and bottoms-up approach might start with: Read more
Many of the companies I talk with boast of freeing business analysts from reliance on IT. This, to put it mildly, is not a unique value proposition. As I wrote in 2012, when I went on a history of analytics posting kick,
- Most interesting analytic software has been adopted first and foremost at the departmental level.
- People seem to be forgetting that fact.
In particular, I would argue that the following analytic technologies started and prospered largely through departmental adoption:
- Fourth-generation languages (the analytically-focused ones, which in fact started out being consumed on a remote/time-sharing basis)
- Electronic spreadsheets
- 1990s-era business intelligence
- Fancy-visualization business intelligence
- Predictive analytics
- Text analytics
- Rules engines
What brings me back to the topic is conversations I had this week with Paxata and Metanautix. The Paxata story starts:
- Paxata is offering easy — and hopefully in the future comprehensive — “data preparation” tools …
- … that are meant to be used by business analysts rather than ETL (Extract/Transform/Load) specialists or other IT professionals …
- … where what Paxata means by “data preparation” is not specifically what a statistician would mean by the term, but rather generally refers to getting data ready for business intelligence or other analytics.
Metanautix seems to aspire to a more complete full-analytic-stack-without-IT kind of story, but clearly sees the data preparation part as a big part of its value.
If there’s anything new about such stories, it has to be on the transformation side; BI tools have been helping with data extraction since — well, since the dawn of BI. Read more
|Categories: Business intelligence, Datameer, EAI, EII, ETL, ELT, ETLT, Predictive modeling and advanced analytics, Progress, Apama, and DataDirect||10 Comments|