Predictive modeling and advanced analytics
Discussion of technologies and vendors in the overlapping areas of predictive analytics, predictive modeling, data mining, machine learning, Monte Carlo analysis, and other “advanced” analytics.
How is the surveillance data used?
Over the past week, discussion has exploded about US government surveillance. After summarizing, as best I could, what data the government appears to collect, now I’d like to consider what they actually do with it. More precisely, I’d like to focus on the data’s use(s) in combating US-soil terrorism. In a nutshell:
- Reporting is persuasive that electronic surveillance data is helpful in following up on leads and tips obtained by other means.
- Reporting is not persuasive that electronic surveillance data on its own uncovers or averts many terrorist plots.
- With limited exceptions, neither evidence nor logic suggests that data mining or predictive modeling does much to prevent domestic terrorist attacks.
Consider the example of Tamerlan Tsarnaev:
In response to this 2011 request, the FBI checked U.S. government databases and other information to look for such things as derogatory telephone communications, possible use of online sites associated with the promotion of radical activity, associations with other persons of interest, travel history and plans, and education history.
While that response was unsuccessful in preventing a dramatic act of terrorism, at least they tried.
As for actual success stories — well, that’s a bit tough. In general, there are few known examples of terrorist plots being disrupted by law enforcement in the United States, except for fake plots engineered to draw terrorist-leaning individuals into committing actual crimes. One of those examples, that of Najibullah Zazi, was indeed based on an intercepted email — but the email address itself was uncovered through more ordinary anti-terrorism efforts.
As for machine learning/data mining/predictive modeling, I’ve never seen much of a hint of it being used in anti-terrorism efforts, whether in the news or in my own discussions inside the tech industry. And I think there’s a great reason for that — what would they use for a training set? Here’s what I mean. Read more
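To make the training-set problem concrete, here's a toy sketch of my own (it has nothing to do with any real anti-terrorism system, and the features are deliberately pure noise): with almost no positive examples to learn from, a standard classifier just learns to say "no" to everyone.

```python
# Toy illustration only: with a handful of positives in a sea of negatives,
# a standard classifier "succeeds" by predicting the majority class.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 20))  # 100,000 people, 20 made-up features
y = np.zeros(100_000, dtype=int)
y[:5] = 1                           # only 5 labeled positives to learn from

model = LogisticRegression(max_iter=1000).fit(X, y)
predictions = model.predict(X)
print("flagged:", predictions.sum())            # almost certainly 0
print("accuracy:", (predictions == y).mean())   # ~0.99995, and useless
```

High accuracy, zero usefulness; and real plots are far rarer than 5 in 100,000.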
Categories: Application areas, Predictive modeling and advanced analytics, RDF and graphs, Surveillance and privacy, Text | 9 Comments |
WibiData and its Kiji technology
My clients at WibiData:
- Think they’re an application software company …
- … but actually are talking about what I call analytic application subsystems.
- Haven’t announced or shipped any of those either …
- … but will shortly.
- Have meanwhile shipped some cool enabling technology.
- Name their products after sushi restaurants.
Yeah, I like these guys. 🙂
If you’re building an application that “obviously” calls for a NoSQL database, and which has a strong predictive modeling aspect, then WibiData has thought more cleverly about what you need than most vendors I can think of. More precisely, WibiData has thought cleverly about your data management, movement, crunching, serving, and integration. For pure modeling sophistication, you should look elsewhere — but WibiData will gladly integrate with or execute those models for you.
WibiData’s enabling technology, now called Kiji, is a collection of modules, libraries, and so on — think Spring — running over Hadoop/HBase. Except for some newfound modularity, it is much like what I described at the time of WibiData’s launch or what WibiData further disclosed a few months later. Key aspects include:
- A way to define schemas in HBase, including ones that change as rapidly as consumer-interaction applications require (see the sketch after this list).
- An analytic framework called “Produce/Gather”, which can execute at human real-time speeds (via its own execution engine) or with higher throughput in batch mode (by invoking Hadoop MapReduce).
- Enough load capabilities, Hive interaction, and so on to get data into the proper structure in Kiji in the first place.
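For the curious, here's roughly what the underlying HBase substrate looks like, sketched with the happybase Python client rather than Kiji's actual APIs. It assumes a local HBase Thrift server, and the table and column names are my inventions.

```python
# Raw HBase via happybase, to show the substrate Kiji builds on; this is
# NOT Kiji's API. Column qualifiers can be added on the fly, and cell
# versioning accommodates rapidly changing values.
import happybase

connection = happybase.Connection('localhost')  # assumes HBase Thrift server
connection.create_table('users', {'info': dict(max_versions=5)})

table = connection.table('users')
table.put(b'user42', {b'info:name': b'Ada', b'info:last_page': b'/home'})
table.put(b'user42', {b'info:last_page': b'/checkout'})  # newer version

print(table.row(b'user42'))                                   # latest values
print(table.cells(b'user42', b'info:last_page', versions=5))  # value history
```

The point of the versioned, dynamically extensible columns is that "schema change" often amounts to just writing new qualifiers, which is what makes the rapid-evolution story plausible.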
Categories: Hadoop, HBase, NoSQL, Open source, Predictive modeling and advanced analytics, WibiData | 5 Comments |
Some stuff I’m working on
1. I have some posts up on Strategic Messaging. The most recent are overviews of messaging, pricing, and positioning.
2. Numerous vendors are blending SQL and JSON management in their short-request DBMS. It will take some more work for me to have a strong opinion about the merits/demerits of various alternatives.
The default implementation — one example would be Clustrix’s — is to stick the JSON into something like a BLOB/CLOB field (Binary/Character Large Object), index on individual values, and treat those indexes just like any others for the purpose of SQL statements. Drawbacks include:
- You have to store or retrieve the JSON in whole documents at a time.
- If you are spectacularly careless, you could write JOINs with odd results.
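To illustrate the pattern (and the first drawback), here's a minimal sketch using SQLite's JSON functions as a stand-in; it's no particular vendor's implementation, and it assumes a SQLite build that includes the JSON1 functions.

```python
# A minimal sketch of the JSON-in-a-LOB pattern; table and field names
# are made up, and this is a stand-in for no particular vendor.
import json
import sqlite3

con = sqlite3.connect(":memory:")  # assumes SQLite with JSON1 compiled in
con.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
con.execute("INSERT INTO docs (body) VALUES (?)",
            (json.dumps({"name": "Ada", "city": "London"}),))

# Index one value inside the document; SQL then uses it like any index.
con.execute("CREATE INDEX docs_name ON docs (json_extract(body, '$.name'))")

# Note the drawback: the query hands back the *whole* document.
row = con.execute(
    "SELECT body FROM docs WHERE json_extract(body, '$.name') = ?",
    ("Ada",)).fetchone()
print(json.loads(row[0]))
```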
IBM DB2 is one recent arrival to the JSON party. Unfortunately, I forgot to ask whether IBM’s JSON implementation was based on IBM DB2 pureXML when I had the chance, and IBM hasn’t gotten around to answering my followup query.
3. Nor has IBM gotten around to answering my followup queries on the subject of BLU, an interesting-sounding columnar option for DB2.
4. Numerous clients have asked me whether they should be active in DBaaS (DataBase as a Service). After all, Amazon, Google, Microsoft, Rackspace and salesforce.com are all in that business in some form, and other big companies have dipped toes in as well. Read more
Analytic application themes
I talk with a lot of companies, and repeatedly hear some of the same application themes. This post is my attempt to collect some of those ideas in one place.
1. So far, the buzzword of the year is “real-time analytics”, generally with “operational” or “big data” included as well. I hear variants of that positioning from NewSQL vendors (e.g. MemSQL), NoSQL vendors (e.g. Aerospike), BI stack vendors (e.g. Platfora), application-stack vendors (e.g. WibiData), log analysis vendors (led by Splunk), data management vendors (e.g. Cloudera), and of course the CEP industry.
Yeah, yeah, I know — not all the named companies are in exactly the right market category. But that’s hard to avoid.
Why this gold rush? On the demand side, there’s a real or imagined need for speed. On the supply side, I’d say:
- There are vast numbers of companies offering data-management-related technology. They need ways to differentiate.
- Doing analytics at short-request speeds is an obvious data-management-related challenge, and one not yet comprehensively addressed.
2. More generally, most of the applications I hear about are analytic, or have a strong analytic aspect. The three biggest areas — and these overlap — are:
- Customer interaction
- Network and sensor monitoring
- Game and mobile application back-ends
Also arising fairly frequently are:
- Algorithmic trading
- Anti-fraud
- Risk measurement
- Law enforcement/national security
- Healthcare
- Stakeholder-facing analytics
I’m hearing less about quality, defect tracking, and equipment maintenance than I used to, but those application areas have been ebbing and flowing for decades anyway.
Should you offer “complete” analytic applications?
WibiData is essentially on this trajectory:
- Started with platform-ish technology.
- Selling analytic application subsystems, focused for now on personalization.
- Hopeful of selling complete analytic applications in the future.
The same, it turns out, is true of Causata.* Talking with them both the same day led me to write this post. Read more
Categories: Hadapt, HBase, Market share and customer counts, PivotLink, Predictive modeling and advanced analytics, WibiData | 5 Comments |
It’s hard to make data easy to analyze
It’s hard to make data easy to analyze. While everybody seems to realize this — a few marketeers perhaps aside — some remarks might be useful even so.
Many different technologies purport to make data easy, or easier, to analyze; so many, in fact, that cataloguing them all is forbiddingly hard. Major claims, and some technologies that make them, include:
- “We get data into a form in which it can be analyzed.” This is the story behind, among others:
- Most of the data integration and ETL (Extract/Transform/Load) industries, software vendors and consulting firms alike.
- Many things that purport to be “analytic applications” or data warehouse “quick starts”.
- “Data reduction” use cases in event processing.*
- Text analytics tools.
- Splunk.
- “Forget all that transformation foofarah — just load (or write) data into our thing and start analyzing it immediately.” This at various times has been much of the story behind:
- Relational DBMS, according to their inventor E. F. Codd.
- MOLAP (Multidimensional OnLine Analytic Processing), also according to RDBMS inventor E. F. Codd.
- Any kind of analytic DBMS, or general purpose DBMS used for data warehousing.
- Newer kinds of analytic DBMS that are faster than older kinds.
- The “data mart spin-out” feature of certain analytic DBMS.
- In-memory analytic data stores.
- Hadoop.
- NoSQL DBMS that have a few analytic features.
- TokuDB, similarly.
- Electronic spreadsheets, from VisiCalc to Datameer.
- Splunk.
- “Our tools help you with specific kinds of analyses or analytic displays.” This is the story underlying, among others:
- The business intelligence industry.
- The predictive analytics industry.
- Algorithmic trading use cases in complex event processing.*
- Some analytic applications.
- Splunk.
*Complex event/stream processing terminology is always problematic.
My thoughts on all this start: Read more
Amazon Redshift and its implications
Merv Adrian and Doug Henschen both reported more details about Amazon Redshift than I intend to; see also the comments on Doug’s article. I did talk with Rick Glick of ParAccel a bit about the project, and he noted:
- Amazon Redshift is missing parts of ParAccel, notably the extensibility framework.
- ParAccel did some engineering to make its DBMS run better in the cloud.
- Amazon did some engineering in the areas it knows better than ParAccel — cloud provisioning, cloud billing, and so on.
“We didn’t want to do the deal on those terms” comments from other companies suggest ParAccel’s main financial take from the deal is an already-reported venture investment.
The cloud-related engineering was mainly around communications, e.g. strengthening error detection/correction to make up for the lack of dedicated switches. In general, Rick seemed more positive on running in the (Amazon) cloud than analytic RDBMS vendors have been in the past.
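For flavor, here's a generic illustration of what application-level error detection can look like; this is emphatically not ParAccel's code, just the general idea of checksumming messages rather than trusting the network path.

```python
# Generic application-level checksumming; a toy for the general idea only.
import struct
import zlib

def frame(payload: bytes) -> bytes:
    """Prefix the payload with a CRC32 so the receiver can verify it."""
    return struct.pack(">I", zlib.crc32(payload)) + payload

def unframe(message: bytes) -> bytes:
    """Verify and strip the CRC32 prefix, or raise so the sender retries."""
    crc, payload = struct.unpack(">I", message[:4])[0], message[4:]
    if zlib.crc32(payload) != crc:
        raise IOError("corrupt message; ask for retransmission")
    return payload

assert unframe(frame(b"data block 17")) == b"data block 17"
```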
So who should and will use Amazon Redshift? For starters, I’d say: Read more
ParAccel update
In connection with Amazon’s Redshift announcement, ParAccel reached out, and so I talked with them for the first time in a long while. At the highest level:
- ParAccel now has 60+ customers, up from 30+ two years ago and 40ish soon thereafter.
- ParAccel is now focusing its development and marketing on analytic platform capabilities more than raw database performance.
- ParAccel is focusing on working alongside other analytic data stores — relational or Hadoop — rather than supplanting them.
There wasn’t time for a lot of technical detail, but I gather that the bit about working alongside other data stores:
- Is relatively new.
- Works via SELECT statements that reach out to the other data stores (a toy sketch of the concept follows this list).
- Is called “on-demand integration”.
- Is built within ParAccel’s extensibility/analytic platform framework.
- Uses HCatalog when reaching into Hadoop.
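Here's a toy model of the on-demand integration concept. It's my own illustration, far simpler than whatever ParAccel actually does, and all the names in it are made up.

```python
# Toy model of "on-demand integration": a local engine satisfies part of a
# SELECT by reaching into an external store at query time.
import sqlite3

def fetch_remote_rows(table_name):
    """Stand-in for a connector that pulls rows from Hadoop or another RDBMS."""
    fake_remote = {"hdfs_clicks": [(1, 40), (2, 7)]}  # (user_id, clicks)
    return fake_remote[table_name]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (user_id INTEGER, name TEXT)")
con.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])

# At query time, materialize the remote data just long enough to join on it.
con.execute("CREATE TEMP TABLE hdfs_clicks (user_id INTEGER, clicks INTEGER)")
con.executemany("INSERT INTO hdfs_clicks VALUES (?, ?)",
                fetch_remote_rows("hdfs_clicks"))
for row in con.execute("""SELECT u.name, c.clicks
                          FROM users u JOIN hdfs_clicks c USING (user_id)"""):
    print(row)
```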
Also, it seems that ParAccel:
- Is in the early stages of writing its own analytic functions.
- Bundles Fuzzy Logix and actually has some users for that.
Categories: Amazon and its cloud, Cloud computing, Data warehousing, Hadoop, Market share and customer counts, ParAccel, Predictive modeling and advanced analytics, Specific users | 5 Comments |
Analytic application subsystems
Imagine a website whose purpose is to encourage consumers to take actions — for example to click on an ad, click on the next page, or actually make a purchase. Best practices for such a site include:
- An ever-evolving user experience, informed by — among other factors — creativity, brand identity, the vendor’s evolving product line itself, and …
- … predictive modeling.
- Personalization based on predictive modeling.
Those predictive models themselves will keep changing (a toy retraining loop follows the list), because:
- Organizations learn.
- Consumer tastes change.
- More or different kinds of data keep becoming available.
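A toy retraining loop makes the point; this is my framing, not WibiData's, and it needs a recent scikit-learn for the "log_loss" spelling. The "application" is less a frozen model than a pipeline that keeps refreshing it.

```python
# Toy retraining treadmill: the model is updated as interaction data arrives.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(1)
model = SGDClassifier(loss="log_loss")  # logistic regression, updatable online

for day in range(7):  # one incremental update per day's interaction data
    X_day = rng.normal(size=(1000, 10))                   # made-up features
    y_day = (X_day[:, 0] + 0.1 * day > 0.5).astype(int)   # tastes drift
    model.partial_fit(X_day, y_day, classes=[0, 1])

# Score a fresh visitor with the current model; tomorrow's model will differ.
print(model.predict_proba(rng.normal(size=(1, 10)))[0, 1])
```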
In that situation, what would it mean to offer the website owner a predictive modeling “application”? Read more
Real-time confusion
I recently proposed a 2×2 matrix of BI use cases:
- Is there an operational business process involved?
- Is there a focus on root cause analysis?
Let me now introduce another 2×2 matrix of analytic scenarios:
- Is there a compelling need for super-fresh data?
- Who’s consuming the results — humans or machines?
My point is that there are at least three different cool things people might think about when they want their analytics to be very fast:
- Fast investigative analytics — e.g., business intelligence with great query response.
- Computations on very fresh data, presented to humans — e.g. “heartbeat” graphics monitoring a network.
- Computations on very fresh data, presented back to a machine — e.g., a recommendation engine that makes good use of data about a user’s last few seconds of actions (sketched below).
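Here's a stdlib-only sketch of what the data handling for that third case might look like; the 10-second window and the action names are my inventions.

```python
# Sliding window of the last few seconds of user actions; toy code only.
import time
from collections import deque

WINDOW_SECONDS = 10
recent = deque()  # (timestamp, action) pairs, oldest first

def record(action, now=None):
    now = time.time() if now is None else now
    recent.append((now, action))
    while recent and recent[0][0] < now - WINDOW_SECONDS:
        recent.popleft()  # expire actions older than the window

def features():
    actions = [a for _, a in recent]
    return {"n_recent_actions": len(actions),
            "just_viewed_product": "view_product" in actions}

record("view_product")
record("add_to_cart")
print(features())  # hand these to whatever model picks the recommendation
```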
There’s also one slightly boring one that however drives a lot of important applications: Read more