I recently proposed a 2×2 matrix of BI use cases:
- Is there an operational business process involved?
- Is there a focus on root cause analysis?
Let me now introduce another 2×2 matrix of analytic scenarios:
- Is there a compelling need for super-fresh data?
- Who’s consuming the results — humans or machines?
My point is that there are at least three different cool things people might have in mind when they say they want their analytics to be very fast:
- Fast investigative analytics — e.g., business intelligence with great query response.
- Computations on very fresh data, presented to humans — e.g., “heartbeat” graphics monitoring a network.
- Computations on very fresh data, presented back to a machine — e.g., a recommendation engine that makes good use of data about a user’s last few seconds of actions. (A minimal sketch of this appears below.)
There’s also a fourth, slightly boring one that nonetheless drives a lot of important applications:
- Analytics fed to machines on not-so-fresh data — e.g., call center software that doesn’t check whether the caller used the company’s website within the last minute.
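To make the third category concrete, here is a minimal Python sketch of a recommendation scorer that boosts items related to a user’s last few seconds of actions. Everything in it (the field names, the five-second window, the toy prefix-matching similarity rule) is hypothetical illustration, not any particular vendor’s design.

```python
import time
from collections import defaultdict, deque

RECENCY_WINDOW_SECONDS = 5.0          # hypothetical "last few seconds" window
recent_actions = defaultdict(deque)   # user_id -> deque of (timestamp, item_id)

def record_action(user_id, item_id, now=None):
    """Append one action and drop anything older than the window."""
    now = time.time() if now is None else now
    q = recent_actions[user_id]
    q.append((now, item_id))
    while q and now - q[0][0] > RECENCY_WINDOW_SECONDS:
        q.popleft()

def recommend(user_id, candidates, now=None):
    """Rank candidate items, boosting those related to very fresh actions.

    The "related" test here is toy string-prefix matching, just to keep
    the sketch self-contained; a real system would consult a similarity
    model trained offline.
    """
    now = time.time() if now is None else now
    fresh = {item for ts, item in recent_actions[user_id]
             if now - ts <= RECENCY_WINDOW_SECONDS}

    def score(candidate):
        # The base ranking could come from a batch-trained model; the
        # fresh-data boost is the part that needs low-latency plumbing.
        return sum(1 for f in fresh if candidate[:6] == f[:6])

    return sorted(candidates, key=score, reverse=True)

# The consumer here is a machine (the serving layer), not a person:
record_action("u42", "shoes-123")
print(recommend("u42", ["shoes-456", "hats-9", "books-1"]))
```

The point of the sketch is latency structure, not the scoring itself: the write path (record_action) and the read path (recommend) each touch only a few bytes per call, which is what makes millisecond-scale operation plausible.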
Every so often, some vendor gets the bright idea to take whatever it offers in one of these areas and call it “real-time”. Confusion invariably ensues, for reasons including:
- If there’s a human being in the loop, then at best that’s human real-time.
- If super-fresh data is a nice-to-have rather than essential to the use case, then it’s not a true real-time story at all.
Thus, I think the industry would be better off if the phrase “real-time” were never used again. Monash’s First Law of Commercial Semantics teaches why this isn’t likely; but a guy can dream, can’t he?
I write about fast analytic computation often; my recent posts on Platfora, Impala, Teradata, and Hadapt are just a few of literally hundreds. So let’s focus instead on the other two cool areas. For starters:
- Any one act of receiving or transmitting (a small quantity of) data can, in principle, be done in milliseconds.
- The most basic point of “streaming” is to ensure that small reads or (more likely to be the problem) writes don’t bottleneck each other.
- Computations on fresh data are usually simple. If you can’t do a computation quickly at all, then in particular you can’t do it quickly on fresh data. But simple computations include a lot of what goes on in data manipulation (see the sketch after this list), notably:
- Filters, pattern matching, and rule-checking.
- Arithmetic with small numbers of inputs.
- Running-total aggregates — i.e., “counters”.
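To illustrate how simple that kind of processing can be, here is a minimal Python sketch that applies a filter, a threshold rule, and per-key running-total counters to one small event at a time. The event fields and the threshold are made up for illustration; they don’t come from any particular streaming product.

```python
from collections import defaultdict

counters = defaultdict(int)   # running-total aggregates ("counters") per key
VALUE_THRESHOLD = 1000        # hypothetical rule-checking threshold

def process_event(event):
    """Handle one small, fresh event: filter, rule-check, count.

    `event` is assumed to look like {"user": "u42", "action": "click",
    "value": 3}; the field names are illustrative only.
    """
    # Filter / pattern match: drop event types we don't care about.
    if event.get("action") not in ("click", "purchase"):
        return None

    # Arithmetic and rule-checking on a small number of inputs.
    alert = event.get("value", 0) > VALUE_THRESHOLD

    # Running-total counter, updated in constant time per event.
    counters[event["user"]] += 1

    return {"user": event["user"],
            "running_count": counters[event["user"]],
            "alert": alert}

# A stream consumer would call process_event() once per arriving event:
for e in [{"user": "u42", "action": "click", "value": 3},
          {"user": "u42", "action": "purchase", "value": 2500},
          {"user": "u7", "action": "scroll", "value": 1}]:
    print(process_event(e))
```

Each call does constant work on a handful of fields, which is why, per the points above, the hard part of streaming is usually the plumbing around such computations rather than the computations themselves.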
Based on all that, it would seem that:
- Computing on super-fresh data for human or machine consumption should be very similar problems …
- … unless the machines demand much lower latency than humans would ever need. (Paradigmatic example: Algorithmic trading.)
But while that’s probably true for enabling technology, application demand (and supply) tends to be more focused. In particular:
1. The CEP/stream processing industry of course lives off of algorithmic trading — machines informing machines. They also make various efforts at super-fresh BI, but don’t necessarily get much traction outside of the investment vertical.
2. Most NoSQL and NewSQL vendors that I know — to the extent they have customers at all — have users in gaming, some sort of ad serving, and/or some sort of personalization. Usually, there are counters and/or model scoring somewhere in the story — i.e., machines informing machines. So one way or another, they’re all active in my category of “computations on very fresh data, presented back to a machine.” But I hear fewer stories from them in the area of super-fresh BI.
3. Splunk makes its living from super-fresh BI. But when machines help humans to monitor other machines, at some point the distinction between wet-brain and cybernetic users gets blurry.
And I’ll pause there, continuing the discussion in a general post about the role and future of analytic RDBMS.
Related link:
- Integrating short-request and analytic processing (March 2011)