Analytic DBMS and other analytic platform technologies are much faster than they used to be, both in absolute terms and in price/performance. So the question naturally arises, “When is the performance enough?” My answer, to a first approximation, is “Never.” Obviously, your budget limits what you can spend on analytics, and at some point the benefit of incremental expenditure grows quite small. But if analytic processing capabilities were infinite and free, we’d do a lot more with analytics than anybody would consider today.
I have two lines of argument supporting this view. One is application-oriented. Machine-generated data will keep growing rapidly, and using that data will require ever more processing resources. Analytic growth, rah-rah-rah; company valuation, sis-boom-bah. Application areas include but are not at all limited to marketing, law enforcement, investing, logistics, resource extraction, health care, and science.
The other approach is to point out some computational areas where vastly more analytic processing resources could be used than are available today. Consider, if you will, statistical modeling, graph analytics, optimization, and stochastic planning.
Statistical practice, in many cases, still goes something like this:
- A data set has, for example, a thousand columns.
- Statisticians carefully choose a few dozen columns to model on.
- They then also decide how to modify data in some of the columns for better modeling (binning, filling in nulls, whatever).
- A linear regression ensues.
That all makes sense. Sometimes using fewer variables gives better results than using more of them (because of over-fitting), and you have to pick: You can’t realistically try all 2^1000-1 variable combinations; if you allowed quadratic terms too, you’d approach 2^500,000 combinations; and the possibilities expand from there. But suppose you actually did have unlimited computational resources? Then, if nothing else, you could do a whole lot of regressions, followed by some kind of meta-analysis on the results. That’s so beyond the realm of computational reality I doubt the mathematics of same has even been carefully worked out — which is exactly my point.
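To make the “whole lot of regressions” idea concrete, here is a minimal sketch — with made-up toy data, not anything from a real statistics shop — of exhaustive best-subset regression on a data set small enough that brute force is still feasible. Even at 12 candidate columns there are only 220 three-column subsets; at 1,000 columns the same loop is hopeless, which is the point.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)

# Toy data set: 200 rows, 12 candidate columns, of which only 3 matter.
n_rows, n_cols = 200, 12
X = rng.normal(size=(n_rows, n_cols))
y = (2.0 * X[:, 0] - 1.5 * X[:, 5] + 0.5 * X[:, 9]
     + rng.normal(scale=0.1, size=n_rows))

def fit_r2(cols):
    """Ordinary least squares on a column subset; return R-squared."""
    A = np.column_stack([X[:, list(cols)], np.ones(n_rows)])  # add intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid.var() / y.var()

# Brute-force every 3-column subset: C(12, 3) = 220 regressions.
# At 1,000 columns this would be ~166 million -- and that's before
# considering subsets of other sizes, let alone quadratic terms.
best_cols, best_r2 = None, -np.inf
for cols in combinations(range(n_cols), 3):
    r2 = fit_r2(cols)
    if r2 > best_r2:
        best_cols, best_r2 = cols, r2

print(best_cols, round(best_r2, 3))
```

Brute force recovers the three true columns here only because the problem is tiny; the meta-analysis step the post imagines (comparing results across millions of such fits) is exactly what today’s resources rule out.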
Graph analytics, to a first approximation, takes O(N × E^H) steps, where N is the number of nodes, H is the number of hops you want to go out, and E is the average number of edges per node. And that’s only for the simple stuff, which might produce inputs into further analytic steps. Such numbers get forbiddingly big, really fast.
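The E^H blow-up is easy to see in a few lines. Below is a sketch on an invented random graph (the node count and fan-out are arbitrary): count edge traversals in a breadth-first expansion out H hops and watch the work multiply with each hop until the graph saturates.

```python
import random

random.seed(1)

# Toy graph: N nodes, about E outgoing edges per node.
N, E = 1000, 5
adj = {u: random.sample(range(N), E) for u in range(N)}

def h_hop_work(start, H):
    """Count edge traversals in a breadth-first expansion out H hops."""
    frontier, seen, work = {start}, {start}, 0
    for _ in range(H):
        nxt = set()
        for u in frontier:
            for v in adj[u]:
                work += 1          # one edge traversal
                if v not in seen:
                    seen.add(v)
                    nxt.add(v)
        frontier = nxt
    return work

for H in (1, 2, 3):
    print(H, h_hop_work(0, H))    # work grows roughly by a factor of E per hop
```

Per node, per hop, the work multiplies by roughly E; run this from all N nodes and you have the N × E^H estimate.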
Last year I talked with an LTL (Less than TruckLoad) shipping company. They had to decide which freight to put into which trucks, and then where to send the trucks. I thought for a moment, and said “In other words, the traveling salesman has to decide how to pack his knapsack?”* Optimization is computationally hard.
*Wikipedia has a wonderful list of NP-complete problems.
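To illustrate why optimization is computationally hard, here is a brute-force 0/1 knapsack — the “how to pack” half of the joke — on ten hypothetical freight items (the weights and values are invented). Ten items means 2^10 = 1,024 subsets, which is trivial; a real LTL network has enough freight that this exponential enumeration is out of the question, which is why such problems land on Wikipedia’s NP-complete list.

```python
from itertools import combinations

# Hypothetical freight items as (weight, value) pairs, and a truck capacity.
items = [(12, 4), (1, 2), (2, 2), (1, 1), (4, 10),
         (3, 2), (7, 9), (5, 6), (9, 8), (6, 5)]
capacity = 15

# Try every subset of items -- 2^10 of them -- and keep the best that fits.
best_value, best_subset = 0, ()
for r in range(len(items) + 1):
    for subset in combinations(range(len(items)), r):
        weight = sum(items[i][0] for i in subset)
        if weight <= capacity:
            value = sum(items[i][1] for i in subset)
            if value > best_value:
                best_value, best_subset = value, subset

print(best_value, best_subset)
```

Dynamic programming and heuristics tame particular knapsack instances, but combine packing with routing — the actual LTL problem — and the exact methods stop scaling.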
I was a stock analyst at the dawn of electronic spreadsheets. I actually did my first training spreadsheet using a calculator and green paper, and had one older colleague who still used a slide rule. The wonderful thing about making projections via electronic spreadsheets was that you could vary assumptions, then have the conclusions automatically recalculated. Bliss! (At least when compared to the alternatives.)
After a couple of months on the job, I circulated a memo about how one SHOULD do projections. I was told they almost fired me. And they had a point, because the computational power needed was ridiculous. The first part of the idea was to pull in every variable that seemed to make sense, and postulate (or test if possible) relationships among them. The second was to look at the outcomes under many different values of the independent variables.
In other words, I was advocating Monte Carlo stochastic planning. Well, guess what! Monte Carlo analysis is getting more widely productized, due to its usefulness in Basel III risk analysis. Traditional business planning still stinks. The time has come, or soon will, for traditional business planning to invoke Monte Carlo methods.
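The idea can be sketched in a few lines. This is a minimal toy, with assumed distributions and made-up numbers rather than any real company’s plan: instead of one spreadsheet projection from point estimates, draw the uncertain assumptions many times and look at the distribution of outcomes.

```python
import random

random.seed(42)

def one_projection():
    """One draw of a three-year profit projection from uncertain inputs."""
    growth = random.gauss(0.08, 0.05)    # annual revenue growth: mean 8%, sd 5%
    margin = random.gauss(0.15, 0.04)    # operating margin: mean 15%, sd 4%
    revenue = 100.0 * (1 + growth) ** 3  # project forward from $100M revenue
    return revenue * margin              # projected operating profit ($M)

# Monte Carlo: rerun the "spreadsheet" 100,000 times.
profits = sorted(one_projection() for _ in range(100_000))

mean = sum(profits) / len(profits)
p5, p95 = profits[5_000], profits[95_000]
print(f"mean {mean:.1f}, 5th pct {p5:.1f}, 95th pct {p95:.1f}")
```

The single-point spreadsheet answer corresponds roughly to the mean; the 5th-to-95th-percentile band is the part traditional planning throws away.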
Bottom line: The analytic need for speed will remain with us through the foreseeable future — and I didn’t even need to do a probabilistic analysis to figure that out.