When enterprises buy new brands of analytic DBMS, they almost always run proofs-of-concept (POCs) in the form of private benchmarks. The results are generally confidential, but that doesn’t keep a few stats from occasionally leaking out. As I noted recently, those leaks are problematic on multiple levels. For one thing, even if the results are taken as accurate and not fundamentally misleading, the way vendors describe them leaves a lot to be desired.
Here’s a concrete example to illustrate the point. One of my vendor clients sent over the stats from a recent POC, in which its data warehousing product was compared against a name-brand incumbent. Sixteen reports were run. The new product beat the old in all 16 cases. The smallest margin was a 1.8X speed-up, while the best was a whopping 335.5X.
My client helpfully took the “simple average” (i.e., the arithmetic mean) of the 16 factors, and described the result as an average 62X drubbing. But is that really fair? The median speed-up was only 17X. And by a figure I find more meaningful than either of those, the total reduction in execution time (assuming each report was run the same number of times) was a factor of “just” 12. Edit: As per the comments below, another option is the geometric mean speed-up, which turns out to be 19X in this case.
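To see how far apart these summary statistics can drift, here is a minimal sketch with hypothetical per-query timings (the POC’s actual per-report numbers weren’t disclosed, so these are made up purely to illustrate the arithmetic):

```python
import statistics
from math import prod

# Hypothetical per-query execution times in seconds -- NOT the POC's actual
# data, just numbers chosen to show how the averages diverge.
old = [100.0, 100.0, 100.0]   # incumbent
new = [50.0, 10.0, 1.0]       # challenger

ratios = [o / n for o, n in zip(old, new)]   # per-query speed-ups: 2X, 10X, 100X

mean_speedup   = statistics.mean(ratios)           # the "simple average"
median_speedup = statistics.median(ratios)
geo_mean       = prod(ratios) ** (1 / len(ratios)) # geometric mean
total_ratio    = sum(old) / sum(new)               # total execution-time reduction

print(f"mean {mean_speedup:.1f}X  median {median_speedup:.1f}X  "
      f"geometric {geo_mean:.1f}X  total-time {total_ratio:.1f}X")
# → mean 37.3X  median 10.0X  geometric 12.6X  total-time 4.9X
```

One outlier query (the 100X case) drags the mean far above every other measure, which is exactly the effect that turned a 12X total-time win into a “62X” claim.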
Now, 12X is a whopping speed-up, and this was a very successful POC for the challenger. But calling it 62X is just silly, and that was the point of my earlier post.
So how should POC numbers be weighted? Ideally, one could calculate a big weighted sum: “Our daily workload will be a lot like 2,000 copies each of Queries 1, 2, 3, 4, and 5; 300 copies each of BigQueries 6 and 7; 25 copies of MegaQuery 7; and a copy of DestructoQuery 8; all multiplied by a factor of 17.”
But to come up with reasonable projections, it is not enough to look at past usage. After all, if the cost of running Query 3 goes down by a factor of 5 while the cost of Query 8 goes down by a factor of 50, the relative consumption of Queries 3 and 8 is apt to change significantly. That’s just the economics of supply and demand.
Bottom line: The more accurately you can predict future data warehouse use, the more confidently you can choose the analytic database technology that’s best for you.