Rich Skrenta is quite a successful entrepreneur, so it’s likely that he doesn’t really mean the more ridiculous parts of this rant on the MapReduce debate. E.g., he cheerfully disregards the fact that the data warehouse appliance vendors have ALREADY disrupted the market he’s focusing on. Index-light row-based and columnar systems are both super fast at data mining extracts.
But let’s go straight to the one interesting thing he said, namely:
… wouldn’t be surprised if the adoption curve, even for conservative Fortune-500 companies, was quicker than we’ve seen in the past. Bolt a map/reduce cluster onto the side of your data warehouse and mine those CRM records for business insights. Sounds like a startup idea we’ll be seeing soon enough.
For that to make sense, MapReduce would have to be used for more things than SAS can efficiently deliver via its own MPP/grid strategy. That’s not going to happen if you just take numeric data out of a neat tabular structure, do a couple of processing passes, and reach conclusions.
And by the way, Google Analytics runs on MapReduce, and it’s been a reliability disaster, as of earlier in July (although that may have been a front-end problem) and most especially in May. Similarly, FeedBurner has been quite fried the past few days, showing wild swings in its stats and currently claiming on some (but not all) screens that my blogs have no views or hits whatsoever.
Rather than disrupting the already-disrupted market for conventional data mining underpinnings, something like MapReduce could instead underpin variable-schema analytics. There are a ton of reasons one might have very different kinds of information about different subjects of analysis. For example (a code sketch follows the list):
- It may be gathered via different marketing campaigns that collected different kinds of information.
- It may come from consumer-information databases around the world that capture different information and are subject to different privacy requirements.
- It may come from location-aware devices that some subjects carry and some don’t.
- It may pertain to products and experiences (e.g., house purchases, driving records) that some subjects have and some don’t.
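To make this concrete, here is a minimal sketch in plain Python — no actual MapReduce framework, and the record layouts, field names, and sample values are all invented for illustration. The point is simply that the map and reduce steps can pass through whatever fields a given subject happens to have, rather than forcing everything into one relational table:

```python
# Hypothetical sketch of a MapReduce-style pass over variable-schema
# records. Each record is a dict; different subjects carry different fields.
from collections import defaultdict

records = [
    {"subject": "alice", "campaign": "spring_mailer", "email_clicks": 3},
    {"subject": "alice", "gps_pings": 41, "region": "EU"},        # location-aware device
    {"subject": "bob", "campaign": "web_signup"},                 # no click data captured
    {"subject": "carol", "house_purchase_price": 310_000},        # a product some subjects lack
]

def map_record(record):
    """Emit (subject, field, value) for whatever fields happen to exist."""
    subject = record["subject"]
    for field, value in record.items():
        if field != "subject":
            yield subject, field, value

def reduce_by_subject(mapped):
    """Collect each subject's fields without forcing a common schema."""
    profiles = defaultdict(dict)
    for subject, field, value in mapped:
        profiles[subject][field] = value
    return dict(profiles)

profiles = reduce_by_subject(kv for r in records for kv in map_record(r))
for subject, fields in profiles.items():
    print(subject, fields)
```

Per-subject dicts tolerate schema drift in a way a fixed-width fact table doesn’t, and that tolerance is the whole appeal here.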
The main analytic techniques for dealing with such situations today basically boil down to two (both sketched in code after the list):
- Graceful ways of encoding “Oops, we have no information about that”.
- Making up minimally misleading fake data.
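For concreteness, here is a tiny Python sketch of both workarounds on invented data, using None as the “we have no information” encoding (the moral equivalent of SQL NULL) and mean imputation as the minimally misleading fake data:

```python
# Hypothetical sketch of the two workarounds, on invented data.
# Technique 1: encode "no information" explicitly (here, None).
# Technique 2: impute a minimally misleading stand-in value,
# such as the mean of the values we did observe.
incomes = [52_000, None, 48_000, None, 61_000]  # None = "we have no information"

observed = [x for x in incomes if x is not None]
mean_income = sum(observed) / len(observed)  # ~53666.67

# Imputation: replace each missing value with the observed mean.
imputed = [x if x is not None else mean_income for x in incomes]
print(imputed)  # missing entries replaced by the observed mean
```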
If and when new techniques are developed, there’s no reason to assume they’ll run optimally over relational data warehouse DBMSs.