January 19, 2008

MapReduce for data mining? Maybe for variable-schema analytics.

Rich Skrenta is quite a successful entrepreneur, so it’s likely that he doesn’t really mean the more ridiculous parts of this rant on the MapReduce debate. E.g., he cheerfully disregards the fact that the data warehouse appliance vendors have ALREADY disrupted the market he’s focusing on. Index-light row-based and columnar systems are both super fast at data mining extracts.

But let’s go straight to the one interesting thing he said, namely:

… wouldn’t be surprised if the adoption curve, even for conservative Fortune-500 companies, was quicker than we’ve seen in the past. Bolt a map/reduce cluster onto the side of your data warehouse and mine those CRM records for business insights. Sounds like a startup idea we’ll be seeing soon enough. 😉

For that to make sense, MapReduce would have to be used for more things than SAS can efficiently deliver via its own MPP/grid strategy. That’s not going to happen if you just take numeric data out of a neat tabular structure, do a couple of processing passes, and reach conclusions.

And by the way, Google Analytics runs on MapReduce, and it’s a reliability disaster, with problems in July, earlier in July (although that may have been a front-end problem), and most especially May. Similarly, FeedBurner has been quite fried the past few days, showing wild stats swings and currently claiming on some (but not all) screens that my blogs have no views or hits whatsoever.

Rather than disrupting the already disrupted market for conventional data mining underpinnings, something like MapReduce could be the underpinnings for variable-schema analytics. There are a ton of reasons for having very different kinds of information about different subjects of analysis. For example:

It may be gathered via different marketing campaigns that collected different kinds of information.
It may come from consumer-information databases around the world that captured different information and have different privacy requirements.
It may come from location-aware devices that some subjects carry and some don’t.
It may pertain to products and experiences (e.g., house purchases, driving records) that some subjects have and some don’t.
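
To make the idea concrete, here is a minimal sketch of a map/reduce-style pass over records that do not share a schema. It is plain Python rather than any real MapReduce framework, and the records, field names, and function names are all made up for illustration. The map step emits a pair for each field a subject actually has, and the reduce step counts how many subjects carry each field.

```python
from collections import defaultdict

# Hypothetical records: each subject carries a different set of fields.
records = [
    {"id": 1, "campaign": "spring_mailer", "zip": "02139"},
    {"id": 2, "gps_lat": 42.36, "gps_lon": -71.09},
    {"id": 3, "campaign": "web_signup", "house_price": 350000},
]


def map_fields(record):
    """Map step: emit (field_name, 1) for every field the record has."""
    for field in record:
        if field != "id":
            yield field, 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts per field name."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def field_coverage(all_records):
    """Run the map step over every record, then reduce the results."""
    pairs = (pair for record in all_records for pair in map_fields(record))
    return reduce_counts(pairs)


coverage = field_coverage(records)
print(coverage)
```

Nothing here needs a table definition up front: a record with a new kind of field simply produces a new key in the output.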

The main analytic techniques for dealing with such situations today basically boil down to:

Graceful ways of encoding “Oops, we have no information about that”.
Making up minimally-misleading fake data.

If and when new ones are developed, there’s no reason to assume they’ll run optimally over relational data warehouse DBMS.
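
For reference, the two current techniques listed above can be sketched in a few lines of Python. This is only an illustration under assumed data: the income values and the function names are hypothetical, and mean substitution stands in for whatever "minimally-misleading" imputation method an analyst would actually choose.

```python
from statistics import mean

# Hypothetical column of incomes; None means "we have no information about that".
incomes = [50000.0, None, 60000.0, None, 40000.0]


def encode_missing(values):
    """Technique 1: keep a placeholder value plus an explicit is-missing flag."""
    return [(0.0 if v is None else v, v is None) for v in values]


def impute_mean(values):
    """Technique 2: fill each gap with the mean of the known values."""
    known = [v for v in values if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in values]


print(encode_missing(incomes))
print(impute_mean(incomes))
```

The first approach preserves the fact that data is absent, at the cost of an extra flag per field. The second produces a dense column that standard tools accept, but the filled-in values are invented.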

Categories: Analytic technologies, MapReduce, Parallelization, SAS Institute

Comments

2 Responses to “MapReduce for data mining? Maybe for variable-schema analytics.”

Stuart Frost on January 21st, 2008 8:29 pm

This is all interesting on a number of fronts.

First of all, the critique of MapReduce by DeWitt and Stonebraker is breathtakingly arrogant. MapReduce was clearly not designed to solve the same problems as an RDBMS, so it’s strange to criticize it for not having the same functionality. As for the comment that MapReduce will be difficult to scale – well, it’s hard to argue with 20PB per day!

Google’s benchmarks are also pretty revealing. Using 1,800 servers to grep through 1TB of data in 2.5 mins is incredibly inefficient. Using user defined functions (UDFs) in one of our appliances, I estimate that we’d get through the same amount of work on less than 16 nodes – maybe as few as eight, given the likelihood of higher than normal compression ratios. Not sure about how fast the sort would run on our appliances, but over 800s to sort 1TB on 1,800 servers also seems very, very slow – as do the I/O rates shown on the charts.

Seems like they are just throwing an awful lot of hardware at the problem – don’t tell Al Gore!

Stuart Frost
CEO, DATAllegro

More Google reliability woes | DBMS2 -- DataBase Management System Services on August 25th, 2008 3:50 am

[…] reliability issues are ever worse. As I previously pointed out, this is evidence against the notion that MapReduce is a replacement for established DBMS. […]
