January 19, 2008

MapReduce for data mining? Maybe for variable-schema analytics.

Rich Skrenta is quite a successful entrepreneur, so it’s likely that he doesn’t really mean the more ridiculous parts of this rant on the MapReduce debate. E.g., he cheerfully disregards the fact that the data warehouse appliance vendors have ALREADY disrupted the market he’s focusing on. Index-light row-based and columnar systems are both super fast at data mining extracts.

But let’s go straight to the one interesting thing he said, namely:

… wouldn’t be surprised if the adoption curve, even for conservative Fortune-500 companies, was quicker than we’ve seen in the past. Bolt a map/reduce cluster onto the side of your data warehouse and mine those CRM records for business insights. Sounds like a startup idea we’ll be seeing soon enough. ;-)

For that to make sense, MapReduce would have to be used for more things than SAS can efficiently deliver via its own MPP/grid strategy. That’s not going to happen if you just take numeric data out of a neat tabular structure, do a couple of processing passes, and reach conclusions.

And by the way, Google Analytics runs on MapReduce, and it’s a reliability disaster, as of July or earlier in July (although that may have been a front-end problem) or most especially May. Similarly, FeedBurner has been quite fried the past few days, showing wild stats swings and currently claiming on some (but not all) screens that my blogs have no views or hits whatsoever.

Rather than disrupting the already disrupted market for conventional data mining underpinnings, something like MapReduce could be the underpinnings for variable-schema analytics. There are a ton of reasons for having very different kinds of information about different subjects of analysis. For example:

The main analytic techniques for dealing with such situations today basically boil down to:

  1. Graceful ways of encoding “Oops, we have no information about that”.
  2. Making up minimally-misleading fake data.

If and when new ones are developed, there’s no reason to assume they’ll run optimally over relational data warehouse DBMS.
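To make the two techniques listed above concrete, here is a minimal Python sketch. The records, the field names (`age`, `income`), and the choice of mean imputation are all illustrative assumptions, not anything taken from a particular product or from the discussion above.

```python
from statistics import mean

# Hypothetical variable-schema records: each subject carries different fields.
records = [
    {"id": 1, "age": 34, "income": 52000},
    {"id": 2, "age": 41},        # income unknown
    {"id": 3, "income": 61000},  # age unknown
]

# Assumed set of fields the analysis cares about.
FIELDS = ["age", "income"]

def encode_missing(record, fields=FIELDS):
    """Technique 1: keep the gap explicit by encoding it as None."""
    return {f: record.get(f) for f in fields}

def impute_mean(rows, fields=FIELDS):
    """Technique 2: fill each gap with the mean of the observed values.

    Assumes every field has at least one observed value; otherwise
    statistics.mean raises StatisticsError.
    """
    means = {f: mean(r[f] for r in rows if r[f] is not None) for f in fields}
    return [
        {f: (r[f] if r[f] is not None else means[f]) for f in fields}
        for r in rows
    ]

encoded = [encode_missing(r) for r in records]
imputed = impute_mean(encoded)
```

Mean imputation is only one way of making up fake data; whether it is "minimally misleading" depends entirely on the analysis being run.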

Comments

2 Responses to “MapReduce for data mining? Maybe for variable-schema analytics.”

  1. Stuart Frost on January 21st, 2008 8:29 pm

    This is all interesting on a number of fronts.

    First of all, the critique of MapReduce by DeWitt and Stonebraker is breathtakingly arrogant. MapReduce was clearly not designed to solve the same problems as an RDBMS, so it’s strange to criticize it for not having the same functionality. As for the comment that MapReduce will be difficult to scale – well, it’s hard to argue with 20PB per day!

    Google’s benchmarks are also pretty revealing. Using 1,800 servers to grep through 1TB of data in 2.5 mins is incredibly inefficient. Using user-defined functions (UDFs) in one of our appliances, I estimate that we’d get through the same amount of work on fewer than 16 nodes – maybe as few as eight, given the likelihood of higher than normal compression ratios. Not sure about how fast the sort would run on our appliances, but over 800s to sort 1TB on 1,800 servers also seems very, very slow – as do the I/O rates shown on the charts.

    Seems like they are just throwing an awful lot of hardware at the problem – don’t tell Al Gore!

    Stuart Frost
    CEO, DATAllegro

  2. More Google reliability woes | DBMS2 -- DataBase Management System Services on August 25th, 2008 3:50 am

    […] reliability issues are ever worse. As I previously pointed out, this is evidence against the notion that MapReduce is a replacement for established DBMS. […]
