November 19, 2012

Incremental MapReduce

My clients at Cloudant, Couchbase, and 10gen/MongoDB (Edit: See Alex Popescu’s comment below) all boast the feature incremental MapReduce. (And they’re not the only ones.) So I feel like making a quick post about it. For starters, I’ll quote myself about Cloudant:

The essence of Cloudant’s incremental MapReduce seems to be that data is selected only if it’s been updated since the last run. Obviously, this only works for MapReduce algorithms whose eventual output can be run on different subsets of the target data set, then aggregated in a simple way.

These implementations of incremental MapReduce are hacked together by teams vastly smaller than those working on Hadoop, and surely fall short of Hadoop in many areas such as performance, fault-tolerance, and language support. That’s a given. Still, if the jobs are short and simple, those deficiencies may be tolerable.

A StackOverflow thread about MongoDB’s version of incremental MapReduce highlights some of the implementation challenges.

But all practicality aside, let’s return to the point that incremental MapReduce only works for some kinds of MapReduce-based algorithms, and consider how much of a limitation that really is. Looking at the Map steps sheds a little light:

For a sharper focus, let’s look at Reduce. Incremental MapReduce works if and only if Reduce steps can operate successfully on time-sliced data.

So what would be a situation in which Reduce doesn’t work if you split it up via time slices, but does work if you split it up in some other way? Three candidate categories come quickly to mind — graphs, time series, and joins. Let’s consider them in turn.

Bottom line: Incremental MapReduce could serve a variety of purposes — but grafting it onto a DBMS is hardly a full replacement for Hadoop.

Comments

One Response to “Incremental MapReduce”

  1. Alex Popescu on November 19th, 2012 12:30 pm

    Curt,

    As far as I know MongoDB’s MapReduce is not incremental. The thread on SO shows how to define *yourself* an incremental MR. In CouchDB’s case, which is at the origin of both Cloudant and Couchbase, views (the way CouchDB exposes MR) are by themselves incremental in the sense that all results of a MR run at a specific time are persistent and it is only future updates to the docs/additions that are processed when the view is re-requested. Note also that this is still taking place after a request to a view (basically it’s a pull model) and not automatically after an update/add (push model).

    A://

Leave a Reply




Feed: DBMS (database management system), DW (data warehousing), BI (business intelligence), and analytics technology Subscribe to the Monash Research feed via RSS or email:

Login

Search our blogs and white papers

Monash Research blogs

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.