Comments on: Three big myths about MapReduce

By: Clearing up MapReduce confusion, yet again | DBMS2 -- DataBase Management System Services

Wed, 30 Dec 2009 10:51:31 +0000

[…] frustrated by a constant need — or at least urge — to correct myths and errors about MapReduce. Let’s try one more […]

By: Cubegeek

Cubegeek — Mon, 30 Nov 2009 14:12:23 +0000

This conversation makes me wonder if anyone has plans to extend MDX to include MR functions or context. After all, this was the language designed to handle multidimensional data as a standard.

By: Analytics Team » Blog Archive » Myths about MapReduce

Analytics Team » Blog Archive » Myths about MapReduce — Sat, 07 Nov 2009 22:04:29 +0000

[…] DBMS2 takes a look at these three myths about mapreduce… * MapReduce is something very new * MapReduce involves strict adherence to the Map-Reduce programming paradigm * MapReduce is a single technology […]

By: …und das Leben nach SQL geht weiter…jetzt wird reduziert! | PHP hates me - Der PHP Blog

Thu, 22 Oct 2009 06:15:21 +0000

[…] Critical voices on MapReduce: Three big myths about MapReduce, DBMS2, October 18, 2009 […]

By: Amrith Kumar

Amrith Kumar — Tue, 20 Oct 2009 22:17:37 +0000

Steve Wooledge,

I am flattered that you confused me for David DeWitt and Stonebraker 🙂

They are the ones who are quoted as saying MapReduce wasn’t something new. MapReduce is a creation of Ghemawat and Dean.

All I’m saying is that recent claims by many that they’ve been “doing MapReduce all along” are not entirely true (and not entirely false either).

I’m not equating MR with the MPP redistribution framework, hence my comment that reads “… the simple answer is this: they have been doing something VERY MUCH LIKE MapReduce all along”.

Thanks,

-amrith

By: Steve Wooledge

Steve Wooledge — Tue, 20 Oct 2009 18:34:10 +0000

@Jerome: Our customers write SQL-MR functions to do computations on data that would have been extremely complicated, error-prone or slow-performing if done using only SQL. Therein lies a key motivator – customers consciously choose SQL-MR for convenience as opposed to being transparently locked-in.

Our SQL-MR syntax goes a long way in ensuring that the relational model is preserved. For example, MR functions consume and produce relations; MR invocations are modeled as stored-procedure invocations. This means that a customer can migrate to non-Aster nCluster installations with an effort similar to migrating their user-defined functions from one platform to another.
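To make the "consume and produce relations" idea concrete, here is a minimal Python sketch of a relation-in, relation-out function. It is not Aster's SQL-MR API; the function name `sessionize`, the column names `user`, `ts` and `session_id`, and the 1800-second timeout are all assumptions chosen for illustration.

```python
from itertools import groupby
from operator import itemgetter


def sessionize(rows, timeout=1800):
    """Consume a relation of {user, ts} rows and produce the same rows
    with an extra session_id column (hypothetical example function)."""
    # Order by user, then by timestamp, so each user's clicks are contiguous.
    ordered = sorted(rows, key=itemgetter("user", "ts"))
    for _user, clicks in groupby(ordered, key=itemgetter("user")):
        session_id, last_ts = 0, None
        for row in clicks:
            # A gap longer than the timeout starts a new session.
            if last_ts is not None and row["ts"] - last_ts > timeout:
                session_id += 1
            last_ts = row["ts"]
            yield {**row, "session_id": session_id}


clicks = [
    {"user": "a", "ts": 0},
    {"user": "a", "ts": 600},
    {"user": "a", "ts": 5000},
    {"user": "b", "ts": 100},
]
sessions = list(sessionize(clicks))
```

Because the input and the output are both plain collections of rows, a function shaped like this can sit inside a larger query on any platform that can pass rows in and read rows back.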

The best part of our SQL-MR framework is that the implementation of the MR functions is in open languages chosen by the customer (e.g., Java, Perl, Python, C++, C#, Ruby). This means that the actual code is not proprietary to Aster nCluster. The code snippets/sub-functions can be re-used on other platforms as well. In addition, the MapReduce programming model is widely popular, which helps portability: one first designs the computation as Map and Reduce steps, and only secondarily expresses it in SQL-MR, to the extent that one uses features unique to our platform.

The libraries of Aster SQL-MR functions that we provide are, of course, proprietary. They have innovations in data structure and processing that ensure high performance of the compute function. We’ve published the source code of some of these functions; for others, we’ve published the algorithms but not the source code; for the rest, we may publish neither the code nor the algorithm. In fact, we give our partners who write SQL-MR functions complete discretion over whether to publish their functions or provide only binaries to protect their IP.

The important point to note here is that we are committed to providing an open platform in which one function is not forced upon the end-user.

===

@Amrith: Whenever an innovative system becomes mainstream, there are always claims that the innovation is nothing new! We went through this in the 1990s when Java appeared on the scene as well.

We cannot equate the MapReduce programming framework with the internal tuple redistribution mechanism of MPP databases. The MapReduce programming framework is innovative because it provides a way of attaining parallelism for arbitrary computations. The internal MPP DB tuple redistribution mechanisms operated on one tuple at a time, with a static hash function and a statically pre-defined number of partitions. The mechanism could not be re-used by users or database applications – in fact, it could not be re-used even by stored procedures that were part of the MPP DB framework.

If you are interested, please look at the Related Work section of our VLDB 2009 conference paper. http://www.asterdata.com/resources/downloads/whitepapers/sqlmr.pdf

By: John Mount

John Mount — Mon, 19 Oct 2009 17:13:59 +0000

A good point, but while Map Reduce is not new I feel it emphasized clarity and simplicity (at least for the problem of sorting), so that is probably why it markets easier than MPI or a database. I wrote a bit on this point some time ago: http://www.win-vector.com/blog/2009/01/map-reduce-a-good-idea/

By: uberVU - social comments

uberVU - social comments — Mon, 19 Oct 2009 11:06:35 +0000

Social comments and analytics for this post…

This post was mentioned on Twitter by jameskobielus: Read @CurtMonash on MapReduce (http://bit.ly/2pdJ1W). None of this brand new. Nor is it true standard. Vendor implementations vary widely….

By: Amrith Kumar

Amrith Kumar — Mon, 19 Oct 2009 00:38:08 +0000

Jerome’s point is dead on; SQL/MR is analogous to a vendor’s custom SQL extensions. That is one of my concerns about these MR extensions; they introduce vendor lock-in.

And as for the recent claim by many that they’ve been “doing MapReduce all along”, the simple answer is this: they have been doing something VERY MUCH LIKE MapReduce all along.

MPP databases horizontally partition the data and process partitions on distinct nodes. MapReduce does not perform the partitioning a priori; it does it at runtime. The MPP implementations that I am familiar with always partition persistent data (tables) a priori, with provisions to redistribute the data as part of the query processing mechanism.
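The difference between a priori partitioning and query-time redistribution can be sketched in a few lines of Python. This is a toy model under stated assumptions, not any vendor's implementation: the helper names (`node_for`, `load_table`, `redistribute`, `sum_by`), the `sales` rows, and the choice of CRC32 as the hash are all hypothetical.

```python
import zlib
from collections import defaultdict


def node_for(value, num_nodes):
    """Map a column value to a node with a stable hash (CRC32 is an assumption)."""
    return zlib.crc32(str(value).encode("utf-8")) % num_nodes


def load_table(rows, dist_key, num_nodes):
    """A priori partitioning: rows are placed on nodes when the table is loaded."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[node_for(row[dist_key], num_nodes)].append(row)
    return nodes


def redistribute(nodes, new_key):
    """Query-time redistribution: reshuffle the stored rows by another column."""
    shuffled = [[] for _ in nodes]
    for node_rows in nodes:
        for row in node_rows:
            shuffled[node_for(row[new_key], len(nodes))].append(row)
    return shuffled


def sum_by(nodes, group_key, value_key):
    """Each node aggregates its own rows. Merging with update() is only
    correct when every group lives on exactly one node."""
    totals = {}
    for node_rows in nodes:
        local = defaultdict(int)
        for row in node_rows:
            local[row[group_key]] += row[value_key]
        totals.update(local)
    return totals


sales = [
    {"customer": "c1", "product": "apple", "qty": 2},
    {"customer": "c2", "product": "apple", "qty": 3},
    {"customer": "c1", "product": "pear", "qty": 1},
    {"customer": "c3", "product": "pear", "qty": 4},
]
nodes = load_table(sales, "customer", 2)            # fixed at load time
by_customer = sum_by(nodes, "customer", "qty")      # no data movement needed
by_product = sum_by(redistribute(nodes, "product"), "product", "qty")
```

Grouping by the distribution column needs no data movement, while grouping by any other column forces a reshuffle first. In this toy model that reshuffle is the step that resembles the MapReduce shuffle.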

Dean & Ghemawat write, “The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The input splits can be processed in parallel by different machines. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions (R) and the partitioning function are specified by the user.”
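The scheme in the quoted passage can be sketched as a single-process Python simulation. This is an illustration only, not Google's or Hadoop's implementation: the names `split_input`, `default_partition` and `run_mapreduce` are hypothetical, the word-count job is an assumed example, and CRC32 stands in for `hash(key)` so that results are stable across runs.

```python
import zlib
from collections import defaultdict


def split_input(records, m):
    """Divide the input into M splits (round-robin, for simplicity)."""
    return [records[i::m] for i in range(m)]


def default_partition(key, r):
    """hash(key) mod R, with CRC32 as the stable hash (an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % r


def map_words(line):
    """Map: emit (word, 1) for every word in a line."""
    for word in line.split():
        yield word.lower(), 1


def reduce_counts(key, values):
    """Reduce: sum all counts collected for one key."""
    return key, sum(values)


def run_mapreduce(records, map_fn, reduce_fn, m=3, r=2,
                  partition_fn=default_partition):
    # Map phase: each split could run on a different machine.
    partitions = [defaultdict(list) for _ in range(r)]
    for split in split_input(records, m):
        for record in split:
            for key, value in map_fn(record):
                # Both R and partition_fn are chosen by the caller.
                partitions[partition_fn(key, r)][key].append(value)
    # Reduce phase: each of the R partitions could run on a different machine.
    results = {}
    for part in partitions:
        for key in sorted(part):
            out_key, out_value = reduce_fn(key, part[key])
            results[out_key] = out_value
    return results


lines = ["the quick brown fox", "the lazy dog", "The fox"]
counts = run_mapreduce(lines, map_words, reduce_counts)
```

Changing `m`, `r`, or `partition_fn` changes how the work is spread out but not the final counts, which is the property that lets the user pick them freely.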

Each MPP implementation has a different name for the mechanism that performs this splitting. In effect, therefore, MapReduce is another mechanism for MPP’izing the solution to a problem, and there is some merit to the claim being made by MPP database vendors that they’ve been doing MapReduce all along.

By: Curt Monash

Curt Monash — Sun, 18 Oct 2009 17:01:22 +0000

Jerome,

The Aster syntax is Aster-specific, just as if you used any other vendor’s proprietary SQL extensions.

CAM