August 26, 2008

Why MapReduce matters to SQL data warehousing

Greenplum and Aster Data have both just announced the integration of MapReduce into their SQL MPP data warehouse products. So why do I think this could be a big deal? The short answer is “Because MapReduce offers dramatic performance gains in analytic application areas that still badly need to be sped up.” The long answer goes something like this.

The core ideas of MapReduce are:

For large problems, parallel computing is much more cost effective and/or feasible than the alternatives.
If you shoehorn programs into a certain very simple framework, namely one in which you’re limited to map and reduce steps, then building a general execution engine that gives parallelism “for free” is straightforward.
A lot more problems can be solved within that framework than one might at first expect.
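To make the “for free” idea concrete, here is a minimal Python sketch of such an execution engine. It is only an illustration, not how Greenplum, Aster Data, Google, or Hadoop implement anything: the names run_mapreduce, sale_to_pair, and total are hypothetical, the sales data is made up, and a local thread pool stands in for a cluster of machines. The point is that the programmer supplies only a map function and a reduce function, while the engine decides how the work is spread out.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from itertools import chain


def run_mapreduce(records, mapper, reducer, workers=4):
    """Apply mapper to every record in parallel, group by key, then reduce."""
    # Map phase: each record is processed independently of the others, so
    # the engine (not the programmer) decides how to spread the work.
    # A thread pool is an assumption standing in for many machines.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(mapper, records))

    # Shuffle phase: collect every value emitted under the same key.
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapped):
        groups[key].append(value)

    # Reduce phase: each key's values are combined independently.
    return {key: reducer(key, values) for key, values in groups.items()}


def sale_to_pair(sale):
    """Map step: emit one (region, amount) pair for one sale record."""
    region, amount = sale
    return [(region, amount)]


def total(key, values):
    """Reduce step: a plain aggregation over one key's values."""
    return sum(values)


# Hypothetical sample data: (region, amount) sale records.
sales = [("east", 10), ("west", 5), ("east", 7), ("west", 1), ("north", 3)]
totals = run_mapreduce(sales, sale_to_pair, total)
print(totals)
```

Neither sale_to_pair nor total knows anything about threads or machines, which is what lets the same pair of functions run unchanged on a bigger engine.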

In essence, you can do almost anything to a single record*; that’s a map step. But you are sharply limited in how you combine information about multiple (often intermediate) records; that’s a reduce step. Still, reduce steps let you do counts, sums, or other aggregations. That, plus the general power of map steps, makes MapReduce useful for at least three major classes of applications:

Text tokenization, indexing, and search
Creation of other kinds of data structures (e.g., graphs)
Data mining and machine learning

Except for the building of entire search engines, these are all application areas that data warehouse users should and do care about. And they all could still benefit from large performance increases, as evidenced by the routine compromises analysts make in areas such as data reduction, sampling, over-simplified models, and the like.

*Technically, MapReduce doesn’t allow for records. Instead, you process key-value pairs and lists of same. But so far as I can tell, that’s a distinction without a difference. LISP long ago proved that lists are a very general construct indeed.

MapReduce can be superior to pure SQL for these application areas, because they involve creation of data structures that are awkward to fit into a SQL rows-and-tables paradigm. Inverted-list text indexes just aren’t tables. Formally, graphs can always be fit into tables; but even so, if you want to follow a graph for numerous hops, relational structures can be problematic. Data mining can involve very high-dimensional problems with super-sparse tables. And while exhaustive text extraction into flat tables works OK, getting from there to common-sense semantic hierarchies can be a bit of a kludge.
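As a rough illustration of the inverted-list point, here is a small Python sketch that builds a text index with one map step and one reduce step. The function names (map_document, reduce_postings, build_inverted_index) and the three sample documents are my own assumptions, and the whole thing runs in a single process. Note that the result maps each term to a variable-length list of document ids, which is exactly the kind of structure that doesn’t sit naturally in fixed-width rows.

```python
from collections import defaultdict


def map_document(doc_id, text):
    """Map step: emit one (term, doc_id) pair per distinct word in a document."""
    return [(term, doc_id) for term in set(text.lower().split())]


def reduce_postings(term, doc_ids):
    """Reduce step: combine one term's document ids into a sorted posting list."""
    return sorted(set(doc_ids))


def build_inverted_index(documents):
    """Run the map step per document, group by term, then reduce per term."""
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for term, emitted_id in map_document(doc_id, text):
            groups[term].append(emitted_id)
    return {term: reduce_postings(term, ids) for term, ids in groups.items()}


# Hypothetical sample documents, keyed by document id.
documents = {
    1: "MapReduce meets SQL",
    2: "SQL data warehousing",
    3: "MapReduce for text indexing",
}
index = build_inverted_index(documents)
print(index["sql"])
print(index["mapreduce"])
```

Tokenizing on whitespace is a deliberate simplification; a real indexer would also deal with punctuation, stemming, and word positions.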

Categories: Analytic technologies, Data warehousing, MapReduce, Parallelization

Comments

24 Responses to “Why MapReduce matters to SQL data warehousing”

Steve Wooledge on August 26th, 2008 2:55 am

Curt,

We’ve seen that MapReduce is of immense use in transformations (the T step of ELT processing) and in data preparation (before export of data) as well.

—
Steve Wooledge
Aster Data Systems
http://www.asterdata.com
Three approaches to parallelizing data transformation | DBMS2 -- DataBase Management System Services on August 26th, 2008 4:03 pm

[…] third approach is my Subject Of The Week: MapReduce. When I posted a list of canonical MapReduce applications, my friends at Aster Data offered one […]
Jim Peters on August 26th, 2008 5:31 pm

Does anyone know whether Aster and/or Greenplum signed any kind of license with Google in order to get access to MapReduce technology, or to get permission to use the term?
Curt Monash on August 26th, 2008 5:59 pm

I’ve never asked. I can’t think of any reason why they would have had to.

CAM
Matt Weight on August 27th, 2008 1:54 am

You don’t need permission to implement Map/Reduce. Google published several papers on the subject, including a famous white paper several years ago, which was the launching point for the Map/Reduce implementation that my company developed in-house.
MapReduce links | DBMS2 -- DataBase Management System Services on August 27th, 2008 5:15 am

[…] The integration of MapReduce with SQL data warehousing […]
Known applications of MapReduce | DBMS2 -- DataBase Management System Services on August 27th, 2008 5:22 am

[…] The integration of MapReduce with SQL data warehousing […]
MapReduce sound bites | DBMS2 -- DataBase Management System Services on August 27th, 2008 5:23 am

[…] The integration of MapReduce with SQL data warehousing […]
Databases leverage MapReduce technology to radically juice data scale, performance, analytics | Dana Gardner’s BriefingsDirect | ZDNet.com on August 27th, 2008 12:18 pm

[…] Monash, president of Monash Research, editor of DBMS2, and a leading authority on MapReduce, sees this as a major leap forward. He reports that both companies had completed adding MapReduce to their existing products and had […]
Luke Lonergan on August 28th, 2008 1:09 pm

There is a coding tutorial available at this link in the middle of the page: http://www.greenplum.com/resources/mapreduce/

Key things to note about Greenplum’s MR implementation:
– It’s very similar in form and expression to Google and Hadoop
– Extensions for Joins and Pipelined task execution
– Native parallel file access
– Parallelism is full and transparent to the programmer

In summary: we have implemented MapReduce in a form within which you can write SQL, Perl, Python and many more languages. It is straightforward to take MR programs written for Hadoop or Google and port them to Greenplum.
Luke Lonergan on August 29th, 2008 6:09 am

On the topic of licensing:

Licensing is not required for MapReduce as it is a work derived from many sources of publicly shared know-how. It dates back to the original Lisp operators Map and Reduce.

The Wikipedia page is pretty complete here:
http://en.wikipedia.org/wiki/MapReduce

Greenplum’s MapReduce support is designed to provide a superset of the semantic content of open source Hadoop and Google’s implementations, making it straightforward to port from those environments to Greenplum’s data analysis and management engine.
Steve Wooledge on August 29th, 2008 6:25 pm

Just a couple points on Aster’s implementation of MapReduce:
+ Developers can use Java, Python, C, Perl, and more to create SQL/MR functions which are then easily used by BI tools or business analysts as common SQL statements
+ Aster’s In-Database MapReduce framework is a superset of MapReduce
+ Aster has a process management framework to guarantee transparency and availability

More in our whitepaper here:
http://www.asterdata.com/product/whitepaper_mapreduce.html
Anonymous on August 31st, 2008 2:53 am

It seems to me that Hadoop and MapReduce in general need to avoid being bogged down by dealing with a database. It’s about accessing files in parallel without all the garbage that a database puts people through.

I don’t want to have to write an application with a SQL driver and write SQL to use MapReduce. I think that’s kind of the whole point.

I haven’t looked at the other DB vendors of MapReduce, but when I look at the Aster Data examples it looks like a database trying to do MapReduce using UDFs, which kind of misses the whole point for me.
Curt Monash on August 31st, 2008 6:40 am

Greenplum and Aster have somewhat different approaches to SQL/MapReduce integration. I want to look into them both further before trying to write about the respective syntaxes.

That said, a little SQL wrapper never hurt anybody.

CAM
Yes, but what are the Very Biggest benefits of MapReduce? | DBMS2 -- DataBase Management System Services on September 1st, 2008 5:08 am

[…] benefits, features, etc. to various constituencies (business users, programmers, DBAs, etc.) of the Greenplum and Aster Data MapReduce announcements. Questions like that are hard to answer simply. Here’s […]
Mike Stonebraker’s counterarguments to MapReduce’s popularity | DBMS2 -- DataBase Management System Services on September 4th, 2008 7:39 pm

[…] line: Mike Stonebraker more than disagrees with the claim that MapReduce is a valuable addition to SQL data warehousing, on somewhat different grounds than he emphasized in the Great MapReduce Debate last January. […]
Blog user interfaces | Text Technologies on September 11th, 2008 5:30 am

[…] of the sidebars. And I link to other of my posts whenever it seems to make sense, as in my posts on MapReduce and database […]
Infology.Ru » Blog Archive » Three approaches to parallelizing the data transformation process on September 29th, 2008 5:23 pm

[…] approach is the “Subject Of The Week”: MapReduce. When I published a list of canonical applications […]
Infology.Ru » Blog Archive » Why is MapReduce so important for data warehouses? on October 5th, 2008 2:58 am

[…] Author: Curt Monash Original publication date: 2008-08-26 Translation: Oleg Kuzmenko Source: Curt Monash’s blog […]
CouchDB/0 on November 6th, 2008 5:33 pm

[…] As the charming picture on the CouchDB site shows, the innards can be divided into the “views” engine, storage, and replication. When writing about views, it should be mentioned that our knowledge of SQL-92 and its derivatives will be of no use. There are no classic queries; the MapReduce method is used instead. We write the map and reduce functions in JavaScript, but in fact support for functions in any language can be added (more on that in a moment). A lot can be written about MapReduce itself, so I won’t expand on the topic right now. Those interested are referred to Wikipedia and, for example, this post. […]
Aster Data in the cloud | DBMS2 -- DataBase Management System Services on February 10th, 2009 4:49 pm

[…] and wish Aster wouldn’t tie its marketing identity so closely to the admittedly cool supports-MapReduce feature. That said, I do think Aster’s nPath story is pretty interesting, and I plan to blog about […]
MapReduce: An exclusive Software Framework for Distributed Systems by Google « Hashim’s Blog on May 7th, 2009 5:54 pm

[…] http://en.wikipedia.org/wiki/Map_Reduce • DMBS2, Why MapReduce matters to SQL data warehousing http://www.dbms2.com/2008/08/26/why-mapreduce-matters-to-sql-data-warehousing/ • The Database Column, MapReduce: A major step backwards, By David DeWitt on January 17, 2008 […]
NoSQL? | DBMS2 -- DataBase Management System Services on July 1st, 2009 3:33 am

[…] MapReduce is an exception, in that it’s designed for analytics. MapReduce may be useful for enterprises. But where it is, it probably should be integrated into an analytic DBMS. […]
MapReduce replacing complex SQL queries - Enterprise IT Consultant Views on Technologies and Trends on October 11th, 2010 3:01 am

[…] Considering that MapReduce excels in aggregation and computation, data warehousing and business intelligence are the first to adopt MapReduce. A very interesting article on how MapReduce is relevant to Data Warehousing products is available at http://www.dbms2.com/2008/08/26/why-mapreduce-matters-to-sql-data-warehousing/. […]
