Comments on: Cloudera presents the MapReduce bull case

By: tecosystems » When Your Customer is Your Competitor: The Return of Roll Your Own

Tue, 12 Jan 2010 22:08:33 +0000

[…] they needed. From either the technology or cost perspectives. As Cloudera’s Jeff Hammerbacher related to Curt Monash, Hadoop enjoyed advantages over commercial relational alternatives for Facebook, […]

By: Facebook, Hadoop, and Hive | DBMS2 -- DataBase Management System Services

Facebook, Hadoop, and Hive | DBMS2 -- DataBase Management System Services — Mon, 11 May 2009 08:29:10 +0000

[…] Updating the metrics in my Cloudera post, […]

By: thumper

thumper — Fri, 01 May 2009 18:05:29 +0000

I think some folk have gotten the idea …

MR is a platform for large-scale, parallelized computation. We’ve always drawn a distinction between ETL tools and DBMSs, and MR/Hadoop is probably better conceived of as a next-generation ETL platform. Why? Because it has no query capabilities, no indexing, poor tools support, etc.

Which is a fine thing, to be honest. Programming models for parallel computation have always been tough. Looks like MR has hit a nice balance between power and simplicity.

By: Michael McIntire

Michael McIntire — Thu, 30 Apr 2009 18:40:51 +0000

I think the entire discussion boils down to some simple points. M/R is simple to install, and single functions are fast to implement. It has near-zero enterprise functions. MPP DBs are hard to set up, and really start to shine with appropriate re-use of structures and enterprise methodology. Add to MPP DB that the structural foundation provided by declarative SQL enables much more independent third-party BI tools (add 25 years of development time too).

Keep in mind that Map/Reduce is really only a marketing moniker now, and that these systems are really collections of parallel tools. The few implementation plans I’ve seen usually include join operators in the future, for example. It’s growing and getting better, but it is still years away from being as efficient as an MPP DB implementation, at least for my use cases.

It is my assertion that the MPP Database vendors entirely missed the impacts of their licensing schemes on web applications/companies. Do you think Google or Facebook as startups could have afforded Oracle? It would have cost more than the companies made in revenue. Not all applications fit the model of $$ value per transaction; necessity is what has driven the uptake of M/R.

We need open source MPP data management platforms. I am specifically not saying DBMS, because there are large classes of analytics which do not lend themselves to relational technology. The closest thing we have to an efficient MPP programming API is OpenMPI and its brethren, which make an MPP Database look like child’s play, much less something as simple as M/R.

I think these technologies are going to merge; they both bring needed things to the table. Look for practical companies bringing these things together.

By: Confluence: Research and Engineering

Confluence: Research and Engineering — Thu, 30 Apr 2009 17:05:04 +0000

Truviso proof of concept…

Truviso proof of concept A summary Scott sent out a few days ago: Scott Musson to swengineering show details Apr 21 (5 days ago) Reply…

By: eBay’s two enormous data warehouses | DBMS2 -- DataBase Management System Services

Thu, 30 Apr 2009 10:25:59 +0000

[…] Facebook has 2 1/2 petabytes managed by Hadoop, without a DBMS! […]

By: Ben Werther

Ben Werther — Thu, 23 Apr 2009 01:55:20 +0000

Hans,

A key question is whether you need to push the data to the application, or push the application to the data. With huge volumes of data, you obviously want to avoid pushing these around as much as possible.

In the case of Greenplum, users can use SQL, MapReduce or a combination of both, and have this pushed to the data. That is, in MPP database terms, the map step will run locally on each node (with direct access to the data), and the result will be ‘redistributed’ across the interconnect to the reduce steps. There’s no up-front movement of data, and the network movement that does occur is over the high-speed interconnect (i.e. multiple GigE or 10GigE connections per node).
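To make the "map locally, then redistribute to the reduce steps" idea concrete, here is a minimal single-process sketch. It is not Greenplum's implementation or API: the nodes are just Python lists, and `partition`, `map_words` and `run_job` are hypothetical names invented for this illustration. It only counts how many key/value pairs would have to cross the interconnect.

```python
import zlib
from collections import defaultdict

def partition(key, num_nodes):
    """Pick the node that reduces this key (stable hash, so every node agrees)."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes

def map_words(record):
    """Example map step: emit (word, 1) for each word in one local record."""
    for word in record.split():
        yield word.lower(), 1

def run_job(node_data, map_fn, reduce_fn):
    """node_data[i] is the list of records stored on (simulated) node i."""
    num_nodes = len(node_data)
    # inbox[i] collects the (key -> values) groups that node i will reduce.
    inbox = [defaultdict(list) for _ in range(num_nodes)]
    pairs_moved = 0
    for src, records in enumerate(node_data):
        for record in records:                 # map runs where the data lives
            for key, value in map_fn(record):
                dest = partition(key, num_nodes)
                inbox[dest][key].append(value)  # the "redistribute" step
                if dest != src:
                    pairs_moved += 1            # this pair crosses the network
    results = {}
    for groups in inbox:                        # each node reduces its own keys
        for key, values in groups.items():
            results[key] = reduce_fn(values)
    return results, pairs_moved

counts, moved = run_job([["the cat sat"], ["the dog sat"]], map_words, sum)
```

Only the mapped pairs move, never the raw input records, which is the point being made about avoiding up-front data movement.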

That’s the simplest case. It gets really interesting when you start chaining Map-Reduce steps, or incorporating SQL. For example, you could do something like:

1. SELECT from a table in the database (or any arbitrary query)
-> Use this as the input to a MAP
-> Reduce the result
-> Use this as the input to another MAP
-> Reduce the result
2. Read from a large set of files across the filesystem
-> Use this as the input to a MAP
-> Reduce the result
-> Join the result against a table in the database (arbitrary query)
3. Then join the results of 1 and 2 together
-> Use this as the input to a MAP
-> Reduce the result and output to a table or the filesystem
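The shape of that chain can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the tables, the file lines, and the helpers `map_reduce`, `hash_join` and `run_pipeline` are all hypothetical, and unlike the pipelined engine described here, this sketch materializes every intermediate result in memory and runs on one machine.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """One map step followed by one reduce step; returns (key, value) pairs."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return [(key, reduce_fn(values)) for key, values in groups.items()]

def hash_join(left, right):
    """Inner join of two (key, value) lists; returns (key, (left, right))."""
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    return [(k, (lv, rv)) for k, lv in left for rv in index.get(k, ())]

def run_pipeline():
    # Hypothetical inputs standing in for a table, a set of files, and a table.
    orders_table = [("alice", 30), ("bob", 20), ("alice", 5)]
    click_files = ["alice /home", "bob /cart", "alice /cart"]
    users_table = [("alice", "US"), ("bob", "DE")]

    # 1. Query result -> MAP -> reduce -> MAP -> reduce
    totals = map_reduce(orders_table, lambda r: [(r[0], r[1])], sum)
    tiers = map_reduce(
        totals,
        lambda r: [(r[0], "high" if r[1] >= 25 else "low")],
        lambda values: values[0],
    )

    # 2. Files -> MAP -> reduce -> join against a table
    clicks = map_reduce(click_files, lambda line: [(line.split()[0], 1)], sum)
    clicks_with_country = hash_join(clicks, users_table)

    # 3. Join 1 and 2 -> MAP -> reduce
    joined = hash_join(tiers, clicks_with_country)
    return map_reduce(
        joined,
        # record is (user, (tier, (clicks, country)))
        lambda r: [((r[1][1][1], r[1][0]), r[1][1][0])],
        sum,
    )

final = dict(run_pipeline())
```

Here `final` maps a (country, spending tier) pair to a click count. A real parallel dataflow engine would plan all three stages as one operation rather than calling them one after another.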

What’s really cool is that this is all planned and executed as one pipelined parallel operation that makes full use of the parallelism of the system. No unnecessary materializing or moving of data, and you can make full use of the parallel hash join and aggregation mechanisms within our parallel dataflow engine.

The net of this: There are a lot of cases where Hadoop does the trick. However, if you want to be able to do highly optimized parallel SQL (with full BI tools support for reporting) and MapReduce (for programmatic analytics) against the same data, you definitely want an engine that can do both. You get the ability to blend SQL and MapReduce. But more importantly, you aren’t pulling terabytes of data from one system to another before processing can even begin.

-Ben

By: Curt Monash

Curt Monash — Wed, 22 Apr 2009 05:19:38 +0000

Hans,

If memory serves, Kognitio, Vertica, and Exasol don’t have master nodes. But most of the rest do.

CAM

By: Hans Gilde

Hans Gilde — Wed, 22 Apr 2009 03:50:26 +0000

Probably depends on the DBMS, but in at least some cases each node has all the querying features of a regular, individual DBMS. That’s what I was thinking of. If you can’t query each node then that’s a different situation.

By: Curt Monash

Curt Monash — Wed, 22 Apr 2009 02:04:10 +0000

Wait a moment! There’s a screwy assumption here (and I’ve been just as guilty of overlooking it as you guys).

There’s no way you can run Hadoop on the same machines as an MPP DBMS and minimize network usage the way you can by integrating MR into the DBMS. With Hadoop sitting alongside, the DBMS gets its results on various nodes, sends them to a head node, and ships them on to the requesting program from there. The MPP DBMS (including one with MR extensions) ships data from peer to peer when it makes sense, in most cases never touching (or overburdening!) the head/master/queen node.
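A back-of-the-envelope way to see the difference is to count row transfers under each path. The model below is a toy with loud assumptions: every node holds some number of result rows, the external path sends each row node -> head -> requesting program, and the peer-to-peer path spreads each node's rows evenly over all nodes so that roughly 1/n of them stay local. The function names are hypothetical and nothing here measures a real system.

```python
def head_node_path(rows_per_node):
    """Every row goes to the head node, then on to the requesting program."""
    total = sum(rows_per_node)
    return {"transfers": 2 * total, "through_head": total}

def peer_to_peer_path(rows_per_node):
    """Rows are redistributed evenly among peers; the head node is bypassed."""
    num_nodes = len(rows_per_node)
    moved = 0
    for rows in rows_per_node:
        # Assumption: about 1/num_nodes of a node's rows are destined for
        # that same node and never cross the network.
        moved += rows - rows // num_nodes
    return {"transfers": moved, "through_head": 0}

cluster = [1000, 1000, 1000, 1000]   # hypothetical result rows on 4 nodes
external = head_node_path(cluster)
integrated = peer_to_peer_path(cluster)
```

Under these assumptions the integrated path moves fewer rows in total, and the bigger difference is that none of them funnel through a single head node.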