April 15, 2009

Cloudera presents the MapReduce bull case

Monday was fire-drill day regarding MapReduce vs. MPP relational DBMS. The upshot was that I was quoted in Computerworld and paraphrased in GigaOm as being a little more negative on MapReduce than I really am, in line with my comment:

Frankly, my views on MapReduce are more balanced than [my] weary negativity would seem to imply.

Tuesday afternoon the dial turned a couple notches more positive yet, when I talked with Michael Olson and Jeff Hammerbacher of Cloudera. Cloudera is a new company, built around the open source MapReduce implementation Hadoop. So far Cloudera gives away its Hadoop distribution, without charging for any sort of maintenance or subscription, and just gets revenue from professional services. Presumably, Cloudera plans for this business model to change down the road.

Much of our discussion revolved around Facebook, where Jeff directed a huge and diverse Hadoop effort. Apparently, Hadoop played much of the role of an enterprise data warehouse at Facebook — at least for clickstream/network data — including:

2 1/2 petabytes of data managed via Hadoop
10 terabytes/day of data ingested via Hadoop (Edit: Some of these metrics have been updated in a subsequent post about Facebook.)
Ad targeting queries run every 15 minutes in Hadoop
Dashboard roll-up queries run every hour in Hadoop
Ad-hoc research/analytic Hadoop queries run whenever
Anti-fraud analysis done in Hadoop
Text mining (e.g., of things written on people’s “walls”) done in Hadoop
100s or 1000s of simultaneous Hadoop queries
JSON-based social network analysis in Hadoop

Some Facebook data, however, was put into an Oracle RAC cluster for business intelligence. And Jeff does concede that query execution is slower in Hadoop than in a relational DBMS. Hadoop was also used to build the index for Facebook’s custom text search engine.

Jeff’s reasons for liking Hadoop over relational DBMS at Facebook included: Read more

Categories: Analytic technologies, Cloudera, Data warehousing, EAI, EII, ETL, ELT, ETLT, Facebook, Hadoop, MapReduce, Petabyte-scale data management, RDF and graphs, Specific users, Web analytics

27 Comments

April 14, 2009

Maybe Amazon should be using a real DBMS after all

Supposedly:

Amazon managers found that an employee who happened to work in France had filled out a field incorrectly and more than 50,000 items got flipped over to be flagged as “adult,” the source said. (Technically, the flag for adult content was flipped from ‘false’ to ‘true.’)

“It’s no big policy change, just some field that’s been around forever filled out incorrectly,” the source said.

Amazon employees worked on the problem well past midnight, and then handed it over to an international team, he said.

This was the best practice for reversing an error — how? Is SimpleDB somehow implicated? If this story is remotely true, and if there’s a sensible database architecture, I can’t imagine why there wouldn’t be a faster fix.
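To make the "faster fix" point concrete, here is a minimal sketch of what I have in mind, assuming a relational store that keeps an audit trail of flag changes. Everything in it is hypothetical: the `items` and `flag_audit` tables, the `change_batch` label, and the helper names are made up for illustration, and nothing here reflects Amazon's actual schema or SimpleDB. It uses SQLite only because that ships with Python.

```python
import sqlite3


def make_catalog():
    """Build a toy in-memory catalog: 10 items, only item 1 legitimately 'adult'."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE items (id INTEGER PRIMARY KEY, is_adult INTEGER NOT NULL);
        -- Hypothetical audit trail: one row per flag change, tagged by batch.
        CREATE TABLE flag_audit (
            item_id      INTEGER NOT NULL,
            old_value    INTEGER NOT NULL,
            new_value    INTEGER NOT NULL,
            change_batch TEXT    NOT NULL
        );
    """)
    with conn:
        conn.executemany(
            "INSERT INTO items (id, is_adult) VALUES (?, ?)",
            [(i, 1 if i == 1 else 0) for i in range(1, 11)],
        )
    return conn


def apply_batch(conn, batch, item_ids, new_value):
    """Change the flag on the given items, recording the old value of each."""
    with conn:  # one transaction: commits on success, rolls back on error
        for item_id in item_ids:
            (old,) = conn.execute(
                "SELECT is_adult FROM items WHERE id = ?", (item_id,)
            ).fetchone()
            conn.execute(
                "INSERT INTO flag_audit VALUES (?, ?, ?, ?)",
                (item_id, old, new_value, batch),
            )
            conn.execute(
                "UPDATE items SET is_adult = ? WHERE id = ?", (new_value, item_id)
            )


def revert_batch(conn, batch):
    """Restore every item touched by `batch` to its recorded old value."""
    with conn:
        cur = conn.execute(
            """UPDATE items
               SET is_adult = (SELECT old_value FROM flag_audit
                               WHERE item_id = items.id AND change_batch = ?)
               WHERE id IN (SELECT item_id FROM flag_audit
                            WHERE change_batch = ?)""",
            (batch, batch),
        )
        return cur.rowcount  # number of rows restored


def adult_count(conn):
    return conn.execute(
        "SELECT COUNT(*) FROM items WHERE is_adult = 1"
    ).fetchone()[0]


if __name__ == "__main__":
    conn = make_catalog()
    apply_batch(conn, "bad-edit", range(1, 6), 1)   # the mistaken flip
    print("flagged after bad edit:", adult_count(conn))
    print("rows restored:", revert_batch(conn, "bad-edit"))
    print("flagged after revert:", adult_count(conn))
```

The revert is a single set-based statement inside one transaction, and it restores old values rather than blindly setting the flag to false, so items that were legitimately flagged before the mistake stay flagged. Whether anything like this was available in the real system is exactly what I can't tell from the story.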

Categories: Amazon and its cloud

7 Comments

April 14, 2009

There always seems to be a fire drill around MapReduce news

Last August I flew out to see my new clients at Greenplum. They told me they planned to roll out MapReduce in a few weeks, and asked for my help in publicizing it. From their offices I went to dinner with non-clients Aster Data, who told me they’d gotten wind of a Greenplum MapReduce announcement and planned to come out ahead of it. A couple of hours later, Aster signed up as a client. In something of a pickle — but not one of my own making — I knocked heads, and persuaded both vendors to announce MapReduce at the same time, namely the following Monday. Lots of publicity ensued for both vendors, and everybody was reasonably satisfied. Read more

Categories: About this blog, Analytic technologies, Aster Data, Greenplum, MapReduce, Michael Stonebraker, Vertica Systems

1 Comment

April 14, 2009

eBay thinks MPP DBMS clobber MapReduce

Last week I talked with Oliver Ratzesberger and his team at eBay, whom I already knew to be MapReduce non-fans. This time I added more detail.

Oliver believes that, on the whole, MapReduce is 6-8X slower than native functionality in an MPP DBMS, and hence should only be used sporadically. This view is based in part on simulations eBay ran of the Terasort benchmark. On 72 Teradata nodes or 96 lower-powered nodes running another (currently unnamed, as per yet another of my PR fire drills) MPP DBMS, a simulation of Terasort executed in 78 and 120 secs respectively, which is very comparable to the times Google and Yahoo got on 1000 nodes or more.

And by the way, if you use many fewer nodes, you also consume much less floor space or electric power.
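One rough way to put numbers on the hardware argument is node-seconds, i.e. nodes multiplied by elapsed time. The sketch below uses only the two eBay figures quoted above; the function names are mine, the 1000-node comparison cluster is a round-number assumption, and node-seconds is a crude yardstick, since the nodes in these clusters were not equally powerful.

```python
# Node-seconds = nodes * elapsed seconds: a crude measure of hardware consumed.
# The two eBay runs come from the post; everything else is illustrative.

def node_seconds(nodes: int, seconds: float) -> float:
    """Total node-seconds consumed by a run."""
    return nodes * seconds


def break_even_seconds(nodes: int, seconds: float, other_nodes: int) -> float:
    """Elapsed time at which a cluster of `other_nodes` would consume
    the same number of node-seconds as the given run."""
    return node_seconds(nodes, seconds) / other_nodes


EBAY_RUNS = {
    "Teradata (72 nodes)": (72, 78),
    "unnamed MPP DBMS (96 nodes)": (96, 120),
}

if __name__ == "__main__":
    for name, (n, s) in EBAY_RUNS.items():
        print(
            f"{name}: {node_seconds(n, s):,.0f} node-seconds; "
            f"a 1000-node cluster breaks even at "
            f"{break_even_seconds(n, s, 1000):.2f} s"
        )
```

In other words, a 1000-node cluster would have to finish in a handful of seconds to match either eBay run on node-seconds, which is the floor space and electric power point in arithmetic form.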

Categories: Analytic technologies, eBay, Hadoop, MapReduce, Parallelization, Teradata

11 Comments

April 14, 2009

Stonebraker, DeWitt, et al. compare MapReduce to DBMS

Along with five other coauthors (the lead author seems to be Andy Pavlo), famous MapReduce non-fans Mike Stonebraker and David DeWitt have posted a SIGMOD 2009 paper called “A Comparison of Approaches to Large-Scale Data Analysis.” The heart of the paper is benchmarks of Hadoop, Vertica, and “DBMS-X” on identical clusters of 100 low-end nodes, across a series of tests including (if I understood correctly):

A couple of different flavors of a Grep task originally proposed in a Google MapReduce paper.
A database query on simulated clickstream data.
A join on the same clickstream data.
Two aggregations on the clickstream data.

Categories: Analytic technologies, Hadoop, MapReduce, Michael Stonebraker, Parallelization, Vertica Systems

6 Comments

April 3, 2009

Amazon Elastic MapReduce

Amazon is introducing a beta of Amazon Elastic MapReduce. What it boils down to is cheap, on-demand Hadoop.

This seems like a great way to experiment with MapReduce and see if you like it. But for serious use, I don’t know why you wouldn’t prefer MapReduce more closely integrated into a DBMS.

Categories: Amazon and its cloud, Cloud computing, MapReduce

1 Comment

April 3, 2009

CSQL: Yet another in-memory DBMS for caching

A few of you care about obscure in-memory DBMS products. Well, I was just e-mailed about another one, apparently called CSQL or CSQLcache. As of now, CSQL has a SourceForge website, a Wikipedia entry, and a blog.

One interesting thing on that blog is a taxonomy of caches — Level 1 cache, Level 2 cache, RAM, disk, etc., with some approximate figures for lookup times. Edit: However, Kevin Closson emailed me to say it’s way out of date. Stay tuned to his blog for more on the subject.

Categories: Cache, In-memory DBMS, Memory-centric data management

3 Comments

April 2, 2009

Ingres update

I talked with Ingres today. Much of the call was fluff — open-source rah-rah, plus some numbers showing purported success, but so finely parsed as to be pretty meaningless. (To Ingres’ credit, they did offer to let me talk w/ their CFO, even if they offered no promises as to whether he’d offer any more substantive information.) Highlights included: Read more

Categories: Actian and Ingres, Data warehousing, EnterpriseDB and Postgres Plus, MySQL, Open source, Oracle, PostgreSQL, Sybase

6 Comments

April 1, 2009

Donald Farmer knocks the April Fool 8-ball out of the park

Donald Farmer has an excellently-crafted April Fool post about a revolution in business intelligence. Look at the character names, for example.

I wonder whether Donald learned operations research from that textbook where two main decision-making characters were Mark Off and his father Pop, an example company was Edifice Wrecks, and an example CEO was Dawn Shirley Light …

Categories: Analytic technologies, Business intelligence, Humor

1 Comment

April 1, 2009

April Fool’s Day highlights

Amazon says it’s taking “cloud” computing to new heights, as it were.

Derivative funds and large government-subsidized entities will be especially interested in FACE’s transmodal operation. They can allocate a dedicated FACE, load it up with data, and then send it out to sea to perform advanced processing in safety. The government will have absolutely no chance of acting against them, because they will be too busy trying to decide which Federal Air Regulation (FAR) was violated, not to mention scheduling news conferences.

First excellent April Fool’s joke I saw this year was from The Guardian. The best so far is from Expedia. Others are linked in my Twitter feed. And personally, I’m encouraging the concept of April No-Fooling Day.

Categories: Amazon and its cloud, Cloud computing, Humor

1 Comment



Monash Research blogs

DBMS 2 covers database management, analytics, and related technologies.
Text Technologies covers text mining, search, and social software.
Strategic Messaging analyzes marketing and messaging strategy.
The Monash Report examines technology and public policy issues.
Software Memories recounts the history of the software industry.

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.

Links
- Monash Research
- White Papers
