MapReduce
Analysis of implementations of, and issues associated with, the parallel programming framework MapReduce.
Comments on the 2012 Forrester Wave: Enterprise Hadoop Solutions
Forrester has released its Q1 2012 Forrester Wave: Enterprise Hadoop Solutions. (Googling turns up a direct link, but in case that doesn’t prove stable, here also is a registration-required link from IBM’s Conor O’Mahony.) My comments include:
- The Forrester Wave’s relative vendor rankings are meaningless, in that the document compares apples, peaches, almonds, and peanuts. Apparently, it covers any vendor that incorporates a distribution of Apache Hadoop MapReduce into something it offers, and that provided at least two (not necessarily full-production) references for it.
- The Forrester Wave for “enterprise Hadoop” contradicts itself on the subject of Hortonworks.
- The Forrester Wave for “enterprise Hadoop” is correct when it says “Hortonworks … has Hadoop training and professional services offerings that are still embryonic.”
- Peculiarly, the Forrester Wave for “enterprise Hadoop” also says “Hortonworks offers an impressive Hadoop professional services portfolio”. Hortonworks will likely win one or more nice partnership deals with vendors in adjacent fields, but even so its professional services capabilities are … well, a good word might be “embryonic”.
- Forrester Waves always seem to have weird implicit definitions of “data warehousing”. This one is no exception.
- Forrester gave top marks in “Functionality” to 11 of 13 “enterprise Hadoop” vendors. This seems odd.
- I don’t know why MapR, which doesn’t like HDFS (Hadoop Distributed File System), got top marks in “Subproject integration”.
- Forrester gave top marks in “Storage” to Datameer. It also gave higher marks to MapR than to EMC Greenplum, even though EMC Greenplum’s technology is a superset of MapR’s. Very strange. (Edit: Actually, as per a comment below, there is some uncertainty about the EMC/MapR relationship.)
- Forrester gave higher marks in “Acceleration and optimization” to Hortonworks than to Cloudera and IBM, and higher marks yet to Pentaho. Very odd.
- I’m not sure what Forrester is calling a “Distributed EDW file store connector”, but it sounds like something that Cloudera has provided via partnership to a number of analytic DBMS vendors.
- Forrester’s “Strategy” rankings seem to correlate to a metric of “We’re a large enough vendor to go in N directions at once”, for various values of N.
- Forrester is correct to rank Cloudera’s “Adoption” as being stronger than EMC/Greenplum’s or MapR’s. But Hortonworks’ strong mark for “Adoption” baffles me.
Notes on the Oracle Big Data Appliance
Oracle announced its Big Data Appliance. Specs may be found in the Oracle Big Data Appliance press release. Beyond that:
- The most important software on the Oracle Big Data Appliance is a full set of Cloudera Enterprise code. Oracle will do Tier 1 Cloudera/Hadoop support, while Cloudera handles Tiers 2 and 3.
- The key spec ratios are 1 core/4 GB RAM/3 TB raw disk. That’s reasonably in line with Cloudera figures I published in June, 2010.
- This is really Oracle’s multi-structured big data appliance. Oracle’s relational big data appliance is Exadata, which has been out for years and has comparable capacity to Oracle’s new “Big Data Appliance.” (Chris Preimesberger made a similar point.)
- The Oracle Big Data Appliance list price is $450,000 for 18 12-core servers, plus $54,000/year maintenance.
- That’s around $25,000 per server (and associated storage).
- That’s also around $2,000/core.
- That’s also around $700/TB of raw spinning disk, before compression (216 cores × 3 TB/core = 648 TB).
- None of those per-unit figures sounds ridiculous …
- … but because of Oracle’s appliance configuration there’s indeed a hefty minimum initial purchase.
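The per-unit figures above follow directly from the list price and the spec ratios. Here is a minimal sketch of that arithmetic. It assumes the 1 core/3 TB raw-disk ratio quoted earlier applies uniformly to every core in the rack; the function and variable names are purely illustrative.

```python
# Figures quoted in the post.
LIST_PRICE_USD = 450_000
SERVERS = 18
CORES_PER_SERVER = 12
# Assumption: the 1 core : 3 TB raw-disk spec ratio holds across the whole rack.
RAW_TB_PER_CORE = 3


def per_unit_prices(price, servers, cores_per_server, tb_per_core):
    """Return the list price divided by servers, cores, and raw terabytes."""
    cores = servers * cores_per_server
    raw_tb = cores * tb_per_core
    return {
        "total_cores": cores,
        "total_raw_tb": raw_tb,
        "usd_per_server": price / servers,
        "usd_per_core": price / cores,
        "usd_per_raw_tb": price / raw_tb,
    }


prices = per_unit_prices(LIST_PRICE_USD, SERVERS, CORES_PER_SERVER, RAW_TB_PER_CORE)
for name, value in prices.items():
    print(f"{name}: {value:,.0f}")
```

Maintenance ($54,000/year) is left out, so these are purchase-price figures only, not a total cost of ownership.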
A couple of links explaining Cloudera Manager
Predictably, I wasn’t pre-briefed on the details of Oracle’s Big Data Appliance announcement today, and an inquiry to partner Cloudera doesn’t happen to have been immediately answered.* But anyhow, it’s clear from coverage by Larry Dignan and Derrick Harris that Oracle’s Big Data Appliance includes:
- Some version of Cloudera Manager (I’m guessing more or less the best one).*
- Some version of Apache Hadoop (I’m guessing the same distribution that Cloudera prefers to use).*
- Some kind of support.
In other words, it’s a lot like getting Cloudera Enterprise,* plus some hardware, plus some other stuff.
*Edit: About 2 minutes after I posted this, I got email from Cloudera CEO Mike Olson. Yes, the Oracle Big Data Appliance bundles Cloudera Enterprise.
That raises an anyway recurring question: What exactly is Cloudera Manager? Read more
Hadapt is moving forward
I’ve talked with my clients at Hadapt a couple of times recently. News highlights include:
- The Hadapt 1.0 product is going “Early Access” today.
- General availability of Hadapt 1.0 is targeted for an officially unspecified time frame, but it’s soon.
- Hadapt raised a nice round of venture capital.
- Hadapt added Sharmila Mulligan to the board.
- Dave Kellogg is in the picture too, albeit not as involved as Sharmila.
- Hadapt has moved the company to Cambridge, which is preferable to Yale environs for obvious reasons. (First location = space they’re borrowing from their investors at Bessemer.)
- Headcount is in the low teens, with a target of doubling fast.
The Hadapt product story hasn’t changed significantly from what it was before. Specific points I can add include: Read more
MarkLogic’s Hadoop connector
It’s time to circle back to a subject I skipped when I otherwise wrote about MarkLogic 5: MarkLogic’s new Hadoop connector.
Most of what’s confusing about the MarkLogic Hadoop Connector lies in two pairs of options it presents you:
- Hadoop can talk XQuery to MarkLogic. But alternatively, Hadoop can use a long-established simple(r) Java API for streaming documents into or out of a MarkLogic database.
- Hadoop can make requests to MarkLogic in MarkLogic’s normal mode of operation, namely to address any node in the MarkLogic cluster, which then serves as a “head” node for the duration of that particular request. But alternatively, Hadoop can use a long-standing MarkLogic option to circumvent the whole DBMS cluster and only talk to one specific MarkLogic node.
Otherwise, the whole thing is just what you would think:
- Hadoop can read from and write to MarkLogic, in parallel at both ends.
- If Hadoop is just writing to MarkLogic, there’s a good chance the process is properly called “ETL.”
- If Hadoop is reading a lot from MarkLogic, there’s a good chance the process is properly called “batch analytics.”
MarkLogic said that it wrote this Hadoop connector itself.
IBM is buying parallelization expert Platform Computing
IBM is acquiring Platform Computing, a company with which I had one briefing, last August. Quick background includes: Read more
Cloudera versus Hortonworks
A few weeks ago I wrote:
The other big part of Hortonworks’ story is the claim that it holds the axe in Apache Hadoop development.
and
… just how dominant Hortonworks really is in core Hadoop development is a bit unclear. Meanwhile, Cloudera people seem to be leading a number of Hadoop companion or sub-projects, including the first two I can think of that relate to Hadoop integration or connectivity, namely Sqoop and Flume. So I’m not persuaded that the “we know this stuff better” part of the Hortonworks partnering story really holds up.
Now Mike Olson — CEO of my client Cloudera — has posted his analysis of the matter, in response to an earlier Hortonworks post asserting its claims. In essence, Mike argues:
- It’s ridiculous to say any one company, e.g. Hortonworks, has a controlling position in Hadoop development.
- Such diversity is a Very Good Thing.
- Cloudera folks now contribute and always have contributed to Hadoop at a higher rate than Hortonworks folks.
- If you consider just core Hadoop projects (the most favorable way of counting from a Hortonworks standpoint), Hortonworks has a lead, but not all that big of one.
Some notes on Hadoop (mainly) and appliances
1. EMC Greenplum has evolved its appliance product line. As I read that, the latest announcement boils down to saying that you can neatly network together various Greenplum appliances in quarter-rack increments. If you take a quarter rack each of four different things, then Greenplum says “Hooray! Our appliance is all-in-one!” Big whoop.
2. That said, the Hadoop part of EMC’s story is based on MapR, which so far as I can tell is actually a pretty good Hadoop implementation. More precisely, MapR makes strong claims about performance and so on, and Apache Hadoop folks don’t reply “MapR is full of &#$!” Rather, they say “We’re going to close the gap with MapR a lot faster than the MapR folks like to think. And by the way, guys, thanks for the butt-kick.” A lot more precision about MapR may be found in this M. C. Srivas SlideShare.
3. On its latest earnings call, Oracle clearly said it would introduce a Hadoop appliance, versus just hinting at a Hadoop appliance the prior quarter. The money quote was: Read more
Hadoop notes
I visited California recently, and chatted with numerous companies involved in Hadoop — Cloudera, Hortonworks, MapR, DataStax, Datameer, and more. I’ll defer further Hadoop technical discussions for now — my target to restart them is later this month — but that still leaves some other issues to discuss, namely adoption and partnering.
The total number of enterprises in the world paying subscription and license fees that they would regard as being for “Hadoop or something Hadoop-related” probably is not much over 100 right now, but I’d expect to see pretty rapid growth. Beyond that, let’s divide customers into three groups:
- Internet businesses.
- Traditional enterprises’ internet operations.
- Traditional enterprises’ other operations.
Hadoop vendors, in different mixes, claim to be doing well in all three segments. Even so, almost all use cases involve some kind of machine-generated data, with one exception being a credit card vendor crunching a large database of transaction details. Multiple kinds of machine-generated data come into play — web/network/mobile device logs, financial trade data, scientific/experimental data, and more. In particular, pharmaceutical research got some mentions, which makes sense, in that it’s one area of scientific research that actually enjoys fat for-profit research budgets.
Hadoop evolution
I wanted to learn more about Hadoop and its futures, so I talked Friday with Arun Murthy of Hortonworks.* Most of what we talked about was:
- NameNode evolution, and the related issue of file-count limitations.
- JobTracker evolution.
Arun previously addressed these issues and more in a June slide deck.
