MapReduce
Analysis of implementations of and issues associated with the parallel programming framework MapReduce. Related subjects include:
IBM is buying parallelization expert Platform Computing
IBM is acquiring Platform Computing, a company with which I had one briefing, last August. Quick background includes: Read more
| Categories: Hadoop, IBM and DB2, Investment research and trading, MapReduce, Parallelization, Scientific research | 5 Comments |
Cloudera versus Hortonworks
A few weeks ago I wrote:
The other big part of Hortonworks’ story is the claim that it holds the axe in Apache Hadoop development.
and
… just how dominant Hortonworks really is in core Hadoop development is a bit unclear. Meanwhile, Cloudera people seem to be leading a number of Hadoop companion or sub-projects, including the first two I can think of that relate to Hadoop integration or connectivity, namely Sqoop and Flume. So I’m not persuaded that the “we know this stuff better” part of the Hortonworks partnering story really holds up.
Now Mike Olson — CEO of my client Cloudera — has posted his analysis of the matter, in response to an earlier Hortonworks post asserting its claims. In essence, Mike argues:
- It’s ridiculous to say any one company, e.g. Hortonworks, has a controlling position in Hadoop development.
- Such diversity is a Very Good Thing.
- Cloudera folks now contribute and always have contributed to Hadoop at a higher rate than Hortonworks folks.
- If you consider just core Hadoop projects — the most favorable way of counting from a Hadoop standpoint — Hortonworks has a lead, but not all that big of one.
| Categories: Cloudera, Hadoop, Hortonworks, MapReduce, Open source | 6 Comments |
Some notes on Hadoop (mainly) and appliances
1. EMC Greenplum has evolved its appliance product line. As I read that, the latest announcement boils down to saying that you can neatly network together various Greenplum appliances in quarter-rack increments. If you take a quarter rack each of four different things, then Greenplum says “Hooray! Our appliance is all-in-one!” Big whoop.
2. That said, the Hadoop part of EMC ‘s story is based on MapR, which so far as I can tell is actually a pretty good Hadoop implementation. More precisely, MapR makes strong claims about performance and so on, and Apache Hadoop folks don’t reply “MapR is full of &#$!” Rather, they say “We’re going to close the gap with MapR a lot faster than the MapR folks like to think — and by the way, guys, thanks for the butt-kick.” A lot more precision about MapR may be found in this M. C. Srivas SlideShare.
3. On its latest earnings call, Oracle clearly said it would introduce a Hadoop appliance, versus just hinting at a Hadoop appliance the prior quarter. The money quote was: Read more
| Categories: Data warehouse appliances, eBay, EMC, Greenplum, Hadoop, MapR, MapReduce, Open source, Oracle | 2 Comments |
Hadoop notes
I visited California recently, and chatted with numerous companies involved in Hadoop — Cloudera, Hortonworks, MapR, DataStax, Datameer, and more. I’ll defer further Hadoop technical discussions for now — my target to restart them is later this month — but that still leaves some other issues to discuss, namely adoption and partnering.
The total number of enterprises in the world paying subscription and license fees that they would regard as being for “Hadoop or something Hadoop-related” probably is not much over 100 right now, but I’d expect to see pretty rapid growth. Beyond that, let’s divide customers into three groups:
- Internet businesses.
- Traditional enterprises ‘ internet operations.
- Traditional enterprises’ other operations.
Hadoop vendors, in different mixes, claim to be doing well in all three segments. Even so, almost all use cases involve some kind of machine-generated data, with one exception being a credit card vendor crunching a large database of transaction details. Multiple kinds of machine-generated data come into play — web/network/mobile device logs, financial trade data, scientific/experimental data, and more. In particular, pharmaceutical research got some mentions, which makes sense, in that it’s one area of scientific research that actually enjoys fat for-profit research budgets.
| Categories: Cloudera, Hadoop, Health care, Hortonworks, Investment research and trading, Log analysis, MapR, MapReduce, Market share and customer counts, Scientific research, Web analytics | 5 Comments |
Hadoop evolution
I wanted to learn more about Hadoop and its futures, so I talked Friday with Arun Murthy of Hortonworks.* Most of what we talked about was:
- NameNode evolution, and the related issue of file-count limitations.
- JobTracker evolution.
Arun previously addressed these issues and more in a June slide deck.
Read more
| Categories: Hadoop, MapReduce, Parallelization, Workload management, Yahoo | 6 Comments |
Introduction to Zettaset
Zettaset is confusing, but as best I understand:
- Zettaset sells Hadoop add-on/enhancement software, with what might be called an enterprise-friendly Hadoop management focus.
- “Business intelligence” gets mentioned prominently in Zettaset’s marketing, but not in what executives now say. Apparently the BI focus is old news, predating a hard pivot.
- Zettaset’s marketing also mentions NoSQL, for little reason that I can discern, except insofar as Zettaset relies on HBase.
- CEO Brian Christian told me that Zettaset has been around since December, 2007; on the other hand, Zettaset press release boilerplate says Zettaset was founded in 2009. Apparently, the distinction is that Zettaset was founded in 2007 as a consultancy, but turned its efforts to software development in 2009.
- Zettaset has fewer than 20 people but is “hiring like mad.”
- Zettaset just did a $3 million Series A round — or maybe is just announcing it now; the latter interpretation might explain how those 20ish people are getting paid.
- Zettaset’s product was just launched and made generally available, notwithstanding that Version 2 of Zettaset’s product was shipped last year to fanfare on Zettaset’s blog.
- Zettaset’s pricing is based on how many terabytes of compressed data it is being used to manage.
- Until very recently, Zettaset was called GOTO Metrics; I imagine the name change is connected to the strategy pivot.
- Zettaset told me of one big customer — with an almost-petabyte Hadoop cluster before compression — namely Zions Bancorporation.
- Zettaset has “a number of paying customers” overall.
| Categories: Hadoop, HBase, MapReduce, Market share and customer counts, Zettaset | 4 Comments |
Hadoop futures and enhancements
Hadoop is immature technology. As such, it naturally offers much room for improvement in both industrial-strengthness and performance. And since Hadoop is booming, multiple efforts are underway to fill those gaps. For example:
- Cloudera’s proprietary code is focused on management, set-up, etc.
- The “Phase 1″ plans Hortonworks shared with me for Apache Hadoop are focused on industrial-strengthness, as are significant parts of “Phase 2″.*
- MapR tells a performance story versus generic Apache Hadoop HDFS and MapReduce. (One aspect of same is just C++ vs. Java.)
- So does Hadapt, but mainly vs. Hive.
- Cloudera also tells me there’s a potential 4-5X performance improvement in Hive coming down the pike from what amounts to an optimizer rewrite.
(Zettaset belongs in the discussion too, but made an unfortunate choice of embargo date.)
| Categories: Cloudera, Greenplum, Hadapt, Hadoop, HBase, MapR, MapReduce, Parallelization, Zettaset | 20 Comments |
Cloudera and Hortonworks
My clients at Cloudera have been around for a while, in effect positioned as “the Hadoop company.” Their business, in a nutshell, consists of:
- Packaging up a Cloudera distribution of Apache Hadoop. This distribution doesn’t have proprietary code; it’s just packaged by Cloudera from Apache projects (with a decent minority of the code happening to have been contributed by Cloudera engineers).
- Paid subscription support for Apache Hadoop and, in connection with that …
- … proprietary software that all support customers automatically get. There are two points to this proprietary software:
- It adds value for the customer.
- It makes Cloudera’s support job easier.
- Professional services around Hadoop.
- Training and conferences around Hadoop, which probably don’t generate all that much money, but are great marketing in terms of visibility, thought leadership, and lead generation.
Hortonworks spun out of Yahoo last week, with parts of the Cloudera business model, namely Hadoop support, training, and I guess conferences. Hortonworks emphatically rules out professional services, and says that it will contribute all code back to Apache Hadoop. Hortonworks does grudgingly admit that it might get into the proprietary software business at some point — but evidently hopes that day will never actually come.
| Categories: Cloudera, Hadoop, Hortonworks, IBM and DB2, MapReduce, Open source, Yahoo | 9 Comments |
Hadapt update
I met with the Hadapt guys today. I think I can be a bit crisper than before in positioning Hadapt and its use cases, namely:
- Hadapt is additional software on a cluster that also runs fully functional Hadoop/HDFS. (Cloudera Hadoop more than straight-from-Apache Hadoop to date, but that’s not a requirement.)
- The cluster also runs a DBMS on every node, such as PostgreSQL or one of Infobright/Vectorwise.
- Hadapt’s software manages parallel SQL queries by distributing them to the DBMS living on each node. Hadapt says that the resulting query performance far outshines Hive’s.
- Hadapt further says that, by exploiting the partner DBMS, its SQL functionality outpaces Hive’s as well.
- Target Hadapt use cases are centered around keeping machine-generated or other poly-structured data in Hadoop, and extracting, enhancing, or otherwise deriving some of it to live in the relational store.
- In particular, Hadapt seems like an interesting choice when you want to use that relational data as you work on other data that’s still in HDFS, or if you want to keep using the relational data in other kinds of MapReduce jobs.
- That all fits well with my thoughts about the importance of derived data.
Other evolution from what I wrote about Hadapt a few months ago includes:
- Hadapt is in beta now.
- Hadapt has added adult supervision in the form of Philip Wickline, late of Endeca.
In other news, Hadapt is our newest client.
| Categories: Analytic technologies, Cloudera, Data models and architecture, Data warehousing, Hadapt, Hadoop, Infobright, MapReduce, Open source, PostgreSQL, VectorWise | Leave a Comment |
Petabyte-scale Hadoop clusters (dozens of them)
I recently learned that there are 7 Vertica clusters with a petabyte (or more) each of user data. So I asked around about other petabyte-scale clusters. It turns out that there are several dozen such clusters (at least) running Hadoop.
Cloudera can identify 22 CDH (Cloudera Distribution [of] Hadoop) clusters holding one petabyte or more of user data each, at 16 different organizations. This does not count Facebook or Yahoo, who are huge Hadoop users but not, I gather, running CDH. Meanwhile, Eric Baldeschwieler of Hortonworks tells me that Yahoo’s latest stated figures are:
- 42,000 Hadoop nodes …
- … holding 180-200 petabytes of data.
