I talked with Cloudera yesterday about an unannounced technology, and took the opportunity to ask some non-embargoed questions as well. In particular, I requested an update to what I wrote last year about typical Hadoop hardware.
Cloudera thinks the picture now is:
- 2-socket servers, with 4- or 6-core chips.
- Increasing number of spindles, with 12 2-TB spindles being common.
- 48 gigs of RAM is most common, with 64-96 fairly frequent.
- A couple of 1GigE networking ports.
Discussion around that included:
- Enterprises had been running out of storage space; hence the increased amount of storage.
- Even more storage can be stuffed on a node, and at times is. But at a certain point there’s so much data on a node that recovery from node failure is too forbidding.
- There are some experiments with 10 GigE.
|Categories: Cloudera, Data warehouse appliances, Hadoop, MapR, Solid-state memory, Storage||7 Comments|
My clients at Hadapt are coming out with a Version 2 to be available in Q1 2013, and perhaps slipstreaming some of the features before then. At that point, it will be reasonable to regard Hadapt as offering:
- A very tight integration between an RDBMS-based analytic platform and Hadoop …
- … that is decidedly immature as an analytic RDBMS …
- … but which strongly improves the SQL capabilities of Hadoop (vs., say, the alternative of using Hive).
Solr is in the mix as well.
Hadapt+Hadoop is positioned much more as “better than Hadoop” than “a better scale-out RDBMS”– and rightly so, due to its limitations when viewed strictly from an analytic RDBMS standpoint. I.e., Hadapt is meant for enterprises that want to do several of:
- Dump multi-structured data into Hadoop.
- Refine or just move some of it into an RDBMS.
- Bring in data from other RDBMS.
- Process of all the above via Hadoop MapReduce.
- Process of all the above via SQL.
- Use full-text indexes on the data.
Hadapt has 6 or so production customers, a dozen or so more coming online soon, 35 or so employees (mainly in Cambridge or Poland), reasonable amounts of venture capital, and the involvement of a variety of industry luminaries. Hadapt’s biggest installation seems to have 10s of terabytes of relational data and 100s of TBs of multi-structured; Hadapt is very confident in its ability to scale an order of magnitude beyond that with the Version 2 product, and reasonably confident it could go even further.
At the highest level, Hadapt works like this: Read more
|Categories: Analytic technologies, Cloudera, Columnar database management, Data models and architecture, Data warehousing, Hadapt, Hadoop, MapR, MapReduce, Market share and customer counts, SQL/Hadoop integration, Text||4 Comments|
This is part of a four-post series, covering:
- Annoying Hadoop marketing themes that should be ignored.
- Hadoop versions and distributions, and their readiness or lack thereof for production.
- In general, how “enterprise-ready” is Hadoop (this post)?
- HBase 0.92.
The posts depend on each other in various ways.
Cloudera, Hortonworks, and MapR all claim, in effect, “Our version of Hadoop is enterprise-ready, unlike those other guys’.” I’m dubious.
- Hortonworks has considerably fewer features than Cloudera, along with less of a production or support track record. (Edit: HCatalog may be a significant exception.)
- I doubt Cloudera really believes or can support the apparent claim in its CDH 4 press release that Hadoop is now suitable for every enterprise, whereas last month it wasn’t.
- While MapR was early with some nice enterprise features, such as high availability or certain management UI elements — quickly imitated in Cloudera Enterprise — I don’t think it has any special status as “enterprise-ready” either.
That said, “enterprise-ready Hadoop” really is an important topic.
So what does it mean for something to be “enterprise-ready”, in whole or in part? Common themes in distinguishing between “enterprise-class” and other software include:
- Usable by our existing staff.
- Sufficiently feature-rich.
- Integrates well with the rest of our environment.
- Fits well into our purchasing and vendor relations model.
- Sufficiently reliable, proven, and secure — which is to say, “safe”.
For Hadoop, as for most things, these concepts overlap in many ways. Read more
|Categories: Buying processes, Cloudera, Clustering, Hadoop, HBase, Hortonworks, MapR, MapReduce, Open source||8 Comments|
I visited my clients at Cloudera and Hortonworks last week, along with scads of other companies. A few of the takeaways were:
- Cloudera now has 220 employees.
- Cloudera now has over 100 subscription customers.
- Over the past year, Cloudera has more than doubled in size by every reasonable metric.
- Over half of Cloudera’s customers use HBase, vs. a figure of 18+ last July.
- Omer Trajman — who by the way has made a long-overdue official move into technical marketing — can no longer keep count of how many petabyte-scale Hadoop clusters Cloudera supports.
- Cloudera gets the majority of its revenue from subscriptions. However, professional services and training continue to be big businesses too.
- Cloudera has trained over 12,000 people.
- Hortonworks is training people too.
- Hortonworks now has 70 employees, and plans to have 100 or so by the end of this quarter.
- A number of those Hortonworks employees are executives who come from seriously profit-oriented backgrounds. Hortonworks clearly has capitalist intentions.
- Hortonworks thinks a typical enterprise Hadoop cluster has 20-50 nodes, with 50-100 already being on the large side.
- There are huge amounts of Elastic MapReduce/Hadoop processing in the Amazon cloud. Some estimates say it’s the majority of all Amazon Web Services processing.
- I met with 4 young-company clients who I regard as building vertical analytic stacks (WibiData, MarketShare, MetaMarkets, and ClearStory). All 4 are heavily dependent on Hadoop. (The same isn’t as true of older companies who built out a lot of technology before Hadoop was invented.)
- There should be more HBase information at HBaseCon on May 22.
- If MapR still has momentum, nobody I talked with has noticed.
|Categories: Amazon and its cloud, ClearStory Data, Cloud computing, Cloudera, Hadoop, HBase, Hortonworks, MapR, MapReduce, Market share and customer counts, Petabyte-scale data management, WibiData||1 Comment|
Forrester has released its Q1 2012 Forrester Wave: Enterprise Hadoop Solutions. (Googling turns up a direct link, but in case that doesn’t prove stable, here also is a registration-required link from IBM’s Conor O’Mahony.) My comments include:
- The Forrester Wave’s relative vendor rankings are meaningless, in that the document compares apples, peaches, almonds, and peanuts. Apparently, it covers any vendor that includes a distribution of Apache Hadoop MapReduce into something it offers, and that offered at least two (not necessarily full production) references for same.
- The Forrester Wave for “enterprise Hadoop” contradicts itself on the subject of Hortonworks.
- The Forrester Wave for “enterprise Hadoop” is correct when it says “Hortonworks … has Hadoop training and professional services offerings that are still embryonic.”
- Peculiarly, the Forrester Wave for “enterprise Hadoop” also says “Hortonworks offers an impressive Hadoop professional services portfolio”. Hortonworks will likely win one or more nice partnership deals with vendors in adjacent fields, but even so its professional services capabilities are … well, a good word might be “embryonic”.
- Forrester Waves always seem to have weird implicit definitions of “data warehousing”. This one is no exception.
- Forrester gave top marks in “Functionality” to 11 of 13 “enterprise Hadoop” vendors. This seems odd.
- I don’t know why MapR, which doesn’t like HDFS (Hadoop Distributed File System), got top marks in “Subproject integration”.
- Forrester gave top marks in “Storage” to Datameer. It also gave higher marks to MapR than to EMC Greenplum, even though EMC Greenplum’s technology is a superset of MapR’s. Very strange. (Edit: Actually, as per a comment below, there is some uncertainty about the EMC/MapR relationship.)
- Forrester gave higher marks in “Acceleration and optimization” to Hortonworks than to Cloudera and IBM, and higher marks yet to Pentaho. Very odd.
- I’m not sure what Forrester is calling a “Distributed EDW file store connector”, but it sounds like something that Cloudera has provided via partnership to a number of analytic DBMS vendors.
- Forrester’s “Strategy” rankings seem to correlate to a metric of “We’re a large enough vendor to go in N directions at once”, for various values of N.
- Forrester is correct to rank Cloudera’s “Adoption” as being stronger than EMC/Greenplum’s or MapR’s. But Hortonworks’ strong mark for “Adoption” baffles me.
|Categories: Cloudera, Data warehousing, EMC, Greenplum, Hadoop, Hortonworks, MapR, MapReduce, Pentaho||11 Comments|
1. EMC Greenplum has evolved its appliance product line. As I read that, the latest announcement boils down to saying that you can neatly network together various Greenplum appliances in quarter-rack increments. If you take a quarter rack each of four different things, then Greenplum says “Hooray! Our appliance is all-in-one!” Big whoop.
2. That said, the Hadoop part of EMC ‘s story is based on MapR, which so far as I can tell is actually a pretty good Hadoop implementation. More precisely, MapR makes strong claims about performance and so on, and Apache Hadoop folks don’t reply “MapR is full of &#$!” Rather, they say “We’re going to close the gap with MapR a lot faster than the MapR folks like to think — and by the way, guys, thanks for the butt-kick.” A lot more precision about MapR may be found in this M. C. Srivas SlideShare.
|Categories: Data warehouse appliances, eBay, EMC, Greenplum, Hadoop, MapR, MapReduce, Open source, Oracle||2 Comments|
I visited California recently, and chatted with numerous companies involved in Hadoop — Cloudera, Hortonworks, MapR, DataStax, Datameer, and more. I’ll defer further Hadoop technical discussions for now — my target to restart them is later this month — but that still leaves some other issues to discuss, namely adoption and partnering.
The total number of enterprises in the world paying subscription and license fees that they would regard as being for “Hadoop or something Hadoop-related” probably is not much over 100 right now, but I’d expect to see pretty rapid growth. Beyond that, let’s divide customers into three groups:
- Internet businesses.
- Traditional enterprises ‘ internet operations.
- Traditional enterprises’ other operations.
Hadoop vendors, in different mixes, claim to be doing well in all three segments. Even so, almost all use cases involve some kind of machine-generated data, with one exception being a credit card vendor crunching a large database of transaction details. Multiple kinds of machine-generated data come into play — web/network/mobile device logs, financial trade data, scientific/experimental data, and more. In particular, pharmaceutical research got some mentions, which makes sense, in that it’s one area of scientific research that actually enjoys fat for-profit research budgets.
|Categories: Cloudera, Hadoop, Health care, Hortonworks, Investment research and trading, Log analysis, MapR, MapReduce, Market share and customer counts, Scientific research, Web analytics||5 Comments|
Hadoop is immature technology. As such, it naturally offers much room for improvement in both industrial-strengthness and performance. And since Hadoop is booming, multiple efforts are underway to fill those gaps. For example:
- Cloudera’s proprietary code is focused on management, set-up, etc.
- The “Phase 1″ plans Hortonworks shared with me for Apache Hadoop are focused on industrial-strengthness, as are significant parts of “Phase 2″.*
- MapR tells a performance story versus generic Apache Hadoop HDFS and MapReduce. (One aspect of same is just C++ vs. Java.)
- So does Hadapt, but mainly vs. Hive.
- Cloudera also tells me there’s a potential 4-5X performance improvement in Hive coming down the pike from what amounts to an optimizer rewrite.
(Zettaset belongs in the discussion too, but made an unfortunate choice of embargo date.)
|Categories: Cloudera, Greenplum, Hadapt, Hadoop, HBase, MapR, MapReduce, Parallelization, Zettaset||20 Comments|