I talked with Cloudera a couple of weeks ago in connection with the impending release of Cloudera Enterprise. I’d say:
- If you are or want to be a serious MapReduce user – and you’re past the “play around over the weekend” stage – you probably should have either:
- A serious non-DBMS MapReduce distribution.
- MapReduce integrated into your analytic DBMS.
- The obvious choice for non-DBMS MapReduce is Hadoop.
- The obvious choice for a Hadoop distribution is Cloudera Enterprise.
- Cloudera Enterprise has three main aspects, in an inseparable bundle:
- Distributions for a double-digit number of open source projects. It’s nice having all that in one package – unless, of course, you like playing with Tinkertoys.
- Proprietary Cloudera code.
- Cloudera support.
- Cloudera says its proprietary code is – and is planned to remain – concentrated, at least in large part, on integrating open source technology with closed-source products. This has the virtue of being targeted directly at the segment of the market that has proven it is actually willing to pay money for software.
- Cloudera Enterprise areas of focus, now and in the presumed future, include:
- Core Hadoop engine, which Cloudera says is quite predictably and appropriately evolving more slowly than the tools around it.
- Development, management and administrative tools, including:
- Pig and Hive. Cloudera says that more than 70% of Facebook’s Hadoop jobs are initiated through Hive, and that the same is true of Yahoo and Pig.
- Connectivity to commercial tools.
- The product formerly known as “Cloudera Desktop.”
- Workflow, which in this context refers to letting you create a Hadoop application as a sequence of small steps, rather than forcing you to kluge it into being one unwieldy thing. At the moment, this is much less widely adopted than Pig and Hive, but Cloudera has high hopes for it, because of its obvious benefits in modularity and manageability.
- Quasi-DBMS technology. Besides Hive and Pig, this includes HBase. Cloudera says there has been considerable demand for HBase, and it is pleased that the project is now mature enough to ship. Cloudera stresses that it intends HBase not for OLTP, but as an adjunct to analytic processing. E.g., Cloudera suggests HBase would be a fine vehicle for replicating dimension tables across each node of a cluster.
- Data connectivity, e.g. to MySQL or to sensor log files.
- Cloudera Enterprise pricing is well below DBMS prices – not by a full order of magnitude, if I’m right about everybody’s quantity discount policies, but by a lot even so. Details are NDA.
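The dimension-table point above is easiest to see with a toy example. Here is a minimal Python sketch – not Cloudera’s code, and with invented names and data – of the map-side join pattern that replicating a dimension table enables: each mapper holds the small table locally (as HBase could hold it on every node) and enriches fact records without any reduce-side shuffle.

```python
# Toy map-side join: a small dimension table replicated to every node
# lets fact rows be enriched locally, with no shuffle of the big table.
# (Illustrative only; in a real cluster the dimension table would live
# in HBase or the distributed cache, not in a Python dict.)

DIM_PRODUCTS = {          # stand-in for a replicated dimension table
    "p1": "books",
    "p2": "music",
}

def map_enrich(fact_row):
    """One mapper's work: join a (product_id, amount) fact row against
    the local copy of the dimension table."""
    product_id, amount = fact_row
    category = DIM_PRODUCTS.get(product_id, "unknown")
    return (category, amount)

facts = [("p1", 10.0), ("p2", 5.0), ("p1", 2.5)]
enriched = [map_enrich(f) for f in facts]
# enriched == [("books", 10.0), ("music", 5.0), ("books", 2.5)]
```

The design point is that only the small table is copied everywhere; the large fact stream never moves across the network for the join.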
Cloudera sometimes sends confusing signals about its beliefs and strategies. For example, one can get different stories depending on whether one talks to:
- Somebody at Cloudera who comes primarily from the user and open source communities.
- Somebody at Cloudera who has actually worked at a software company before.
But I predict that Cloudera will now stick for a while with more or less the strategy outlined above.
Naturally, we also talked about Hadoop adoption. Highlights of that part – no doubt somewhat biased towards Cloudera’s own customer base – included:
- Notwithstanding eBay’s prior skepticism about MapReduce, it is quoted as saying nice things in a Cloudera press release, and has apparently become quite a large Hadoop user, starting out with a search-quality use case.
- Typical Hadoop deployment sizes are 10 nodes or so when experimenting, 80-500+ in production.
- 10 terabytes/node – I’m pretty sure Cloudera meant of user data – is not inconceivable, so a cost-conscious 500-node user could have 5 petabytes of data managed by Hadoop.
- Cloudera has half a dozen customers at the 75+ node production level.
- Web and financial services are the two vertical markets moving most aggressively into Hadoop production. The government is also in significant Hadoop production, but the details of that are classified.
- Web uses for Hadoop include:
- Clickstream – sessionization, etc. – that’s a super-mainstream use.
- Search – analyzing search attempts in conjunction with structured data.
- Machine learning (for ad serving, etc.).
- Financial services uses for Hadoop include:
- Internal trading rule enforcement/fraud detection.
- Complex ETL.
- Portfolio risk assessment (typically overnight).
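For the curious, clickstream sessionization – the “super-mainstream” web use above – reduces to a simple idea: group a user’s events, sort them by time, and cut a new session whenever the gap between clicks exceeds a timeout. A hedged, non-Hadoop Python sketch, with invented data and a 30-minute timeout chosen purely for illustration:

```python
from itertools import groupby

SESSION_TIMEOUT = 30 * 60  # a 30-minute gap ends a session (illustrative choice)

def sessionize(events):
    """events: iterable of (user_id, timestamp_in_seconds) pairs.
    Returns {user_id: [[timestamps of session 1], [session 2], ...]}.
    In Hadoop terms this is the classic reduce step: events arrive
    grouped by user (the shuffle key), then get split on time gaps."""
    sessions = {}
    for user, rows in groupby(sorted(events), key=lambda e: e[0]):
        times = [t for _, t in rows]      # already sorted by (user, time)
        user_sessions = [[times[0]]]
        for prev, cur in zip(times, times[1:]):
            if cur - prev > SESSION_TIMEOUT:
                user_sessions.append([cur])   # gap too big: start a new session
            else:
                user_sessions[-1].append(cur)
        sessions[user] = user_sessions
    return sessions

clicks = [("u1", 0), ("u1", 600), ("u1", 5000), ("u2", 100)]
result = sessionize(clicks)
# u1's click at 5000s comes 4400s after 600s – past the timeout – so it
# opens a new session: {"u1": [[0, 600], [5000]], "u2": [[100]]}
```

At web scale the same logic runs as a reducer per user, which is why clickstream work maps so naturally onto Hadoop.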
None of this is inconsistent with previous surveys of Hadoop use cases.
Various users talked at the Hadoop Summit this week. I wasn’t there, and won’t write about their stories for now. That said, Twitter’s slide deck from the same event has some interesting stuff, including:
- 7 TB/day ETLed from MySQL.
- Petabyte-scale storage accordingly coming soon.
- Open sourcing their ETL tool Crane.
- 3-4X LZO compression at little CPU cost.
- HBase is more usable for them than HDFS, which isn’t mutable enough.
- Pig = 5% of the code and coding effort of vanilla Hadoop, at a performance hit of 30% or less.