Cloudera released Version 2 of Cloudera Director, which is a companion product to Cloudera Manager focused specifically on the cloud. This led to a discussion about — you guessed it! — Cloudera and the cloud.
Making Cloudera run in the cloud has three major aspects:
- Cloudera’s usual software, ported to run on the cloud platform(s).
- Cloudera Director, which, for example, launches cloud instances.
- Points of integration, e.g. taking information about security-oriented roles from the platform and feeding them to the role-based security that is specific to Cloudera Enterprise.
Features new in this week’s release of Cloudera Director include:
- An API for job submission.
- Support for spot and preemptible instances.
- High availability.
- Some cluster repair.
- Some cluster cloning.
I.e., we’re talking about some pretty basic/checklist kinds of things. Cloudera Director is evidently working for Amazon AWS and Google GCP, and planned for Microsoft Azure, VMware and OpenStack.
As for porting, let me start by noting:
- Shared-nothing analytic systems, RDBMS and Hadoop alike, run much better in the cloud than they used to.
- Even so, it seems that the future of Hadoop in the cloud is to rely on object storage, such as Amazon S3.
That makes sense in part because:
- The applications where shared nothing most drastically outshines object storage are probably the ones in which data can just be filtered from disk — spinning-rust or solid-state as the case may be — and processed in place.
- By way of contrast, if data is being redistributed a lot then the shared nothing benefit applies to a much smaller fraction of the overall workload.
- The latter group of apps are probably the harder ones to optimize for.
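The trade-off above can be sketched with a toy cost model (all the numbers below are illustrative assumptions, not benchmarks): fast local shared-nothing storage mainly speeds up the scan-and-filter phase, so the more of a job’s time goes to redistributing data over the network, the smaller the relative benefit.

```python
# Toy cost model (illustrative figures, not benchmarks). A job scans
# `data_gb`, keeps `selectivity` of it after filtering, and optionally
# shuffles the surviving data across the cluster before processing it.

def job_time_s(data_gb, selectivity, shuffle, scan_gb_per_s, net_gb_per_s=1.0):
    scan = data_gb / scan_gb_per_s                  # read + filter from storage
    redistribute = (data_gb * selectivity) / net_gb_per_s if shuffle else 0.0
    return scan + redistribute

# Filter-and-process-in-place job: local disks (fast scan) dominate the win.
local_filter  = job_time_s(1000, 0.01, shuffle=False, scan_gb_per_s=10)
object_filter = job_time_s(1000, 0.01, shuffle=False, scan_gb_per_s=2)

# Shuffle-heavy job: the scan advantage is a small share of total runtime.
local_shuffle  = job_time_s(1000, 0.9, shuffle=True, scan_gb_per_s=10)
object_shuffle = job_time_s(1000, 0.9, shuffle=True, scan_gb_per_s=2)

print(object_filter / local_filter)    # 5.0 -- full 5x scan speedup carries over
print(object_shuffle / local_shuffle)  # 1.4 -- much smaller relative speedup
```

Under these (made-up) rates, the scan-bound job keeps the full speedup of local storage, while the shuffle-bound job sees only a modest one, which is the sense in which shuffle-heavy apps are the harder ones to optimize for.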
But while it makes sense, much of what’s hardest about the ports involves the move to object storage. The status of that is roughly:
- Cloudera already has a lot of its software running on Amazon S3, with Impala/Parquet in beta.
- Object storage integration for Microsoft Azure is “in progress”.
- Object storage integration for Google GCP is “to be determined”.
- Security for object storage — e.g. encryption — is a work in progress.
- Cloudera Navigator for object storage is a roadmap item.
When I asked about particularly hard parts of porting to object storage, I got three specifics. Two of them sounded like challenges around having less detailed control, specifically in the areas of consistency model and capacity planning. The third, which I frankly didn’t understand,* was the semantics of move operations: they are constant-time in HDFS, but linear in object size on object stores.
*It’s rarely obvious to me why something is O(1) until it is explained to me.
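For what it’s worth, the gist seems to be: an HDFS rename is a single metadata update on the namenode, re-pointing a path at the same underlying blocks, while object stores typically have no rename primitive at all, so a “move” is emulated as copy-all-the-bytes-then-delete. Here is a minimal sketch of that difference; the two classes are hypothetical stand-ins for illustration, not real HDFS or S3 APIs.

```python
# Hypothetical stand-ins for illustration -- not real HDFS or S3 APIs.

class HdfsLikeNamespace:
    """Rename re-points a path at the same data: O(1), size-independent."""
    def __init__(self):
        self.entries = {}            # path -> reference to the data blocks

    def put(self, path, data):
        self.entries[path] = data

    def rename(self, src, dst):
        self.entries[dst] = self.entries.pop(src)   # one metadata update

class ObjectStoreLike:
    """No rename primitive: a 'move' is copy the bytes, then delete."""
    def __init__(self):
        self.objects = {}
        self.bytes_copied = 0        # track the work a 'move' really does

    def put(self, key, data):
        self.objects[key] = data

    def move(self, src, dst):
        data = self.objects[src]
        self.bytes_copied += len(data)   # copy cost is linear in object size
        self.objects[dst] = data
        del self.objects[src]

store = ObjectStoreLike()
store.put("tmp/part-0", b"x" * 1_000_000)
store.move("tmp/part-0", "final/part-0")
print(store.bytes_copied)   # 1000000 -- grows with the object, unlike HDFS
```

This presumably matters because Hadoop jobs conventionally commit output by renaming a temporary directory into place, which is cheap and atomic on HDFS but expensive when every “rename” moves all the bytes.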
Naturally, we talked about competition, differentiation, adoption and all that stuff. Highlights included:
- In general, Cloudera’s three big marketing messages these days can be summarized as “Fast”, “Easy”, and “Secure”.
- Notwithstanding the differences as to which parts of the Cloudera stack run on premises, on Amazon AWS, on Microsoft Azure or on Google GCP, Cloudera thinks it’s important that its offering is the “same” on all platforms, which allows “hybrid” deployment.
- In general, Cloudera still sees Hortonworks as a much bigger competitor than MapR or IBM.
- Cloudera fondly believes that Cloudera Manager is a significant competitive advantage vs. Ambari. (This would presumably be part of the “Easy” claim.)
- In particular, Cloudera asserts it has better troubleshooting/monitoring than the cloud alternatives do, because of superior drilldown into details.
- Cloudera’s big competitor on the Amazon platform is Elastic MapReduce (EMR). Cloudera points out that EMR lacks various capabilities that are in the Cloudera stack. Of course, versions of these capabilities are sometimes found in other Amazon offerings, such as Redshift.
- Cloudera’s big competitor on Azure is HDInsight. Cloudera sells against that via:
  - General Cloudera vs. Hortonworks distinctions.
Cloudera also offered a distinction among three types of workload:
- ETL (Extract/Transform/Load) and “modeling” (by which Cloudera seems to mean predictive modeling).
  - Cloudera pitches this as batch work.
  - Cloudera tries to deposition competitors as being good mainly at these kinds of jobs.
  - This can be reasonably said to be the original sweet spot of Hadoop and MapReduce — which fits with Cloudera’s attempt to portray competitors as technical laggards.
  - Cloudera observes that these workloads tend to call for “transient” jobs. Lazier marketers might trot out the word “elasticity”.
- BI (Business Intelligence) and “analytics”, by which Cloudera seems to mainly mean Impala and Spark.
- “Application delivery”, by which Cloudera means operational stuff that can’t be allowed to go down. Presumably, this is a rough match to what I — and by now a lot of other folks as well — call short-request processing.
While I don’t agree with terminology that says modeling is not analytics, the basic distinctions being drawn here make considerable sense.