January 22, 2016

Cloudera in the cloud(s)

Cloudera released Version 2 of Cloudera Director, which is a companion product to Cloudera Manager focused specifically on the cloud. This led to a discussion about — you guessed it! — Cloudera and the cloud.

Making Cloudera run in the cloud has three major aspects:

Features new in this week’s release of Cloudera Director include:

I.e., we’re talking about some pretty basic/checklist kinds of things. Cloudera Director is evidently working for Amazon AWS and Google GCP, and planned for Windows Azure, VMware and OpenStack.

As for porting, let me start by noting:

That makes sense in part because:

But while it makes sense, much of what’s hardest about the ports involves the move to object storage. The status of that is roughly:

When I asked about particularly hard parts of porting to object storage, I got three specifics. Two of them sounded like challenges around having less detailed control, specifically in the area of consistency model and capacity planning. The third I frankly didn’t understand,* which was the semantics of move operations, relating to the fact that they were constant time in HDFS, but linear in size on object stores.

*It’s rarely obvious to me why something is O(1) until it is explained to me.

Naturally, we talked about competition, differentiation, adoption and all that stuff. Highlights included:

Cloudera also offered a distinction among three types of workload:

While I don’t agree with terminology that says modeling is not analytics, the basic distinction being drawn here makes considerable sense.


6 Responses to “Cloudera in the cloud(s)”

  1. David Gruzman on January 22nd, 2016 9:58 am

    I can explain the “semantics of move operations, relating to the fact that they were constant time in HDFS, but linear in size on object stores.”
    Moving a file in a regular file system is just a change in the metadata. It not only happens in constant time, it is instant.
    As a result, software designers use it widely. For example, we write files into some directory and, when all of them are ready, we just move them to the result location, thus avoiding partial results in case of failure. Among others, Hadoop MapReduce takes this approach…
    In an object store, the whole file path, e.g. s3://bucket_name/dir1/dir2/file_name.txt, is actually one key, which determines on which server the file has to be stored. So changing anything in the name will almost always require a physical move of the file to a new server…

  2. Rohit Agarwal on January 22nd, 2016 3:06 pm

    > Cloudera’s big competitor on the Amazon platform is Elastic MapReduce (EMR). Cloudera points out that EMR lacks various capabilities that are in the Cloudera stack.

    Qubole as well. Qubole also takes real advantage of hourly pricing of cloud resources by having auto-scaling auto-terminating clusters.

  3. Curt Monash on January 22nd, 2016 7:21 pm


    Thanks. I knew I could count on you, and probably should just have asked offline. :)

    So what you’re saying is that HDFS allows a logical move without a corresponding physical one, but object stores do not?

  4. Andrew Wang on January 22nd, 2016 9:40 pm

    Regarding rename, yes, object stores often map keys to physical storage locations. Imagine a hashed scheme, where the key is hashed to determine what server has the value.

    David also identified the other big issue with object stores and rename: directory rename. In HDFS, directory rename is an atomic O(1) metadata-only operation. S3 doesn’t have directories, though users often still name their keys “/foo/bar/baz” for familiarity (and since S3 supports prefix listings). This means though that a “directory” rename in S3 involves renaming each key individually, which is O(n) metadata, and O(n) data since rename is really a copy. Apps can run into issues here since the rename is not atomic; blobs will start showing up in the destination part way through a directory rename.

    Netflix wrote a small open-source metadata layer on top of S3 called s3mper to fix the partial visibility issue for their MapReduce workflows.

  5. Kafka and more | DBMS 2 : DataBase Management System Services on January 25th, 2016 6:28 am

    […] Jay’s interests are obviously streaming-centric, this distinction maps pretty well to the three use cases Cloudera recently called […]

  6. Steve Strongin on January 29th, 2016 2:12 am

    Sounds like Cloudera is trying to ride on the coattails of what the cloud big data companies are already doing….typical ;)
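
To make the rename discussion in the comments above concrete, here is a minimal Python sketch. It is a toy model under stated assumptions, not HDFS or S3 code: `ToyTreeStore` and `ToyObjectStore` are hypothetical classes invented for illustration, and they only count the work that a "directory" rename implies in each design. The tree store re-links one directory entry; the flat store has no directories, so it copies and deletes every key under the prefix.

```python
class ToyTreeStore:
    """Hypothetical stand-in for a hierarchical namespace such as HDFS's."""

    def __init__(self):
        self.root = {}         # nested dicts are directories, bytes are files
        self.bytes_copied = 0  # file data moved by renames (stays 0 here)

    def _walk(self, parts, create=False):
        node = self.root
        for part in parts:
            node = node.setdefault(part, {}) if create else node[part]
        return node

    def put(self, path, data):
        *dirs, name = path.strip("/").split("/")
        self._walk(dirs, create=True)[name] = data

    def rename(self, src, dst):
        """Re-link one directory entry; return the number of metadata updates."""
        *src_dirs, src_name = src.strip("/").split("/")
        *dst_dirs, dst_name = dst.strip("/").split("/")
        entry = self._walk(src_dirs).pop(src_name)
        self._walk(dst_dirs, create=True)[dst_name] = entry
        return 1  # the same single update whatever the subtree holds


class ToyObjectStore:
    """Hypothetical stand-in for a flat key/value object store."""

    def __init__(self):
        self.objects = {}      # full key -> bytes; "directories" are just prefixes
        self.bytes_copied = 0  # file data moved by renames

    def put(self, key, data):
        self.objects[key] = data

    def rename_prefix(self, src, dst):
        """Copy then delete each key under src; return the number of copies."""
        copies = 0
        for key in sorted(k for k in self.objects if k.startswith(src)):
            data = self.objects[key]
            # Stands in for a full copy of the object's data to the new key.
            self.objects[dst + key[len(src):]] = data
            self.bytes_copied += len(data)
            # Not atomic: earlier destination keys are already visible here.
            del self.objects[key]
            copies += 1
        return copies


if __name__ == "__main__":
    tree, store = ToyTreeStore(), ToyObjectStore()
    for i in range(3):
        tree.put(f"tmp/job1/part-{i}", b"x" * 10)
        store.put(f"tmp/job1/part-{i}", b"x" * 10)
    print("tree store:", tree.rename("tmp/job1", "out/job1"), tree.bytes_copied)
    print("object store:", store.rename_prefix("tmp/job1/", "out/job1/"), store.bytes_copied)
```

With three 10-byte files, the tree store reports one update and zero bytes copied, while the flat store reports three copies and 30 bytes. This is the write-to-a-temporary-directory-then-rename pattern David describes, and the loop in `rename_prefix` is where the partial visibility Andrew mentions would show up.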
