October 16, 2014

Cloudera’s announcements this week

This week being Hadoop World, Cloudera naturally put out a flurry of press releases. In anticipation, I put out a context-setting post last weekend. That said, the gist of the news seems to be:

Notes on Cloudera Director start:

What I have not heard is any answer for the traditional performance challenge of Hadoop-in-the-cloud, which is:

Maybe that problem isn’t — or is no longer — as big a deal as I’ve been told.


15 Responses to “Cloudera’s announcements this week”

  1. Robert Hodges on October 16th, 2014 11:33 pm

    Seems as if a number of storage vendors are also trying to decouple Hadoop storage from processing. It’s an interesting collision of processing models.

  2. David Gruzman on October 18th, 2014 6:20 am

    I think the importance of data locality depends on engine performance. Hive will process, let’s say, 5-10 MB per second per core in a simple query (like a WHERE clause with high selectivity). I can assume that Amazon S3 can provide a few dozen megabytes per second per server. So we didn’t lose too much.
    For the same query Impala will do at least 100 MB per second per core. I doubt that S3 will feed a 12-core server with 1.2 GB per second. More than that, even a 10-gigabit network will not be enough.

  3. Robert Hodges on October 19th, 2014 8:07 pm

    @David Gruzman Agree. It is also hard to see how the decoupling really scales over time as analytic tools improve in speed. The S3 case you describe can lead to pathological performance bottlenecks on large data sets depending on how S3 distributes data.

  4. John on October 19th, 2014 8:09 pm


    On Amazon, you would be surprised at the scan throughput of Amazon Redshift. It can easily go to 2.5 GB/sec/node, and no, it doesn’t use S3 storage for scanning the data. Hadoop is a different animal compared to an analytical DBMS. If you have a Hive-like or RDBMS-like workload, you should stick to an analytical RDBMS and you will be happy with the performance.

    Hope this helps.

  5. David Gruzman on October 20th, 2014 1:56 am

    It is exactly my point: Redshift gains its amazing speed because of local storage usage. You could not dream about 2.5 GB/sec from S3…

  6. John on October 20th, 2014 9:46 am

    Agreed. I think even with local storage, Hadoop falls short in performance because of several other factors. The benchmarks run by the likes of Impala are anything but real-life scenarios. Looking deeper into HDFS components like block size, data co-location, etc. makes it look much harder for Hadoop to solve interactive analytics. There is a clear use case for Hadoop and the ecosystem, but lots of folks are being led to believe that Hadoop is here to replace storage, processing, and analytical DBMSs as well.

  7. Matt Brandwein on October 20th, 2014 3:04 pm

    Re: “I haven’t heard a peep about any kind of Cloudera/Microsoft/Windows collaboration.”

    Good news! http://j.mp/1uuw2jD

    We’re pleased to announce that Microsoft Azure is now a preferred and certified Cloudera cloud platform. Mike Olson just appeared on stage with Satya Nadella to demonstrate Cloudera in the Azure marketplace and discuss the benefits of the partnership.

  8. Kris on October 21st, 2014 4:01 am

    @John: I agree that ‘traditional’ Hadoop is not suitable for interactive analytics. But technologies like HDFS caching, Parquet, and heavy reliance on large amounts of memory make Impala an actual viable competitor in that area.

    According to our benchmarks, it was at least in the same ballpark, performance-wise, as Vertica.

    We’re doing some deeper tests on Impala now, and what bothers us the most is its over-reliance on RAM. Some heavy in-memory queries can crash the entire Impala server.

    Note that Hive 0.14 also promises serious performance improvements, and even CRUD. We will be looking at that next.

  9. Kris on October 21st, 2014 4:12 am

    What concerns me more and more about Cloudera is their tendency to move to a closed-source walled garden.

    Several clients I talked to are looking for alternatives. With Hive going through a lot of development lately, and technologies like Spark around the corner, the value proposition of Cloudera is decreasing.

    Furthermore, they have a per-node license. This hampers the one thing that Hadoop is best known for: scalability. Your 30K license of last year becomes a 300K license today.

    I’m still a big fan. I like CDH and I like Impala. But I feel more and more, Cloudera is best suited for big corporations with deep pockets.

  10. John on October 21st, 2014 9:17 am

    Kris, there are workloads where one can use Impala or something similar to solve a few problems. In the end, once you look at all the features one may want in Impala (caching, Parquet, etc., and in future WLM, full update capabilities, etc.) to be an alternative to the likes of Vertica or ParAccel or Redshift, it does start to look like a DBMS running on a distributed file system, aka HDFS. The only difference is which FS you are running it on. The cost play is even more interesting: nothing beats Amazon’s offerings for “Big data”, which include Redshift.

  11. Kris on October 22nd, 2014 4:50 am

    Agreed that Impala is simply a DBMS on HDFS. It does integrate with YARN, so it plays nicely in the Hadoop ecosystem.

    About pricing: how can Redshift be cheaper than free? Surely, installing Impala on AWS is cheaper than those same AWS nodes + Redshift costs?

    I think Amazon also offers Impala as a service. Not sure of the pricing there.

  12. Matt Brandwein on October 22nd, 2014 2:57 pm

    @Kris, good comments, a couple things to highlight:
    * Impala 2.0 removes dependencies on RAM; would be curious to hear about your experiences with the latest release.
    * Cloudera also offers capacity (i.e. /TB) pricing.
    * Amazon EMR does indeed offer Impala support:

  13. Kris on October 22nd, 2014 7:10 pm

    @matt thanks for your reply.

    My experience so far is, indeed, that Impala 2 spills to disk when needed. But some queries won’t execute anymore. I’m testing all TPC-DS queries, and about 54 of them succeed on in-memory data. If the data won’t fit in memory any more, far fewer queries succeed. Will publish first results and code soon. Would be happy to talk them through with you guys to validate way of working and setup.

    Crashing the entire server is not an issue anymore since I correctly configured the memory limits. My bad.

  14. John on October 22nd, 2014 11:13 pm

    Kris, Redshift doesn’t price HW and SW separately. Some of my customers using Redshift with 1 to 2 yrs commitment are finding that price performance of Redshift, especially against Impala, is one fourth. There is nothing FREE. Impala doesn’t really do expansive analytics and is missing many integrations and features for a complete analytical solution. Full disclosure: I am an independent consultant in Big data technologies and have helped some enterprise customers evaluate Impala, Redshift and some other big data technologies, hence I shared my experience. I am sure Impala will improve over time, but I am not convinced YET that it can match the price performance of systems like Redshift for workloads which are beyond TPC-DS or TPC-H.

    YARN is definitely promising, but does it really help in resource allocation within an Impala workload? I think the answer to that is no as of today!!!


  15. Cloudera in the cloud(s) | DBMS 2 : DataBase Management System Services on January 22nd, 2016 2:46 am

    […] Cloudera Director, which for example launches cloud instances. […]
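
The back-of-the-envelope throughput figures in comment 2 can be checked with a short sketch. Everything here is illustrative: the per-core scan rates and the 12-core server are the commenter's 2014 estimates, the function names are made up for this example, and the network figure is raw line rate before any protocol overhead.

```python
# Rough check of the data-locality arithmetic from comment 2.
# All rates are the commenter's estimates, not measurements.

def required_gb_per_s(cores: int, mb_per_s_per_core: float) -> float:
    """Aggregate scan rate a server needs to keep every core busy (GB/s)."""
    return cores * mb_per_s_per_core / 1000.0

def link_gb_per_s(gigabits_per_s: float) -> float:
    """Raw line rate of a network link in GB/s (8 bits per byte, no overhead)."""
    return gigabits_per_s / 8.0

CORES = 12  # server size assumed in the comment

hive_low = required_gb_per_s(CORES, 5)     # Hive, low estimate
hive_high = required_gb_per_s(CORES, 10)   # Hive, high estimate
impala = required_gb_per_s(CORES, 100)     # Impala, "at least" estimate
ten_gbe = link_gb_per_s(10)                # 10-gigabit network, line rate

print(f"Hive needs   {hive_low:.2f}-{hive_high:.2f} GB/s per server")
print(f"Impala needs {impala:.2f} GB/s per server or more")
print(f"10 GbE gives {ten_gbe:.2f} GB/s at best")
```

Under these assumptions Hive needs on the order of 0.06 to 0.12 GB/s per server, which remote storage might plausibly supply, while Impala's 1.2 GB/s floor sits just under the 1.25 GB/s line rate of a 10-gigabit link. Since 100 MB/s per core was given as a minimum and real links lose some capacity to overhead, the commenter's conclusion that the network becomes the bottleneck follows.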
