October 5, 2014

Streaming for Hadoop

The genesis of this post is that:

Of course, we should hardly assume that what the Hadoop distro vendors favor will be the be-all and end-all of streaming. But they are likely to at least be influential players in the area.

In the parts of the problem that Cloudera emphasizes, the main tasks that need to be addressed are:

I guess there’s also a step of receiving data out of the plumbing system. Cloudera and I glossed over that aspect when we talked, but I’ll say:

Cloudera has not yet decided whether to make Kafka part of CDH (which stands for Cloudera Distribution yada yada Hadoop). Considerations in that probably include:

I still find it bizarre that a messaging system is named after an author famous for writing about depressingly inescapable situations. Also, I wish that:

Highlights from the Storm vs. Spark Streaming vs. Samza part of my discussion with Cloudera include:

Also, Spark Streaming has a major advantage over bare Storm in whether you have to manually configure your topology, but I wasn’t clear as to how far Trident closes that particular gap.

Cloudera and I didn’t particularly talk about data-consuming technologies such as BI, predictive analytics, or analytic applications, but we did review use cases a bit. Nothing too surprising jumped out. Indeed, the discussion reminded me of a 2007 list I did of applications — other than extreme low-latency ones — for CEP (Complex Event Processing).

In general, candidate application areas for streaming-to-Hadoop match those that involve large volumes of machine-generated data.

Edit: Shortly after I posted this, Storm creator Nathan Marz put up a detailed and optimistic post about the history and state of Storm.
