The genesis of this post is that:
- Hortonworks is trying to revitalize the Apache Storm project, after Storm lost momentum; indeed, Hortonworks is referring to Storm as a component of Hadoop.
- Cloudera is talking up what I would call its human real-time strategy, which includes but is not limited to Flume, Kafka, and Spark Streaming. Cloudera also sees a few use cases for Storm.
- This all fits with my view that the Current Hot Subject is human real-time data freshness — for analytics, of course, since we’ve always had low latencies in short-request processing.
- This also all fits with the importance I place on log analysis.
- Cloudera reached out to talk to me about all this.
Of course, we should hardly assume that what the Hadoop distro vendors favor will be the be-all and end-all of streaming. But they are likely to at least be influential players in the area.
In the parts of the problem that Cloudera emphasizes, the main tasks that need to be addressed are:
- Getting data into the plumbing from whatever systems it’s being generated in. This is the province of Flume, one of Cloudera’s earliest projects. I’d add that this is also one of the core competencies of Splunk.
- Getting data where it needs to go. Flume can do this. Kafka, a publish/subscribe messaging system, can do it in a more general way, because streams are sent to a Kafka broker, which then re-streams them to their ultimate destination.
- Processing data in flight. Storm can do this. Spark Streaming can do it more easily. Spark Streaming is or soon will be a part of every serious Hadoop distribution. Flume can do some lightweight processing as well.
- Serving up data for further query. Cloudera would like you to do this via HBase or Impala. But Oracle is a fine choice too, and indeed a popular choice among Cloudera customers.
I guess there’s also a step of receiving data out of the plumbing system. Cloudera and I glossed over that aspect when we talked, but I’ll say:
- Spark commonly lives over HDFS (Hadoop Distributed File System).
- Flume feeds HDFS. Flume was also hacked years ago — rah-rah open source! — to feed Kafka instead, and also to be fed by it.
Cloudera has not yet decided whether to make Kafka part of CDH (which stands for Cloudera Distribution yada yada Hadoop). Considerations in that probably include:
- Kafka has impressive adoption among high-profile internet companies, but not so much among conventional enterprises.
- Surely not coincidentally, Kafka is missing features in areas such as security (e.g. it lacks Kerberos integration).
- Kafka lacks cool capabilities to let you configure rather than code, although Cloudera thinks that in some cases you can work around this problem by marrying Kafka and Flume.
I still find it bizarre that a messaging system be named after an author famous for writing about depressingly inescapable situations. Also, I wish that:
- Kafka had something to do with transformations.
- The name Kafka had been used by a commercial software company, which could offer product trials.
Highlights from the Storm vs. Spark Streaming vs. Samza part of my discussion with Cloudera include:
- Storm has a companion project Trident that makes Storm somewhat easier to program and/or configure. But Trident only has some of the usability advantages of Spark Streaming.
- Cloudera sees no advantages to Samza, a Kafka companion project, when compared with whichever of Spark Streaming or Storm + Trident is better suited to a particular use case.
- Cloudera likes the rich set of primitives that Spark Streaming inherits from Spark. Cloudera also notes that, if you learn to program over Spark for any reason, then you will in particular have learned how to program over Spark Streaming.
- Spark Streaming lets you join Spark Streaming data to other data that Spark can get access to. I agree with Cloudera that this is an important advantage.
- Cloudera sees Storm’s main advantages as being in latency. If you need 10-200 millisecond latency, Storm can give you that today while Spark Streaming can’t. However, Cloudera notes that to write efficiently to your persistent store — which Cloudera fondly hopes but does not insist will be HBase or Impala — you may need to micro-batch your writes anyway.
Also, Spark Streaming has a major advantage over bare Storm in whether you have to manually configure your topology, but I wasn’t clear as to how far Trident closes that particular gap.
Cloudera and I didn’t particularly talk about data-consuming technologies such as BI, predictive analytics, or analytic applications, but we did review use cases a bit. Nothing too surprising jumped out. Indeed, the discussion reminded me of a 2007 list I did of applications — other than extreme low-latency ones — for CEP (Complex Event Processing).
- Top-of-mind were things that fit into one or more of the buckets “internet”, “retail”, “recommendation/personalization”, “security” or “anti-fraud”.
- Transportation/logistics got mentioned, to which I replied that the CEP vendors had all seemed to have one trucking/logistics client each.
- At least in theory, there are potentially huge future applications in health care.
In general, candidate application areas for streaming-to-Hadoop match those that involve large volumes of machine-generated data.
Edit: Shortly after I posted this, Storm creator Nathan Marz put up a detailed and optimistic post about the history and state of Storm.