- Kafka has gotten considerable attention and adoption in streaming.
- Kafka is open source, out of LinkedIn.
- Folks who built it there, led by Jay Kreps, now have a company called Confluent.
- Confluent seems to be pursuing a fairly standard open source business model around Kafka.
- Confluent seems to be in the low to mid teens in paying customers.
- Confluent believes 1000s of Kafka clusters are in production.
- Confluent reports 40 employees and $31 million raised.
At its core Kafka is very simple:
- Kafka accepts streams of data in substantially any format, and then streams the data back out, potentially in a highly parallel way.
- Any producer or consumer of data can connect to Kafka, via what can reasonably be called a publish/subscribe model.
- Kafka handles various issues of scaling, load balancing, fault tolerance and so on.
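The model above can be illustrated with a toy sketch — pure Python, no Kafka involved — of an append-only log that producers write to and consumers read from, each at its own offset. This shows only the publish/subscribe shape of the thing, not anything about Kafka's actual implementation:

```python
# Toy illustration of Kafka's core model: an append-only log,
# producers that append records, and consumers that each track
# their own read offset. A sketch of the idea, not Kafka code.

class Log:
    def __init__(self):
        self.records = []  # append-only

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # offset of the new record

    def read(self, offset, max_records=10):
        return self.records[offset:offset + max_records]

class Consumer:
    """Each consumer keeps its own offset, so many consumers can
    read the same stream independently, at different speeds."""
    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self):
        batch = self.log.read(self.offset)
        self.offset += len(batch)
        return batch

log = Log()
for event in ["pageview:alice", "click:bob", "pageview:carol"]:
    log.append(event)

slow, fast = Consumer(log), Consumer(log)
print(fast.poll())  # all three events
print(slow.poll())  # the same three events, read independently
```

The point of the sketch is that the log is the hub: producers never talk to consumers directly, and adding a new consumer costs nothing but a new offset.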
So it seems fair to say:
- Kafka offers the benefits of hub-and-spoke vs. point-to-point connectivity.
- Kafka acts like a kind of switch, in the telecom sense. (However, this is probably not a very useful metaphor in practice.)
Jay also views Kafka as something like a file system. Kafka doesn’t actually have a file-system-like interface for managing streams, but he acknowledges that as a need and presumably a roadmap item.
The most noteworthy technical point for me was that Kafka persists data, for reasons of buffering, fault tolerance and the like. The duration of the persistence is configurable, and can be different for different feeds, with two main options:
- Guaranteed to retain the last update for each key (what Kafka calls log compaction).
- Complete for the past N days (time-based retention).
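In Kafka's own terms, those two options are topic-level configs: `cleanup.policy=compact` for the keep-the-last-update behavior, and `retention.ms` for time-based retention. A sketch of setting them at topic creation time — the topic names, partition counts and broker address are placeholders, and note that newer Kafka versions take `--bootstrap-server` where older ones took `--zookeeper`:

```shell
# Compacted topic: Kafka keeps at least the latest record for
# each key ("guaranteed to have the last update of anything").
bin/kafka-topics.sh --create --bootstrap-server localhost:9092 \
  --topic user-profiles --partitions 6 --replication-factor 3 \
  --config cleanup.policy=compact

# Time-retained topic: keep everything for 7 days
# (604800000 ms), then delete old log segments.
bin/kafka-topics.sh --create --bootstrap-server localhost:9092 \
  --topic clickstream --partitions 12 --replication-factor 3 \
  --config retention.ms=604800000 --config cleanup.policy=delete
```

Because these are per-topic settings, a single cluster can mix both styles, which is what "can be different for different feeds" amounts to.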
Jay thinks this is a major difference vs. messaging systems that have come before. As you might expect, given that data arrives in timestamp order and then hangs around for a while:
- Kafka can offer strong guarantees of delivering data in the correct order (more precisely, within each partition).
- Persisted data is automagically broken up into partitions.
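The interplay of those two points can be sketched as follows: Kafka preserves order within a partition, and keyed records are routed so that all records with the same key land in the same partition, which preserves per-key order even across a parallel topic. The hash below is a deterministic stand-in I chose for illustration; Kafka's actual default partitioner uses murmur2:

```python
import zlib

NUM_PARTITIONS = 4

def partition_for(key: str) -> int:
    # Stand-in for Kafka's keyed partitioner: a deterministic hash
    # of the key, modulo the partition count. (Kafka itself hashes
    # with murmur2, but the effect is the same: one key, one partition.)
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

partitions = [[] for _ in range(NUM_PARTITIONS)]

# Events for the same key always land in the same partition, so
# per-key order survives even though the topic as a whole is parallel.
events = [("alice", "login"), ("bob", "login"),
          ("alice", "click"), ("alice", "logout")]
for key, value in events:
    partitions[partition_for(key)].append((key, value))

alice_events = [v for k, v in partitions[partition_for("alice")]
                if k == "alice"]
print(alice_events)  # ['login', 'click', 'logout'] -- original order
```

The flip side, of course, is that ordering across *different* keys in different partitions is not guaranteed, which is the usual price of parallelism.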
Technical tidbits include:
- Data is generally fresh to within 1.5 milliseconds.
- 100s of MB/sec/server is claimed. I didn’t ask how big a server was.
- LinkedIn runs >1 trillion messages/day through Kafka.
- Others in that throughput range include but are not limited to Microsoft and Netflix.
- A message is commonly 1 KB or less.
- At a guesstimate, 50%ish of messages are in Avro. JSON is another frequent format.
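As a back-of-envelope sanity check on those figures — assuming, as my own round numbers, exactly 1 KB per message and 200 MB/sec/server as a stand-in for "100s of MB/sec/server":

```python
# Back-of-envelope arithmetic on the quoted throughput figures.
# Assumptions (mine, not Confluent's): 1 KB/message, 200 MB/sec/server.
messages_per_day = 1_000_000_000_000   # ">1 trillion messages/day"
bytes_per_message = 1_000              # "commonly 1 KB or less"
seconds_per_day = 86_400

msgs_per_sec = messages_per_day / seconds_per_day
bytes_per_sec = msgs_per_sec * bytes_per_message

print(f"{msgs_per_sec:,.0f} messages/sec")          # ~11.6 million
print(f"{bytes_per_sec / 1e9:.1f} GB/sec overall")  # ~11.6 GB/sec

server_throughput = 200e6  # bytes/sec
print(f"~{bytes_per_sec / server_throughput:.0f} servers at that rate")
```

In other words, a trillion 1 KB messages a day is roughly a petabyte a day, which at those per-server rates would take on the order of 60 servers — large but hardly outlandish for LinkedIn.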
Jay’s answer to any concern about performance overhead for current or future features is usually to point out that anything other than the most basic functionality:
- Runs in different processes from core Kafka …
- … if it doesn’t actually run on a different cluster.
For example, connectors have their own pools of processes.
I asked the natural open source question about who contributes what to the Apache Kafka project. Jay’s quick answers were:
- Perhaps 80% of Kafka code comes from Confluent.
- LinkedIn has contributed most of the rest.
- However, as is typical in open source, the general community has contributed some connectors.
- The general community also contributes “esoteric” bug fixes, which Jay regards as evidence that Kafka is in demanding production use.
Jay has a rather erudite and wry approach to naming and so on.
- Kafka got its name because it was replacing something he regarded as Kafkaesque. OK.
- Samza is an associated project that has something to do with transformations. Good name. (The central character of The Metamorphosis was Gregor Samsa, and the opening sentence of the story mentions a transformation.)
- In his short book about logs, Jay has a picture caption “ETL in Ancient Greece. Not much has changed.” The picture appears to be of Sisyphus. I love it.
- I still don’t know why he named a key-value store Voldemort. Perhaps it was something not to be spoken of.
What he and his team do not yet have is a clear name for their product category. Difficulties in naming include:
- Kafka is limited and simple. But of course Confluent has plans to broaden its capabilities.
- It’s long been hard to decide whether to talk about “events”, “streams” or both.
- “Streaming” has another tech meaning, in the context of video, songs, etc.
- One candidate name, “event hub”, has already been grabbed by IBM and Microsoft for their specific offerings.
- Naming is always hard in general.
Confluent seems to be using “stream data platform” as a placeholder. As per the link above, I once suggested Data Stream Management System, or more concisely Datastream Manager. “Event”, “event stream” or “event series” could perhaps be mixed in as well. I don’t really have an opinion yet, and probably won’t until I’ve studied the space in a little more detail.
And on that note, I’ll end this post for reasons of length, and discuss Kafka-related technology separately.
- My October, 2014 post on Streaming for Hadoop is a sort of predecessor to this two-post series.