January 25, 2016

Kafka and Confluent

For starters:

At its core Kafka is very simple:

So it seems fair to say:

Jay also views Kafka as something like a file system. Kafka doesn’t actually have a file-system-like interface for managing streams, but he acknowledges that as a need and presumably a roadmap item.

The most noteworthy technical point for me was that Kafka persists data, for reasons of buffering, fault-tolerance and the like. The duration of the persistence is configurable, and can be different for different feeds, with two main options:
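Whatever the specific options were, the core idea of an append-only log whose records expire after a configurable, per-feed period can be sketched. A toy Python illustration, hypothetical and not Kafka code:

```python
import time

class RetainedLog:
    """Toy append-only log with time-based retention (not Kafka code)."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self.records = []  # list of (timestamp, payload)

    def append(self, payload, now=None):
        now = time.time() if now is None else now
        self.records.append((now, payload))

    def prune(self, now=None):
        """Drop records older than the retention window."""
        now = time.time() if now is None else now
        cutoff = now - self.retention_seconds
        self.records = [(ts, p) for ts, p in self.records if ts >= cutoff]

# Different feeds can be given different retention windows.
clicks = RetainedLog(retention_seconds=60)     # keep 1 minute
audit = RetainedLog(retention_seconds=86400)   # keep 1 day
```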

Jay thinks this is a major difference vs. messaging systems that have come before. As you might expect, given that data arrives in timestamp order and then hangs around for a while:
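One consequence of ordered, persisted data is that consumers can re-read from any position. A minimal sketch of offset-based replay, hypothetical and not the Kafka API:

```python
class Topic:
    """Toy ordered log; consumers track their own read offsets (not Kafka code)."""

    def __init__(self):
        self.log = []

    def append(self, record):
        self.log.append(record)
        return len(self.log) - 1  # the record's offset

    def read_from(self, offset):
        """Replay everything from a given offset onward."""
        return self.log[offset:]

t = Topic()
for r in ["a", "b", "c"]:
    t.append(r)
```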

Technical tidbits include:

Jay’s answer to any concern about performance overhead for current or future features is usually to point out that anything other than the most basic functionality:

For example, connectors have their own pools of processes.

I asked the natural open source question about who contributes what to the Apache Kafka project. Jay’s quick answers were:

Jay has a rather erudite and wry approach to naming and so on.

What he and his team do not yet have is a clear name for their product category. Difficulties in naming include:

Confluent seems to be using “stream data platform” as a placeholder. As per the link above, I once suggested Data Stream Management System, or more concisely Datastream Manager. “Event”, “event stream” or “event series” could perhaps be mixed in as well. I don’t really have an opinion yet, and probably won’t until I’ve studied the space in a little more detail.

And on that note, I’ll end this post for reasons of length, and discuss Kafka-related technology separately.

Related links


10 Responses to “Kafka and Confluent”

  1. Kafka and more | DBMS 2 : DataBase Management System Services on January 25th, 2016 6:28 am

    […] a companion introduction to Kafka post, I observed that Kafka at its core is remarkably simple. Confluent offers a marchitecture […]

  2. Mark Callaghan on January 25th, 2016 11:24 am

    If data is fresh within 1.5 milliseconds and this uses disk, then the data must be visible before it is durable on 1+ hosts. An fsync takes longer than that unless you have HW RAID cards or fast storage.

  3. Gwen Shapira on January 25th, 2016 12:59 pm

    You are right. In Kafka, data is considered “committed”, and therefore visible, when it is acknowledged by the leader and all in-sync brokers (administrators can control the minimum number of these brokers). “Acknowledged” means “in memory”, but not necessarily on disk.

    The idea is that if the minimum number of in-sync brokers is 3, data that was visible to clients is unlikely to be lost, because all 3 would have to crash.

    Hope this clarifies?
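Gwen's rule can be sketched in code: a record counts as committed once at least the configured minimum of in-sync replicas, leader included, have it in memory; no fsync is involved. A hypothetical Python sketch, not Kafka internals:

```python
def is_committed(in_memory_acks, min_insync_replicas):
    """A record is committed, and hence visible to consumers, once enough
    replicas (leader included) have acknowledged it in memory.
    Durability on disk is not required for the acknowledgment."""
    return in_memory_acks >= min_insync_replicas
```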

  4. Patrick McFadin on January 25th, 2016 1:44 pm

    Just to add to what Gwen has said, this is the case with any distributed system that promises durability. Well-built systems will always account for flaky parts. Disks lie. Networks introduce latency. Servers fail. Replicating with the “Rule of 3” is always a good plan.

  5. Aaron on January 25th, 2016 5:15 pm

    Some notes:

    – Twitter uses Kafka (see https://blog.twitter.com/2015/handling-five-billion-sessions-a-day-in-real-time), as do most high-volume media companies.
    – Kafka is a highly scalable pub/sub messaging layer. The use of the word ‘event’ is market speak until Kafka can do more than feed messages from source to target(s). Compare to message brokers, which provide the ability to process data, whereas Kafka simply feeds consumers at high volumes. For a sense of what event processing is, compare to CEP systems, which allow ETL-like workflows against streaming data and so can correlate events.
    – Storm, Samza and Spark Streaming (and perhaps Flink) provide those ETL-like capabilities, but they need a resilient message broker like Kafka underneath.
    – Scale is the killer value-add, Kafka being 2 orders of magnitude more scalable than whatever is usually in place. Other FOSS and proprietary message brokers cannot process at these volumes, forcing developers to hack around the limits. Kafka is so scalable that you can usually feed any shareable data through it.

  6. Curt Monash on January 25th, 2016 9:00 pm

    I’m with Gwen and Patrick on this. The modern default is to declare victory when sufficiently many servers have what they need in RAM, without waiting for any of them to complete a write to persistent storage.

  7. me on January 26th, 2016 7:00 pm

    Kafka does support log.flush.interval.messages for per-message flushing (when set to 1), but as the docs say it is not as efficient as relying on replication. It can, however, be turned on.

    From the docs:
    …We generally feel that the guarantees provided by replication are stronger than sync to local disk, however the paranoid still may prefer having both and application level fsync policies are still supported….
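The flush setting the commenter mentions can be illustrated with a writer that fsyncs every N appended messages. A hypothetical Python sketch mimicking a log.flush.interval.messages-style policy, not Kafka's actual log implementation:

```python
import os
import tempfile

class FlushingWriter:
    """Toy log writer that fsyncs after every `flush_interval_messages`
    appends, mimicking a log.flush.interval.messages-style policy
    (not Kafka code)."""

    def __init__(self, path, flush_interval_messages):
        self.f = open(path, "ab")
        self.flush_interval_messages = flush_interval_messages
        self.unflushed = 0
        self.fsyncs = 0

    def append(self, data: bytes):
        self.f.write(data + b"\n")
        self.unflushed += 1
        if self.unflushed >= self.flush_interval_messages:
            self.f.flush()
            os.fsync(self.f.fileno())  # force durability to disk
            self.fsyncs += 1
            self.unflushed = 0

path = os.path.join(tempfile.mkdtemp(), "segment.log")
w = FlushingWriter(path, flush_interval_messages=2)
for m in [b"m1", b"m2", b"m3"]:
    w.append(m)
# After three appends with an interval of 2, one fsync has happened
# and one message is still waiting in the OS buffer.
```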

  8. Big Analytics Roundup (February 1, 2016) | The Big Analytics Blog on February 1st, 2016 10:32 am

    […] DBMS2, Curt Monash covers Kafka and Confluent, which likely means that Confluent has hired Curt […]

  9. Kevin on April 5th, 2016 7:20 pm

    How is Kafka different from Fluentd? Both are open-source data collectors.

  10. Notes from a long trip, July 19, 2016 | DBMS 2 : DataBase Management System Services on July 19th, 2016 9:34 pm

    […] While Kafka is widely agreed to be the universal delivery mechanism for streams, the landscape for companion […]
