Apache Kafka is a high-throughput distributed messaging system, offering strong durability and fault tolerance guarantees. Kafka is a master piece among distributed systems.
Today we want to share with the community a subtle aspect of Kafka fault tolerance that you may be underestimating.
Kafka libraries for producing messages are not as fault tolerant as you might expect. Depending on the use case, this may or may not be a serious problem.
Our requirements are the following:
- Messages must not be lost as long as the cluster is healthy.
- Prefer guaranteed delivery over load balancing.
- Rely on the Kafka library (kafka-clients, rdkafka, ...) without additional logic.
If your case meets the above points as well, keep reading the post.
This is the scenario: 3 Kafka servers (k1, k2, k3) and 1 producer (p1). The 3 Kafka servers are healthy, while the communication between p1 and k1 is broken. We say that k1 is unreachable from p1.
The topic the producer wants to send messages to has 3 partitions. Broker k1 leads partition 1, k2 leads partition 2, and k3 leads partition 3.
The producer contacts the bootstrap servers (k2 or k3) and receives the metadata about the topic leaders and partitions. In this case, 3 leaders and 3 partitions, even though p1 is not able to contact k1.
When the producer sends a message to k1, it fails after a timeout, and it doesn't retry sending the message to a different partition of a reachable leader.
It's a very simple failure scenario, yet there are chances that many software relying on the commonly used Kafka libraries are losing messages in this situation.
Kafka libraries usually have at least three strategies to decide the topic partitions: round robin, hashing or manual.
In the case of round robin strategy: catch the error and retry sending the message. The library will eventually proceed to the next available broker and succeed.
In the case of hashing strategy: catch the error, choose a different key and retry sending the message. Hopefully this results in choosing a partition of a reachable broker.
In the case of manual strategy: you have much more freedom because you choose the partition. You can mark the partition leader (k1 in this case) as unavailable for some time, so you can send messages only to partitions of reachable leaders.
In case you need load balancing even during this failure scenario: add the message to some kind of backlog, possibly backed on disk, then retry later. But this adds an incredibly complex layer to the application.
We expected Kafka producers to be internally fault tolerant in case a broker was unreachable. If you expected this too, you might be losing messages in this moment.
However implementing the fault tolerance outside Kafka libraries by yourself is not the best option. In our opinion these kind of failures should be solved in Kafka libraries. For this reason we opened JIRA KAFKA-3686.
Hopefully this post has been helpful to you. Distributed systems are hard because there are many cases to consider. In Immobiliare.it we always try to gather as much knowledge as we can about software solutions and their failure scenarios.