While working on Event Driven Architecture for my Notification System, I got to know the flakyness of the single broker Kafka setup in our organization. It used to crash frequently due to which teams were hesitant to use it and were relying extensively on Rest APIs.
Had a major discussion on this issue with the company's Architecture team and was allowed to setup a new highly-available Kafka Cluster for the org. Within a week I had successfully setup company's current eBH Kafka Cluster, setup across 3 VMs, having a replication factor of 3, to prevent any data loss in case of downtime of a VM and a minimum in-sync replicas of 2, to allow data consistency across Kafka brokers. One more key fix was to increase the ulimit of Ubuntu VM from its default value of 1024 to 10k, this ulimit was getting hit very easily as every partition for a topic opens up as a separate file in Kafka logs. I also came up with the approach of distributing Zookeeper ensemble across the same 3 VMs, making both Kafka and Zookeeper highly available. Tested the setup by restarting/killing the VMs one-by-one and observed that Kafka was still operational making the project a success.
This Kafka Cluster is now being used by the whole company and due to its high availability, all squads are moving towards making their own Event Driven Architectures!