Kafka Consumer Rebalancing: Surviving the Storm
Consumer group rebalancing is one of the most common causes of latency spikes in Kafka-based systems. During a rebalance, consumers stop processing messages while partition ownership is redistributed, producing a "stop-the-world" pause.
1. Why do Rebalances happen?
- New Consumer Joins: Scaling out your processing layer.
- Consumer Leaves/Crashes: Scaling in or a failure event.
- Partition Change: Adding more partitions to a topic.
- Network Flaps: A consumer's heartbeat fails to reach the broker.
2. The Cost of Rebalancing
Traditionally, Kafka used the Eager Rebalance protocol. All consumers would revoke their partitions, wait for the Group Coordinator to assign new ones, and then resume. This results in:
- Zero Throughput: No messages are processed for seconds (or minutes).
- Lag Build-up: Downstream systems fall behind.
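A toy model (plain Python, no Kafka client; the round-robin assignment here is a simplification of the real assignors) makes the cost of the eager protocol visible: when membership changes, every consumer revokes every partition, even partitions that end up back on the same consumer.

```python
# Toy model of the eager protocol: a membership change revokes ALL partitions.
def eager_rebalance(assignment: dict, members: list, partitions: list):
    # Step 1: every consumer revokes everything ("stop the world").
    revoked = {m: parts for m, parts in assignment.items()}
    # Step 2: simple round-robin reassignment across the new membership.
    new_assignment = {m: [] for m in members}
    for i, p in enumerate(sorted(partitions)):
        new_assignment[members[i % len(members)]].append(p)
    return revoked, new_assignment

partitions = ["t-0", "t-1", "t-2", "t-3"]
before = {"c1": ["t-0", "t-1"], "c2": ["t-2", "t-3"]}
# A third consumer joins the group:
revoked, after = eager_rebalance(before, ["c1", "c2", "c3"], partitions)
print(revoked)  # every partition was revoked, even ones that move back
print(after)
```

Note that `c1` keeps `t-0` in the new assignment, yet still had to revoke it and pause. That wasted revocation is exactly what the cooperative protocol below eliminates.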
3. Cooperative Sticky Assignment (Kafka 2.4+)
The modern solution is the Incremental Cooperative Rebalancing protocol.
- The Concept: Instead of revoking all partitions, consumers only revoke the specific partitions that need to be moved.
- Benefit: Most consumers continue processing uninterrupted. This transforms a massive "stop-the-world" event into a series of minor, localized shifts.
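The same toy model shows the difference: given a current and a target assignment, only the partitions that actually move are revoked. (In a real client you enable this by setting `partition.assignment.strategy` to `org.apache.kafka.clients.consumer.CooperativeStickyAssignor`; the diff logic below is an illustrative sketch, not the assignor's actual implementation.)

```python
# Toy model of incremental cooperative rebalancing: compute only the
# partitions each member must give up to reach the target assignment.
def cooperative_revocations(current: dict, target: dict):
    revoked = {}
    for member, parts in current.items():
        lost = [p for p in parts if p not in target.get(member, [])]
        if lost:
            revoked[member] = lost  # only these partitions pause
    return revoked

current = {"c1": ["t-0", "t-1"], "c2": ["t-2", "t-3"]}
# c3 joins; a sticky assignor moves a single partition to it.
target = {"c1": ["t-0", "t-1"], "c2": ["t-2"], "c3": ["t-3"]}
print(cooperative_revocations(current, target))  # {'c2': ['t-3']}
```

Compared with the eager protocol, `c1` never stops at all, and `c2` pauses only for `t-3`.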
4. Tuning for Stability
To prevent unnecessary rebalances caused by transient network issues:
- session.timeout.ms: How long the broker waits for a heartbeat before declaring a consumer dead (default 10 s before Kafka 3.0, 45 s since).
- heartbeat.interval.ms: How often the consumer sends heartbeats; set this to roughly 1/3 of session.timeout.ms.
- max.poll.interval.ms: The maximum time allowed between calls to poll() (default 5 minutes). If your processing logic is slow, increase this to avoid being kicked out of the group.
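Put together, a stability-oriented consumer configuration might look like the following. The keys are real Kafka consumer settings; the values are illustrative starting points, not universal recommendations.

```python
# Consumer settings aimed at rebalance stability.
consumer_config = {
    "session.timeout.ms": "45000",     # broker waits this long for a heartbeat
    "heartbeat.interval.ms": "15000",  # ~1/3 of session.timeout.ms
    "max.poll.interval.ms": "300000",  # max gap between poll() calls (5 min)
    "partition.assignment.strategy":
        "org.apache.kafka.clients.consumer.CooperativeStickyAssignor",
}

# Sanity-check the 1/3 rule from the text:
assert (int(consumer_config["heartbeat.interval.ms"]) * 3
        == int(consumer_config["session.timeout.ms"]))
```

These values would be passed to the consumer constructor of whatever client library you use (e.g. the `Properties` object in the Java client).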
5. Static Group Membership (Kafka 2.3+)
By assigning a group.instance.id, you can make a consumer "static." If a static consumer restarts within its session timeout, it is allowed to rejoin the group without triggering a rebalance. This is a game-changer for rolling updates in Kubernetes.
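A minimal sketch of a static-membership configuration. `group.instance.id` is the real setting; the group name and the hostname-based id scheme are assumptions for illustration (in Kubernetes, a StatefulSet pod name makes a good stable id).

```python
import socket

# Static membership: a stable group.instance.id lets a restarted consumer
# rejoin without triggering a rebalance, as long as it returns within
# session.timeout.ms.
static_config = {
    "group.id": "orders-processor",  # hypothetical group name
    "group.instance.id": f"orders-processor-{socket.gethostname()}",
    "session.timeout.ms": "45000",   # the restart must finish inside this window
}
print(static_config["group.instance.id"])
```

The trade-off: if a static member dies for good, the group waits the full session timeout before rebalancing its partitions away, so size the timeout to comfortably cover a restart but not much more.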
Summary
Rebalances are inevitable, but their impact can be minimized. By adopting the cooperative sticky assignor and leveraging static group membership, you can maintain high throughput and low lag even during cluster maintenance and scaling events.
