Skip to content

Consumer Rebalance Storms: Why Your Kafka Consumer Group Never Stabilizes

A deep dive into Kafka consumer rebalance storms — the exact CommitFailedException and JoinGroup log spam that signals it, why max.poll.interval.ms and eager rebalancing are usually to blame, and how to fix it with cooperative sticky assignment and static membership.

Gopi Gorantala
Gopi Gorantala
8 min read

1. The Error / Issue

If you run consumer groups long enough, you've seen this in your logs:

org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed since the group has already
rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll()
was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much
time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum
size of batches returned in poll() with max.poll.records.

Alongside it, on the broker side (group coordinator logs), you'll see the group cycling endlessly:

[GroupCoordinator 2]: Preparing to rebalance group 'order-events-consumer' in state PreparingRebalance with old
generation 47 (__consumer_offsets-12) (reason: Adding new member consumer-3-a1b2c3-d4e5 with group instance id None)
[GroupCoordinator 2]: Stabilized group 'order-events-consumer' generation 48 (__consumer_offsets-12)
[GroupCoordinator 2]: Preparing to rebalance group 'order-events-consumer' in state PreparingRebalance with old
generation 48 (__consumer_offsets-12) (reason: Removing member consumer-3-a1b2c3-d4e5 on heartbeat expiration)

Generation numbers keep incrementing, throughput craters, and the group never settles. This is a rebalance storm: the group repeatedly enters PreparingRebalanceCompletingRebalanceStable, only to be kicked back into PreparingRebalance seconds later because a member drops out (or joins) before the group ever stabilizes.

Versions where this is common:

  • Kafka broker 2.x–3.x (the underlying eager-rebalance protocol issue predates KIP-429 and is still the default consumer partition-assignment behavior unless explicitly changed)
  • kafka-clients 2.0–2.4 (before cooperative sticky assignor was available); still reproducible on 3.x if you keep RangeAssignor/RoundRobinAssignor and don't tune max.poll.interval.ms
  • Spring Kafka 2.x/3.x wrapping any of the above — the symptom shows up identically since Spring just delegates to the underlying consumer's poll loop
  • Typical setup: a consumer group with 10+ members, topic with 50+ partitions, max.poll.records left at a large default (500) while per-record processing does synchronous I/O (DB writes, HTTP calls to a downstream service)

2. How to Reproduce It (Step-by-Step)

Local cluster

# docker-compose.yml
version: '3.8'
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.6.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
  kafka:
    image: confluentinc/cp-kafka:7.6.0
    depends_on: [zookeeper]
    ports:
      - "9092:9092"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_GROUP_MIN_SESSION_TIMEOUT_MS: 6000
      KAFKA_GROUP_MAX_SESSION_TIMEOUT_MS: 300000
docker compose up -d

docker exec -it $(docker ps -qf name=kafka) kafka-topics \
  --bootstrap-server localhost:9092 \
  --create --topic order-events --partitions 24 --replication-factor 1

Trigger the storm

Run four or five consumer instances in the same group, each with a poll loop that occasionally does slow, blocking work — simulating a downstream call that stalls:

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-events-consumer");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 500);       // default
props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 300000); // default 5 min
props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
    "org.apache.kafka.clients.consumer.RangeAssignor"); // eager — the culprit

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(List.of("order-events"));

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    for (ConsumerRecord<String, String> record : records) {
        // Simulate a downstream call that occasionally stalls hard
        if (ThreadLocalRandom.current().nextInt(1000) == 0) {
            Thread.sleep(400_000); // exceeds max.poll.interval.ms
        }
        process(record);
    }
    consumer.commitSync();
}

Load the topic hard enough that every instance eventually hits the slow path:

kafka-producer-perf-test --topic order-events \
  --num-records 5000000 --record-size 512 --throughput 20000 \
  --producer-props bootstrap.servers=localhost:9092

Watch the group:

watch -n1 "kafka-consumer-groups --bootstrap-server localhost:9092 \
  --describe --group order-events-consumer --state"

You'll see PreparingRebalance reappear every time one instance breaches max.poll.interval.ms, and — critically — with RangeAssignor/RoundRobinAssignor (eager protocol), every member gives up all its partitions and rejoins from scratch on every single rebalance, even the members that never stalled. That's what turns a single slow consumer into a group-wide storm.

Environment-specific triggers:

  • Only visible under real partition/member counts — a 1-partition, 1-consumer toy setup won't reproduce it because there's nothing to redistribute
  • Gets dramatically worse with high partition counts (50+) because each eager rebalance pause is proportional to the cost of re-fetching and re-initializing state (Kafka Streams state stores make this an order of magnitude worse)
  • Consumer group churn — rolling deploys, autoscaling, pods restarting on OOM — is the most common real-world trigger, not just slow processing

3. Why It Happens — Surface Level

max.poll.interval.ms is the contract: if the client thread doesn't call poll() again within that window, the group coordinator treats the member as dead and evicts it. The eviction — one member leaving — is by itself supposed to be cheap. It becomes a storm when the default eager assignment protocol is used: eager assignors (RangeAssignor, RoundRobinAssignor) require all consumers to revoke all of their partitions at the start of every rebalance, then get reassigned once the group re-forms. So one flaky consumer causes the whole group to stop processing, not just the one partition set that needs to move.

4. Why It Happens — Architectural/Deeper Level

The consumer group protocol is coordinated by a broker acting as the group coordinator — the broker that owns the __consumer_offsets partition the group hashes to. Every member sends periodic heartbeats to this coordinator on a background thread; poll() on the main thread is what actually advances the group's liveness contract, because in the classic (non-KIP-848) protocol the heartbeat thread only confirms membership, while max.poll.interval.ms is enforced independently to guarantee the application thread is still making progress, not just the network thread.

When a member's poll() interval is exceeded, the coordinator marks it dead and increments the generation. Every generation bump requires:

  1. JoinGroup — all currently active members send their subscription metadata to the coordinator
  2. The elected group leader (the first member to join) runs the assignment strategy locally and computes partition ownership
  3. SyncGroup — the leader ships the computed assignment back through the coordinator to every member

With eager assignors, step 1 is preceded by every member revoking its entire partition set — onPartitionsRevoked fires for everything, consumers pause fetching, in-flight offsets get committed (or don't, if you're slow), and only after the new generation stabilizes does onPartitionsAssigned hand partitions back out, usually to different consumers than before. If your Kafka Streams app has RocksDB state stores backing those partitions, this also means state store rebuild/restoration from the changelog topic for every reassigned partition — which takes real wall-clock time proportional to state size, further extending the window where another member might also hit max.poll.interval.ms, compounding into a cascading storm where the group generation increments continuously and never reaches Stable long enough to make progress.

This is fundamentally a stop-the-world synchronization design: the eager protocol trades simplicity (no partial assignment state to reason about) for a global pause on every membership change. That tradeoff was fine when clusters had a handful of consumers and rebalances were rare. It falls apart under autoscaling, frequent deploys, or workloads with occasional slow records — exactly the conditions of a modern production system.

5. The Fix

Fix 1 — switch to the cooperative sticky assignor. This is the highest-leverage, lowest-risk change:

- props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
-     "org.apache.kafka.clients.consumer.RangeAssignor");
+ props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
+     "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");

With cooperative rebalancing (KIP-429), only the partitions that actually need to move are revoked — everyone else keeps consuming uninterrupted during the rebalance. A single stalled consumer no longer stalls the group; it triggers a much cheaper, incremental reassignment of just its partitions.

Fix 2 — fix the actual root cause: processing time exceeding max.poll.interval.ms.

- props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 500);
- props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 300000);
+ props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 50);
+ props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 600000);

Lowering max.poll.records shrinks the batch you must finish processing before the next poll(), directly reducing the odds of breaching the interval. Raising max.poll.interval.ms buys headroom for genuinely slow downstream calls, but don't use it to paper over unbounded processing time — it only delays detection of a stuck consumer.

Fix 3 — move slow work off the poll thread entirely, if per-record processing legitimately involves unpredictable I/O:

ExecutorService workers = Executors.newFixedThreadPool(8);

while (running) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    List<Future<?>> futures = new ArrayList<>();
    for (ConsumerRecord<String, String> record : records) {
        futures.add(workers.submit(() -> process(record)));
    }
    for (Future<?> f : futures) f.get(20, TimeUnit.SECONDS); // bound the wait
    consumer.commitSync();
}

Use case guidance:

  • Low-throughput, few consumers, occasional slow calls — Fix 2 alone (tune max.poll.records down, max.poll.interval.ms up) is usually sufficient.
  • High partition count, autoscaling groups, frequent deploys — Fix 1 is mandatory; you cannot autoscale safely on eager assignors.
  • Streams apps with heavy per-record I/O — combine Fix 1 with Fix 3, and consider async processing patterns (see below) rather than blocking the poll loop at all.

6. Best Alternative Approach / How to Re-Architect It

Patching max.poll.interval.ms and swapping the assignor treats the symptom. The architecturally correct fix is to decouple the poll loop from processing latency entirely and adopt static group membership so that routine restarts (deploys, autoscaling) don't trigger a rebalance at all.

Static membership (group.instance.id) tells the coordinator "this is the same logical member, just reconnecting" — a restart within session.timeout.ms doesn't revoke partitions or bump the generation:

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-1:9092,kafka-2:9092,kafka-3:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-events-consumer");
props.put(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG, "order-events-consumer-" + podOrdinal()); // static identity
props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 45000);           // survive brief restarts
props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
    "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 50);
props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 120000);

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(List.of("order-events"));

ExecutorService workers = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors() * 2);

while (running) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    Map<TopicPartition, OffsetAndMetadata> toCommit = new ConcurrentHashMap<>();

    List<Future<?>> futures = new ArrayList<>();
    for (ConsumerRecord<String, String> record : records) {
        futures.add(workers.submit(() -> {
            process(record);
            toCommit.merge(
                new TopicPartition(record.topic(), record.partition()),
                new OffsetAndMetadata(record.offset() + 1),
                (oldVal, newVal) -> newVal.offset() > oldVal.offset() ? newVal : oldVal
            );
        }));
    }
    for (Future<?> f : futures) f.get(15, TimeUnit.SECONDS);
    consumer.commitSync(toCommit); // commit only what actually finished
}

For Kubernetes deployments, derive group.instance.id from the StatefulSet pod ordinal (hostname) so each replica keeps a stable identity across restarts — this is what turns routine rolling deploys from a rebalance-storm trigger into a non-event.

If you're on kafka-clients 3.7+, also evaluate KIP-848's new consumer group protocol (group.protocol=consumer), which replaces the JoinGroup/SyncGroup dance with per-member incremental reconciliation driven by the coordinator — no client-side assignor computation, no stop-the-world generation barrier at all. It's the long-term direction Kafka is heading, but validate it in staging first; it's still stabilizing across minor releases.

7. How to Prevent It Long-Term

  • Alert on rebalance rate, not just lag. Export kafka_consumer_coordinator_rebalance_rate_total (or scrape kafka-consumer-groups --describe on a cron) and page when a group rebalances more than once per few minutes outside of a known deploy window.
  • Track poll() cycle time as an application metric, not just consumer lag — lag is a lagging indicator; time-between-polls tells you you're about to breach max.poll.interval.ms before the coordinator kicks you out.
  • Standardize group.instance.id generation in your service template so every team's Kafka consumers get static membership by default, not opt-in per team.
  • Load-test rebalance behavior explicitly — kill/restart consumers under production-representative partition counts and processing latency in staging, and assert that throughput doesn't drop to zero during the transition.
  • Ban eager assignors in code review / lint rules for any new consumer group — default your internal Kafka client wrapper library to CooperativeStickyAssignor so it's not a per-team decision.
  • Set max.poll.interval.ms based on measured p99.9 processing time, not guesswork — instrument it, then set the threshold with real headroom.

8. Key Takeaways

  • A CommitFailedException plus a cycling generation number in group coordinator logs means members are being evicted for exceeding max.poll.interval.ms, and the eager assignor is turning one slow consumer into a group-wide stall.
  • Eager assignors (RangeAssignor, RoundRobinAssignor) revoke all partitions from all members on every rebalance — switch to CooperativeStickyAssignor to make rebalances incremental and non-disruptive.
  • Fix the actual root cause too: bound per-poll processing time with max.poll.records, move slow I/O off the poll thread, and size max.poll.interval.ms from measured latency, not defaults.
  • Static membership (group.instance.id) eliminates rebalances entirely for routine restarts and deploys — the biggest real-world trigger of rebalance storms in autoscaled environments.
  • Monitor rebalance rate and poll-cycle time as first-class metrics; lag alone won't catch this until throughput has already collapsed.

Gopi Gorantala Twitter

Gopi is an Engineering Manager with over 14 years of extensive expertise in Java-based applications. He resides in Europe and specializes in designing and scaling high-performance applications.

Comments