Kafka Consumer Lag Increasing: Troubleshooting Guide


When Kafka consumer lag keeps increasing, the first useful question is not “how do I reset lag?” but “why is the consumer falling behind incoming records?”

Start with the fastest checks first: confirm the consumer is polling regularly, confirm processing is keeping up, and confirm the group is not burning too much time on rebalances or downstream waits.

If you are not fully sure the symptom belongs to Kafka instead of RabbitMQ or Redis, use the broader Middleware Troubleshooting Guide to choose the right branch first.


What consumer lag usually means

At a practical level, lag is the gap between the latest offset written to a partition (the log-end offset) and the offset the consumer group has committed; rising lag means the consumer is behind on at least some partitions.

Apache Kafka’s monitoring docs recommend watching consumer-side max lag and fetch rate together. For a consumer to keep up, max lag should stay below your threshold and minimum fetch rate should stay above zero.

That is the right starting frame: lag is usually a throughput or behavior problem, not just a number to clear.
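To make that concrete, per-partition lag is just log-end offset minus committed offset. A minimal sketch with made-up offsets (not from a real cluster):

```python
def partition_lag(log_end_offsets, committed_offsets):
    """Lag per partition = log-end offset minus committed consumer offset."""
    return {
        p: log_end_offsets[p] - committed_offsets.get(p, 0)
        for p in log_end_offsets
    }

# Hypothetical offsets for a 3-partition topic.
end = {0: 1500, 1: 1480, 2: 9200}
committed = {0: 1500, 1: 1475, 2: 3100}

lag = partition_lag(end, committed)
print(lag)                 # {0: 0, 1: 5, 2: 6100}
print(max(lag.values()))   # 6100 -- one hot partition dominates the total
```

Note how a single partition can carry almost all of the lag, which is why per-partition numbers matter more than the group total.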


Check poll timing first

Kafka consumer configs document max.poll.interval.ms as the maximum delay between poll() calls before the consumer is considered failed and the group rebalances.

That matters because a slow processing loop can look like “lag is growing” when the real issue is:

  • the consumer is not calling poll() often enough
  • rebalances keep interrupting work
  • the app is doing too much processing between polls

If lag is rising and the consumer is also rebalancing, this is one of the first places to look.
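One way to check this without touching the broker is to log poll timestamps in the application and look for gaps that approach the limit. A sketch, assuming the Kafka default of 300000 ms for max.poll.interval.ms and hypothetical timestamps:

```python
MAX_POLL_INTERVAL_MS = 300_000  # Kafka's default max.poll.interval.ms

def poll_interval_risk(poll_timestamps_ms, threshold=0.8):
    """Flag gaps between consecutive poll() calls nearing the limit.

    poll_timestamps_ms: increasing timestamps of each poll() call.
    Returns the gaps (ms) exceeding threshold * max.poll.interval.ms.
    """
    gaps = [b - a for a, b in zip(poll_timestamps_ms, poll_timestamps_ms[1:])]
    return [g for g in gaps if g > threshold * MAX_POLL_INTERVAL_MS]

# Hypothetical poll timeline: one batch took ~290 s to process.
polls = [0, 5_000, 10_000, 300_500, 305_000]
print(poll_interval_risk(polls))  # [290500] -- dangerously close to the 300000 ms limit
```

Gaps that repeatedly land near the limit mean the group is one slow batch away from an eviction-and-rebalance cycle.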


Compare processing speed with publish rate

Lag grows whenever consumption throughput is lower than production throughput.

Common reasons:

  • downstream database or API calls are slow
  • one partition gets heavier traffic than others
  • message handlers do too much work synchronously
  • batch size and poll cadence are poorly tuned

The important thing is to treat lag as a symptom of imbalance, not as an isolated metric.
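The arithmetic behind that framing is simple: lag only shrinks when the consume rate exceeds the produce rate, and the surplus determines how long catch-up takes. A sketch with illustrative rates:

```python
def catch_up_time_s(current_lag, produce_rate, consume_rate):
    """Estimate seconds until lag reaches zero; None if it can only grow.

    Rates are in records/second; this assumes both rates stay constant.
    """
    if consume_rate <= produce_rate:
        return None  # no surplus capacity: lag grows or holds steady
    return current_lag / (consume_rate - produce_rate)

# 60k records behind, producers writing 500 rec/s.
print(catch_up_time_s(60_000, produce_rate=500, consume_rate=450))  # None
print(catch_up_time_s(60_000, produce_rate=500, consume_rate=700))  # 300.0
```

If the estimate is None, no amount of offset management helps; only faster handlers, more consumers (up to the partition count), or less synchronous work will.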


Check a few consumer configs that often matter

Kafka docs make a few settings especially relevant here:

  • max.poll.interval.ms
  • max.poll.records
  • max.partition.fetch.bytes
  • heartbeat and session timing within group management

You do not want to “tune everything.” You want to ask whether the current values fit the actual processing cost of each batch.
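A useful sanity check is whether a full batch is even processable within the poll interval. The sketch below uses Kafka Java-client property names and entirely illustrative values; the per-record latency is a hypothetical measurement, and the right numbers depend on your handlers:

```python
# Illustrative values only -- check your own client's config names and defaults.
consumer_tuning = {
    "max.poll.interval.ms": 300_000,        # max gap between poll() calls before eviction
    "max.poll.records": 100,                # smaller batches -> more frequent polls
    "max.partition.fetch.bytes": 1_048_576, # per-partition fetch cap
    "session.timeout.ms": 45_000,           # group-management liveness window
    "heartbeat.interval.ms": 3_000,         # should sit well below session.timeout.ms
}

per_record_ms = 500  # hypothetical measured handler latency per record
worst_case_batch_ms = consumer_tuning["max.poll.records"] * per_record_ms

# Rule of thumb: a worst-case batch must fit inside the poll interval with headroom.
assert worst_case_batch_ms < consumer_tuning["max.poll.interval.ms"], \
    "batch processing can exceed max.poll.interval.ms -- expect rebalances"
```

If that assertion would fail with your real numbers, lower max.poll.records or raise max.poll.interval.ms before touching anything else.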


Watch for rebalance side effects

Increasing lag often appears together with unstable consumer groups.

Typical pattern:

  1. processing slows down
  2. poll timing worsens
  3. the group rebalances
  4. progress stalls again
  5. lag rises further

If you suspect this loop, do not look only at lag graphs. Look at rebalance behavior and consumer-group stability too.
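The loop above can be detected numerically: count rebalance events inside a sliding window and compare against a churn budget. A sketch with hypothetical event timestamps and a made-up budget of two rebalances per ten minutes:

```python
def rebalance_churn(rebalance_times_s, window_s=600, max_ok=2):
    """Return (is_churning, worst_count): the densest window of rebalances."""
    worst = 0
    for t in rebalance_times_s:
        in_window = [x for x in rebalance_times_s if t <= x < t + window_s]
        worst = max(worst, len(in_window))
    return worst > max_ok, worst

# Hypothetical rebalance timestamps (seconds): five rebalances in ~8 minutes.
events = [0, 90, 200, 310, 450]
print(rebalance_churn(events))  # (True, 5)
```

A group that rebalances this often spends a meaningful fraction of its time redistributing partitions instead of consuming, so lag rises even when handlers are fast.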

If records seem to stop reaching the application entirely rather than only falling behind, compare the same incident with Kafka Messages Not Consumed.


A practical debugging order

  1. confirm lag by partition, not just one total number
  2. check whether consumers are polling regularly
  3. inspect processing latency of the handler
  4. look for rebalance frequency
  5. review max.poll.interval.ms and max.poll.records
  6. compare input rate to actual consume rate

That order usually gets closer to root cause than offset-reset actions.


Common causes

1. Slow business logic

The consumer is alive, but the handler simply cannot keep up.

2. Poll loop starvation

The app waits too long between poll() calls.

3. Frequent rebalances

Useful work is repeatedly interrupted.

4. Uneven partition load

One partition becomes the bottleneck even when the overall cluster looks healthy.

If that checklist points to churn, poll timing, or idle consumers, the linked Kafka follow-up guides will usually narrow the incident faster than offset actions.
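For the uneven-partition case, a quick skew check is to compare each partition's lag against the group median. A sketch with made-up lag numbers and an arbitrary skew factor:

```python
from statistics import median

def partition_skew(lag_by_partition, factor=5):
    """Return partitions whose lag is far above the median -- likely hot keys."""
    med = median(lag_by_partition.values())
    return {p: l for p, l in lag_by_partition.items() if med > 0 and l > factor * med}

# Hypothetical per-partition lag: partition 2 is the hot one.
lags = {0: 120, 1: 95, 2: 14_000, 3: 110}
print(partition_skew(lags))  # {2: 14000}
```

A skewed result usually points at the partitioning key, not at consumer tuning: one key (or a null key pattern) is funneling traffic into a single partition.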


Symptom shortcut

  • Start here if consumer lag keeps increasing even though the consumer group still looks alive.
  • If records stop arriving entirely instead of just falling behind, the messages-not-consumed guide may be the better entry point.

Quick commands

kafka-consumer-groups.sh --bootstrap-server <broker:9092> --group <group> --describe
kafka-topics.sh --bootstrap-server <broker:9092> --describe --topic <topic>
kafka-configs.sh --bootstrap-server <broker:9092> --entity-type topics --entity-name <topic> --describe

Use these to inspect lag by partition, confirm partition layout, and compare the topic configuration with the throughput pattern you expect.

Look for one partition lagging much more than others, stalled current offsets, and topic or consumer settings that do not match handler speed.
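When the describe output is long, it can help to rank rows by lag mechanically. A sketch that parses the common column layout of kafka-consumer-groups.sh --describe (GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG ...); the sample text is fabricated, and the column positions may differ across Kafka versions, so adjust the indices to your output:

```python
def worst_lag_rows(describe_output, top=3):
    """Rank `kafka-consumer-groups.sh --describe` rows by the LAG column.

    Assumes columns: GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG ...
    """
    rows = []
    for line in describe_output.strip().splitlines()[1:]:  # skip header row
        cols = line.split()
        topic, partition, lag = cols[1], int(cols[2]), int(cols[5])
        rows.append((lag, topic, partition))
    return sorted(rows, reverse=True)[:top]

sample = """GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
orders-app orders 0 1500 1500 0 c1 /10.0.0.1 app-1
orders-app orders 1 1475 1480 5 c1 /10.0.0.1 app-1
orders-app orders 2 3100 9200 6100 c2 /10.0.0.2 app-2"""

print(worst_lag_rows(sample))  # [(6100, 'orders', 2), (5, 'orders', 1), (0, 'orders', 0)]
```

The top entry tells you which partition to investigate first, and whether its CURRENT-OFFSET is moving at all.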


FAQ

Q. Is increasing lag always a Kafka cluster problem?

No. It is often an application throughput problem or consumer-group behavior problem.

Q. Which setting should I inspect first?

Usually start with max.poll.interval.ms, max.poll.records, and actual handler latency.

Q. Should I reset offsets to make lag disappear?

Only after you understand the cause. Resetting offsets changes the symptom, not necessarily the reason it happened.

