When Kafka consumer lag keeps increasing, the first useful question is not “how do I reset lag?” but “why is the consumer falling behind incoming records?”
Start with the fastest checks: confirm the consumer is polling regularly, confirm processing is keeping up, and confirm the group is not losing too much time to rebalances or downstream waits.
If you are not fully sure the symptom belongs to Kafka instead of RabbitMQ or Redis, use the broader Middleware Troubleshooting Guide to choose the right branch first.
What consumer lag usually means
At a practical level, lag means the consumer is behind on at least some partitions.
Apache Kafka’s monitoring docs recommend watching consumer-side max lag and fetch rate together. For a consumer to keep up, max lag should stay below your threshold and minimum fetch rate should stay above zero.
That is the right starting frame: lag is usually a throughput or behavior problem, not just a number to clear.
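To make the definition concrete, here is a minimal sketch of how per-partition lag and max lag can be derived from offsets. The function name and the sample offsets are hypothetical, and treating a missing committed offset as 0 is a simplifying assumption (a real consumer follows its auto.offset.reset policy).

```python
from typing import Dict


def partition_lag(log_end_offsets: Dict[int, int],
                  committed_offsets: Dict[int, int]) -> Dict[int, int]:
    """Return lag per partition: log end offset minus committed offset."""
    lag = {}
    for partition, end in log_end_offsets.items():
        # Assumption: a partition with no committed offset is counted from 0.
        committed = committed_offsets.get(partition, 0)
        lag[partition] = max(end - committed, 0)
    return lag


# Hypothetical offsets for a three-partition topic.
end_offsets = {0: 1_500, 1: 1_480, 2: 9_200}
committed = {0: 1_495, 1: 1_480, 2: 4_100}

lags = partition_lag(end_offsets, committed)
max_lag = max(lags.values())
print(lags)     # lag for each partition
print(max_lag)  # the single "max lag" figure to compare against a threshold
```

In this made-up sample almost all of the lag sits on one partition, which is why a single total can hide where the consumer is actually behind.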
Check poll timing first
Kafka consumer configs document max.poll.interval.ms as the maximum delay between poll() calls before the consumer is considered failed and the group rebalances.
That matters because a slow processing loop can look like “lag is growing” when the real issue is:
- the consumer is not calling poll() often enough
- rebalances keep interrupting work
- the app is doing too much processing between polls
If lag is rising and the consumer is also rebalancing, this is one of the first places to look.
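One way to check poll timing directly is to measure the gap between successive poll() calls inside the application. The sketch below is illustrative only: PollGapTracker is a hypothetical helper, not part of any Kafka client, and the 300000 ms default mirrors Kafka's documented default for max.poll.interval.ms. The clock is injectable so the example runs without a broker.

```python
import time
from typing import Callable, Optional


class PollGapTracker:
    """Track the longest gap between successive poll() calls."""

    def __init__(self, max_poll_interval_ms: int = 300_000,
                 clock: Callable[[], float] = time.monotonic):
        self.max_poll_interval_ms = max_poll_interval_ms
        self._clock = clock
        self._last_poll: Optional[float] = None
        self.worst_gap_ms = 0.0

    def record_poll(self) -> float:
        """Call right before each poll(); returns the gap since the last call in ms."""
        now = self._clock()
        gap_ms = 0.0 if self._last_poll is None else (now - self._last_poll) * 1000.0
        self._last_poll = now
        self.worst_gap_ms = max(self.worst_gap_ms, gap_ms)
        return gap_ms

    def headroom_ratio(self) -> float:
        """Fraction of max.poll.interval.ms consumed by the worst gap seen so far."""
        return self.worst_gap_ms / self.max_poll_interval_ms


# Demo with a fake clock: polls at t=0s, t=2s, and t=152s.
fake_times = iter([0.0, 2.0, 152.0])
tracker = PollGapTracker(clock=lambda: next(fake_times))
for _ in range(3):
    tracker.record_poll()  # in a real loop this sits just before the client's poll()
print(tracker.worst_gap_ms, tracker.headroom_ratio())
```

A ratio that creeps toward 1.0 suggests the processing loop is close to triggering a rebalance, even if lag is the only symptom visible so far.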
Compare processing speed with publish rate
Lag grows whenever consumption throughput is lower than production throughput.
Common reasons:
- downstream database or API calls are slow
- one partition gets heavier traffic than others
- message handlers do too much work synchronously
- batch size and poll cadence are poorly tuned
The important thing is to treat lag as a symptom of imbalance, not as an isolated metric.
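The imbalance can be put into numbers with simple arithmetic. This is a rough sketch that assumes steady rates measured in records per second; the function names and sample figures are hypothetical.

```python
from typing import Optional


def lag_growth_per_second(produce_rate: float, consume_rate: float) -> float:
    """Positive result: lag is growing. Negative result: lag is draining."""
    return produce_rate - consume_rate


def seconds_to_drain(current_lag: int, produce_rate: float,
                     consume_rate: float) -> Optional[float]:
    """Time to reach zero lag at steady rates, or None if the consumer never catches up."""
    surplus = consume_rate - produce_rate
    if surplus <= 0:
        return None
    return current_lag / surplus


# Hypothetical numbers: producers write 1200 records/s, consumers handle 1000 records/s.
print(lag_growth_per_second(1200, 1000))     # lag grows by 200 records every second
print(seconds_to_drain(90_000, 1200, 1000))  # None: this consumer cannot catch up
print(seconds_to_drain(90_000, 1200, 1500))  # 300.0 seconds once throughput reaches 1500/s
```

The None case is the one that matters during an incident: resetting offsets would clear the number, but lag would start growing again at the same rate.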
Check a few consumer configs that often matter
A few settings from the Kafka consumer configuration docs are especially relevant here:
- max.poll.interval.ms
- max.poll.records
- max.partition.fetch.bytes
- heartbeat and session timing within group management
You do not want to “tune everything.” You want to ask whether the current values fit the actual processing cost of each batch.
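A quick way to ask that question is to estimate the worst-case time for one batch and compare it with the poll interval. The sketch below uses Kafka's documented defaults (max.poll.records = 500, max.poll.interval.ms = 300000); the helper names and the 0.5 safety factor are assumptions for illustration, and per_record_ms should come from your own handler measurements.

```python
def batch_fits_poll_interval(max_poll_records: int, per_record_ms: float,
                             max_poll_interval_ms: int = 300_000,
                             safety_factor: float = 0.5) -> bool:
    """True if a full batch finishes within a safe share of max.poll.interval.ms."""
    worst_case_batch_ms = max_poll_records * per_record_ms
    budget_ms = max_poll_interval_ms * safety_factor  # assumed headroom, not a Kafka setting
    return worst_case_batch_ms <= budget_ms


def largest_safe_max_poll_records(per_record_ms: float,
                                  max_poll_interval_ms: int = 300_000,
                                  safety_factor: float = 0.5) -> int:
    """Largest batch size that still fits the budget (at least 1)."""
    if per_record_ms <= 0:
        raise ValueError("per_record_ms must be positive")
    return max(int((max_poll_interval_ms * safety_factor) // per_record_ms), 1)


# Hypothetical handler that needs about 400 ms per record (for example, one slow API call).
print(batch_fits_poll_interval(500, 400))  # False: 200 s of work against a 150 s budget
print(largest_safe_max_poll_records(400))  # 375 records per poll would fit
```

Whether to lower max.poll.records, raise max.poll.interval.ms, or speed up the handler is a separate decision; the point is only to check the fit before changing anything.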
Watch for rebalance side effects
Increasing lag often appears together with unstable consumer groups.
Typical pattern:
- processing slows down
- poll timing worsens
- the group rebalances
- progress stalls again
- lag rises further
If you suspect this loop, do not look only at lag graphs. Look at rebalance behavior and consumer-group stability too.
If records seem to stop reaching the application entirely rather than only falling behind, compare the same incident with Kafka Messages Not Consumed.
A practical debugging order
- confirm lag by partition, not just one total number
- check whether consumers are polling regularly
- inspect processing latency of the handler
- look for rebalance frequency
- review max.poll.interval.ms and max.poll.records
- compare input rate to actual consume rate
That order usually gets you closer to the root cause than resetting offsets does.
Common causes
1. Slow business logic
The consumer is alive, but the handler simply cannot keep up.
2. Poll loop starvation
The app waits too long between poll() calls.
3. Frequent rebalances
Useful work is repeatedly interrupted.
4. Uneven partition load
One partition becomes the bottleneck even when the overall cluster looks healthy.
If that checklist points to churn, poll timing, or idle consumers, the linked Kafka follow-up guides will usually narrow the incident faster than offset actions.
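For the uneven partition load case in particular, a small check against the median lag can make a hot partition stand out. This is a sketch under assumed thresholds: the function name, the 5x ratio, and the 1000-record floor are hypothetical and should be tuned to your own traffic.

```python
from statistics import median
from typing import Dict, List


def hot_partitions(lag_by_partition: Dict[int, int],
                   ratio: float = 5.0, min_lag: int = 1000) -> List[int]:
    """Return partitions whose lag is far above the median lag of the group."""
    if not lag_by_partition:
        return []
    # Use at least 1 as the baseline so an all-zero median does not divide by zero.
    baseline = max(median(lag_by_partition.values()), 1)
    return sorted(
        partition
        for partition, lag in lag_by_partition.items()
        if lag >= min_lag and lag / baseline >= ratio
    )


# Hypothetical lag snapshot: partition 2 carries nearly all of the backlog.
print(hot_partitions({0: 120, 1: 90, 2: 48_000, 3: 150}))  # [2]
# Evenly high lag points to overall throughput, not skew.
print(hot_partitions({0: 5_000, 1: 5_200, 2: 4_900}))      # []
```

An empty result with high lag everywhere points back toward slow handlers or poll timing rather than key distribution.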
Symptom shortcut
- Start here if consumer lag keeps increasing even though the consumer group still looks alive.
- If records stop arriving entirely instead of just falling behind, the messages-not-consumed guide may be the better entry point.
Quick commands
kafka-consumer-groups.sh --bootstrap-server <broker:9092> --group <group> --describe
kafka-topics.sh --bootstrap-server <broker:9092> --describe --topic <topic>
kafka-configs.sh --bootstrap-server <broker:9092> --entity-type topics --entity-name <topic> --describe
Use these to inspect lag by partition, confirm partition layout, and compare the topic configuration with the throughput pattern you expect.
Look for one partition lagging much more than others, stalled current offsets, and topic or consumer settings that do not match handler speed.
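If you want to script that inspection, the output of the --describe command can be parsed into per-partition lag. The sketch below assumes the usual column headers (TOPIC, PARTITION, CURRENT-OFFSET, LOG-END-OFFSET, LAG, and so on); the exact layout can vary between Kafka versions, and the sample text is made up for illustration.

```python
from typing import Dict, List


def parse_describe_output(text: str) -> List[Dict[str, str]]:
    """Turn whitespace-separated --describe output into a list of row dicts."""
    rows: List[Dict[str, str]] = []
    header = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if "PARTITION" in fields and "LAG" in fields:
            header = fields  # remember column names from the header row
            continue
        if header is None or len(fields) < len(header):
            continue  # skip informational lines that are not data rows
        rows.append(dict(zip(header, fields)))
    return rows


def lag_by_partition(rows: List[Dict[str, str]]) -> Dict[str, int]:
    """Map 'topic-partition' to lag, skipping rows where lag is not a number."""
    result: Dict[str, int] = {}
    for row in rows:
        if row["LAG"].isdigit():
            result[f'{row["TOPIC"]}-{row["PARTITION"]}'] = int(row["LAG"])
    return result


# Made-up sample in the assumed column layout.
SAMPLE = """
GROUP   TOPIC   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG   CONSUMER-ID  HOST       CLIENT-ID
billing orders  0          1495            1500            5     c-1          /10.0.0.1  app-1
billing orders  1          1480            1480            0     c-1          /10.0.0.1  app-1
billing orders  2          4100            9200            5100  -            -          -
"""

parsed_lag = lag_by_partition(parse_describe_output(SAMPLE))
print(parsed_lag)
```

In this sample the lagging partition also shows "-" for CONSUMER-ID, which would suggest checking whether any consumer currently owns it.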
FAQ
Q. Is increasing lag always a Kafka cluster problem?
No. It is often an application throughput problem or a consumer-group behavior problem.
Q. Which setting should I inspect first?
Usually start with max.poll.interval.ms, max.poll.records, and actual handler latency.
Q. Should I reset offsets to make lag disappear?
Only after you understand the cause. Resetting offsets changes the symptom, not necessarily the reason it happened.
Read Next
- If rebalance churn is part of the same incident, open Kafka Rebalancing Too Often next.
- If the real bottleneck looks like slow work between polls, open Kafka max.poll.interval.ms Troubleshooting next.
- If the consumer looks idle rather than only slow, open Kafka Messages Not Consumed next.
- If you want a cross-broker backlog comparison, read RabbitMQ Queue Keeps Growing.
- If you want to step back to the wider routing map, go back to the Middleware Troubleshooting Guide.
Sources:
- https://kafka.apache.org/40/configuration/consumer-configs/
- https://kafka.apache.org/42/operations/
- https://kafka.apache.org/36/operations/monitoring/