Kafka producer retries often get treated like a tuning problem, but in real incidents retries are usually a timing symptom first. The producer is telling you that the delivery path needed more time, more acknowledgements, or a different guarantee budget than it was getting.
The short version is simple: do not lower or raise retries first. Read retries together with delivery.timeout.ms, request.timeout.ms, acknowledgement expectations, and broker health.
When this guide is the right fit
Start here if one of these sounds familiar:
- producer retries suddenly spike even though the app code did not change
- send latency and retry count rise together
- broker pressure, leader movement, or partition events happened before retries climbed
- the team is debating whether to change retries, acks, or producer timeouts
- ordering and duplicate risk became unclear after retry-related config changes
What to check in the first 10 minutes
Start with one cluster-side view and one producer-side view.
kafka-topics.sh --bootstrap-server <broker:9092> --describe --topic <topic>
Then compare that with the producer’s own retry and request-latency trend.
At this stage, answer only four questions:
- did retries rise with request latency or acknowledgement delay?
- did a leader move, broker restart, or ISR change happen first?
- is the producer constrained by delivery.timeout.ms or request.timeout.ms?
- are ordering and durability guarantees stronger than the current path can comfortably support?
retries is not the main timer
Kafka producer docs make a subtle but important point: users should generally prefer leaving retries unset and use delivery.timeout.ms to control retry behavior.
That changes how to read incidents:
- retries is not the full delivery budget
- the producer may keep retrying until the send succeeds, a non-transient error appears, or the delivery deadline expires
- a retry spike often means the path slowed down before it means the retry count is wrong
So “should we lower retries?” is often the wrong first question.
delivery.timeout.ms is the real delivery budget
Kafka docs define delivery.timeout.ms as the upper bound on the total time to report success or failure after send() returns.
That budget includes:
- time delayed before the record is sent
- time waiting for broker acknowledgement
- time spent on retriable failures
Kafka docs also say this value should be greater than or equal to request.timeout.ms + linger.ms.
That means a retry storm can be caused by either of these:
- the path got slower
- the budget became too small for the real path
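The documented bound above is easy to check mechanically. The following is a minimal sketch, not part of any Kafka client: the function name and the sample values are assumptions for illustration, and it only compares delivery.timeout.ms with request.timeout.ms + linger.ms.

```python
def check_delivery_budget(delivery_timeout_ms: int,
                          request_timeout_ms: int,
                          linger_ms: int = 0) -> dict:
    """Compare delivery.timeout.ms with request.timeout.ms + linger.ms.

    Hypothetical helper: it only does the arithmetic described in the
    Kafka producer docs and does not talk to a broker or a client.
    """
    minimum_ms = request_timeout_ms + linger_ms
    headroom_ms = delivery_timeout_ms - minimum_ms
    return {
        "minimum_ms": minimum_ms,
        "headroom_ms": headroom_ms,
        # The docs ask for delivery.timeout.ms >= request.timeout.ms + linger.ms.
        "satisfies_documented_bound": headroom_ms >= 0,
    }


# Illustrative values only; substitute the producer's real settings.
print(check_delivery_budget(120_000, 30_000, linger_ms=5))
```

A small or negative headroom points at the second cause in the list: the budget is too small for the real path, regardless of how the brokers are behaving.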
request.timeout.ms can create unnecessary retries
Kafka docs say request.timeout.ms is how long the client waits for a request response before retrying or failing, and that it should be larger than the broker's replica.lag.time.max.ms to reduce the possibility of message duplication from unnecessary producer retries.
This is one of the easiest producer misreads:
- broker replication is slower than usual
- the request timeout is too tight
- the client retries a path that might have succeeded with a slightly larger budget
So not every retry spike is a broker failure. Some are timeout-budget mismatches.
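One way to catch this mismatch early is to compare the two values side by side. This is a rough sketch with a hypothetical helper name and made-up sample values; it assumes you have already read request.timeout.ms from the producer config and replica.lag.time.max.ms from the broker config.

```python
def request_timeout_is_tight(request_timeout_ms: int,
                             replica_lag_time_max_ms: int) -> bool:
    """Return True when request.timeout.ms is not larger than the
    broker's replica.lag.time.max.ms.

    Hypothetical check based on the guidance in the Kafka producer docs;
    a True result is a hint to look closer, not proof of a problem.
    """
    return request_timeout_ms <= replica_lag_time_max_ms


# Illustrative values only.
print(request_timeout_is_tight(30_000, 10_000))  # producer waits longer than the lag limit
print(request_timeout_is_tight(10_000, 30_000))  # producer may give up before replication settles
```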
acks changes what “success” even means
Kafka producer docs are explicit here:
- acks=0: the producer does not wait for acknowledgement, and retries will not really help because the client will not generally know about failures
- acks=1: the leader acknowledges after its local write, before full follower replication
- acks=all: the leader waits for the full in-sync replica set and gives the strongest guarantee
That means retries must be read against the chosen guarantee. A stricter acknowledgement path can surface timing pressure more clearly, which is often what you want, but it changes the incident shape.
Idempotence and ordering are part of the retry story
Current Kafka producer docs say idempotence is enabled by default if no conflicting configurations are set.
They also say:
- idempotence requires acks=all
- idempotence requires retries > 0
- idempotence requires max.in.flight.requests.per.connection <= 5
There is also a critical ordering warning in the docs: if retries are allowed while enable.idempotence=false and max.in.flight.requests.per.connection > 1, records can be reordered after a failed send.
So a retry incident is never only about latency. It is also about what correctness guarantees the producer is trying to preserve.
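The idempotence requirements and the ordering warning can be written down as two small checks. This is a sketch under stated assumptions: the function names are hypothetical, the inputs are plain values you have read from the producer config, and nothing here calls a Kafka client.

```python
from typing import List


def idempotence_conflicts(acks: str, retries: int, max_in_flight: int) -> List[str]:
    """List the settings that conflict with enable.idempotence=true."""
    conflicts = []
    if acks not in ("all", "-1"):  # acks=-1 is equivalent to acks=all
        conflicts.append("acks must be all")
    if retries <= 0:
        conflicts.append("retries must be greater than 0")
    if max_in_flight > 5:
        conflicts.append("max.in.flight.requests.per.connection must be <= 5")
    return conflicts


def reordering_possible(enable_idempotence: bool, retries: int, max_in_flight: int) -> bool:
    """True when a failed send followed by a retry can reorder records."""
    return (not enable_idempotence) and retries > 0 and max_in_flight > 1


# Illustrative values only.
print(idempotence_conflicts(acks="1", retries=3, max_in_flight=10))
print(reordering_possible(enable_idempotence=False, retries=3, max_in_flight=5))
```

Running both checks before and after a retry-related config change makes it visible when a change meant to quiet retries also weakened ordering.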
A practical config block to read first
acks=all
enable.idempotence=true
delivery.timeout.ms=120000
request.timeout.ms=30000
max.in.flight.requests.per.connection=5
This is not a universal answer, but it shows the boundaries that most retry incidents revolve around.
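To see how these numbers interact, it helps to estimate how many full request timeouts fit inside the delivery budget. The sketch below is back-of-the-envelope arithmetic only. It assumes every attempt runs for the full request.timeout.ms, uses a fixed pause between attempts (the 100 ms retry.backoff.ms value is an illustrative assumption), and ignores linger time and batching, so a real producer will not behave exactly this way.

```python
def worst_case_attempts(delivery_timeout_ms: int,
                        request_timeout_ms: int,
                        retry_backoff_ms: int = 100) -> int:
    """Count attempts that can each time out fully within the delivery budget.

    Hypothetical helper: a simplified model, not the client's real scheduler.
    """
    if request_timeout_ms <= 0:
        raise ValueError("request_timeout_ms must be positive")
    attempts = 0
    elapsed_ms = 0
    # Each attempt consumes a full request timeout, then a fixed backoff.
    while elapsed_ms + request_timeout_ms <= delivery_timeout_ms:
        elapsed_ms += request_timeout_ms
        attempts += 1
        elapsed_ms += retry_backoff_ms
    return attempts


# With the values from the config block above and an assumed 100 ms backoff.
print(worst_case_attempts(120_000, 30_000))  # prints 3
```

Under this model the config block allows only a handful of fully timed-out attempts, which is why the timeouts, and not the retries count, decide when a send finally fails.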
Retry spikes usually fall into one of these patterns
| Pattern | What it usually means | Better next step |
|---|---|---|
| retries rise with request latency | delivery path slowed down | inspect broker acknowledgement timing and ISR health |
| retries rise after leader movement | cluster topology changed first | inspect partition leaders and recent broker events |
| retries rise after timeout changes | the waiting budget changed | inspect delivery.timeout.ms, request.timeout.ms, linger.ms |
| retries rise with idempotence disabled and high in-flight requests | correctness risk increased | inspect ordering guarantees and duplicate tolerance |
| retries rise while the broker stays healthy | client-side budget or network timing may be too tight | inspect timeouts and network jitter |
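The table can also be read as a small lookup from what you observed to what to inspect next. This is a minimal sketch: the Observation fields are hypothetical names for facts you gather by hand, and treating "high in-flight requests" as anything above 1 is an assumption taken from the ordering warning.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Observation:
    """Hypothetical summary of what was seen around the retry spike."""
    latency_rose: bool
    leader_moved: bool
    timeouts_changed: bool
    idempotence_enabled: bool
    max_in_flight: int
    broker_healthy: bool


def next_steps(obs: Observation) -> List[str]:
    """Map observations to the 'better next step' column of the table."""
    steps = []
    if obs.latency_rose:
        steps.append("inspect broker acknowledgement timing and ISR health")
    if obs.leader_moved:
        steps.append("inspect partition leaders and recent broker events")
    if obs.timeouts_changed:
        steps.append("inspect delivery.timeout.ms, request.timeout.ms, linger.ms")
    if not obs.idempotence_enabled and obs.max_in_flight > 1:
        steps.append("inspect ordering guarantees and duplicate tolerance")
    if obs.broker_healthy:
        steps.append("inspect timeouts and network jitter")
    return steps


# Illustrative incident: latency rose after a leader move.
obs = Observation(True, True, False, True, 5, False)
for step in next_steps(obs):
    print(step)
```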
Do not confuse retries with failure rate
A retry spike can still mean the system is eventually succeeding.
That is why these comparisons matter more than the raw retry count:
- retries versus request latency
- retries versus send error types
- retries versus broker restart or leadership events
- retries versus partition health and ISR stability
If retries rose but end-to-end delivery still succeeds, the producer is often absorbing cluster stress rather than failing outright.
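A quick way to make the first comparison concrete is to correlate the two series over the incident window. The sketch below computes a plain Pearson correlation; the sample numbers are made up, and it assumes both series were sampled at the same timestamps from whatever metrics system you use.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation of two equally sampled series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no correlation signal
    return cov / math.sqrt(var_x * var_y)


# Made-up per-minute samples: retry count and request latency in ms.
retries_per_min = [2, 3, 10, 25, 24, 6]
latency_ms = [40, 45, 120, 300, 280, 70]
print(round(pearson(retries_per_min, latency_ms), 2))
```

A value close to 1 supports the reading that retries and latency moved together; a low value suggests looking at error types or timeout changes before blaming the delivery path.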
Common causes
1. Broker acknowledgements got slower
The cluster is still functioning, but the producer is waiting longer than usual.
2. Request timeout is too tight for current broker conditions
The client retries work that might have completed with a better timeout budget.
3. Leader movement or broker restart changed the delivery path
Retries are the producer-side shadow of a cluster event.
4. Stronger guarantees made timing pressure more visible
acks=all and idempotence are often correct, but they surface real delay instead of hiding it.
5. Retry-related configs changed ordering semantics
The incident becomes partly a correctness problem, not only a performance one.
Common wrong starts
- lowering retries before reading delivery.timeout.ms
- treating retries as a client-only metric
- changing acks without deciding whether durability guarantees may weaken
- disabling idempotence casually to quiet a symptom
- ignoring leader movement and ISR changes on the broker side
A practical debugging order
1. Compare retries with request latency
If they moved together, the delivery path is the first suspect.
2. Check delivery.timeout.ms and request.timeout.ms
This tells you whether the producer had enough time budget for the real path.
3. Check acks, idempotence, and max.in.flight.requests.per.connection
This tells you what guarantees the producer is protecting.
4. Inspect recent broker, leader, and ISR events
Retries often start on the client only after the cluster changed first.
5. Decide whether the fix belongs in timing, broker health, or guarantees
Do not collapse those into one knob.
Checklist
- I compared retries with request latency
- I checked delivery.timeout.ms
- I checked request.timeout.ms
- I checked acks, idempotence, and in-flight request limits
- I checked recent leader or ISR changes
FAQ
Q. Should I lower retries when the count gets high?
Usually not first. Kafka docs suggest using delivery.timeout.ms to control retry behavior instead of treating retries as the main lever.
Q. Can retries rise even when the cluster is still basically healthy?
Yes. The producer may simply be absorbing slower acknowledgements or temporary path pressure.
Q. Why does acks=0 change the story so much?
Because the producer is not really waiting for broker acknowledgement, so retries will not help the same way.
Q. When do retries become an ordering risk?
When retries are enabled, idempotence is disabled, and max.in.flight.requests.per.connection is greater than 1.
Read Next
- If retries rose after broker hotspots or leadership skew, continue with Kafka Leader Imbalance.
- If retries are paired with consumers falling behind, continue with Kafka Consumer Lag Increasing.
- If group churn is part of the same incident, continue with Kafka Rebalancing Too Often.
- If the visible symptom is consumers not making progress, continue with Kafka Messages Not Consumed.
Sources:
- https://kafka.apache.org/42/configuration/producer-configs/
- https://kafka.apache.org/42/operations/basic-kafka-operations/