Feb 14, 2026

Last updated on Apr 14, 2026

Kafka Producer Retries: Read Timing and Guarantees Before Lowering the Number

Kafka producer retries often get treated like a tuning problem, but in real incidents retries are usually a timing symptom first. The producer is telling you that the delivery path needed more time or more acknowledgements than its current timeouts and guarantee settings allowed.

The short version is simple: do not lower or raise retries first. Read retries together with delivery.timeout.ms, request.timeout.ms, acknowledgement expectations, and broker health.

When this guide is the right fit

Start here if one of these sounds familiar:

producer retries suddenly spike even though the app code did not change
send latency and retry count rise together
broker pressure, leader movement, or partition events happened before retries climbed
the team is debating whether to change retries, acks, or producer timeouts
ordering and duplicate risk became unclear after retry-related config changes

What to check in the first 10 minutes

Start with one cluster-side view and one producer-side view.

kafka-topics.sh --bootstrap-server <broker:9092> --describe --topic <topic>

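Then compare that with the producer's own retry and request-latency trend, for example the record-retry-rate and request-latency-avg metrics that the Java client reports under producer-metrics.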

At this stage, answer only four questions:

did retries rise with request latency or acknowledgement delay?
did a leader move, broker restart, or ISR change happen first?
is the producer constrained by delivery.timeout.ms or request.timeout.ms?
are ordering and durability guarantees stronger than the current path can comfortably support?

`retries` is not the main timer

Kafka producer docs make a subtle but important point: users should generally leave retries unset and use delivery.timeout.ms to control retry behavior instead.

That changes how to read incidents:

retries is not the full delivery budget; in current clients it defaults to 2147483647, so the count alone rarely ends a send
the producer may keep retrying until the send succeeds, a non-transient error appears, or the delivery deadline expires
a retry spike often means the path slowed down before it means the retry count is wrong

So “should we lower retries?” is often the wrong first question.
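The stopping rule above can be sketched as a small simulation. This is a minimal illustration, not the Kafka client's actual code: `simulate_send`, the outcome labels, and the fixed `retry_backoff_ms` are hypothetical names chosen for the example, and each attempt is modeled as a `(duration_ms, outcome)` pair.

```python
MAX_INT = 2147483647  # default `retries` value in current Java clients


def simulate_send(attempts, delivery_timeout_ms, retries=MAX_INT, retry_backoff_ms=100):
    """Walk (duration_ms, outcome) attempts and report how the send ends.

    outcome is "ok", "retriable", or "fatal" (hypothetical labels).
    Returns (result, elapsed_ms, retries_used).
    """
    elapsed_ms = 0
    retries_used = 0
    for duration_ms, outcome in attempts:
        elapsed_ms += duration_ms
        if elapsed_ms >= delivery_timeout_ms:
            # The delivery deadline wins, no matter how many retries remain.
            return ("timeout", delivery_timeout_ms, retries_used)
        if outcome == "ok":
            return ("success", elapsed_ms, retries_used)
        if outcome == "fatal":
            # Non-transient error: retrying would not help.
            return ("failed", elapsed_ms, retries_used)
        if retries_used >= retries:
            return ("retries_exhausted", elapsed_ms, retries_used)
        retries_used += 1
        elapsed_ms += retry_backoff_ms
    # Ran out of modeled attempts before any stopping condition was met.
    return ("pending", elapsed_ms, retries_used)


# Every attempt takes a full 30 s and fails with a retriable error.
slow_path = [(30000, "retriable")] * 10
print(simulate_send(slow_path, delivery_timeout_ms=120000))
# → ('timeout', 120000, 3)
```

In this model the send ends on the deadline after only three retries, which is why the retry count by itself says little about what went wrong.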

`delivery.timeout.ms` is the real delivery budget

Kafka docs define delivery.timeout.ms as the upper bound on the total time to report success or failure after send() returns.

That budget includes:

time delayed before the record is sent
time waiting for broker acknowledgement
time spent on retriable failures

Kafka docs also say this value should be greater than or equal to request.timeout.ms + linger.ms.

That means a retry storm can be caused by either of these:

the path got slower
the budget became too small for the real path
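As a rough way to read the budget, the arithmetic can be sketched as follows. The helper name `delivery_budget` and its return keys are hypothetical, and the estimate deliberately ignores retry backoff and batching delay, so treat the result as an upper bound rather than a prediction.

```python
def delivery_budget(delivery_timeout_ms, request_timeout_ms, linger_ms=0):
    """Check the documented floor and estimate how many full request timeouts fit."""
    floor_ms = request_timeout_ms + linger_ms
    # Time left for request attempts once the linger delay is spent.
    usable_ms = delivery_timeout_ms - linger_ms
    return {
        "floor_ms": floor_ms,
        "meets_documented_floor": delivery_timeout_ms >= floor_ms,
        "full_timeouts_that_fit": max(0, usable_ms // request_timeout_ms),
    }


print(delivery_budget(delivery_timeout_ms=120000, request_timeout_ms=30000, linger_ms=0))
# → {'floor_ms': 30000, 'meets_documented_floor': True, 'full_timeouts_that_fit': 4}
```

With these example values, a request that always times out gets at most four attempts before the delivery deadline expires, regardless of how large retries is.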

`request.timeout.ms` can create unnecessary retries

Kafka docs say request.timeout.ms is how long the client waits for a request response before retrying or failing, and it should be larger than the broker’s replica.lag.time.max.ms to reduce unnecessary producer retries.

This is one of the easiest producer misreads:

broker replication is slower than usual
the request timeout is too tight
the client retries a path that might have succeeded with a slightly larger budget

So not every retry spike is a broker failure. Some are timeout-budget mismatches.
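A timeout-budget mismatch of this kind can be checked with a few comparisons. The sketch below is an illustration only: `request_timeout_findings`, the `observed_p99_ack_ms` input, and the 80 percent headroom threshold are assumptions made for the example, not values taken from the Kafka docs.

```python
def request_timeout_findings(request_timeout_ms, replica_lag_time_max_ms,
                             observed_p99_ack_ms=None, headroom=0.8):
    """Return human-readable findings about the request timeout budget."""
    findings = []
    if request_timeout_ms <= replica_lag_time_max_ms:
        findings.append(
            "request.timeout.ms is not larger than replica.lag.time.max.ms"
        )
    # Optional: compare against a measured acknowledgement latency, if you have one.
    if observed_p99_ack_ms is not None and observed_p99_ack_ms >= headroom * request_timeout_ms:
        findings.append(
            "observed acknowledgement latency is close to request.timeout.ms"
        )
    return findings


for finding in request_timeout_findings(30000, 30000, observed_p99_ack_ms=26000):
    print(finding)
```

An empty list does not prove the brokers are healthy; it only means the timeout budget is not the obvious suspect.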

`acks` changes what “success” even means

Kafka producer docs are explicit here:

acks=0: the producer does not wait for any acknowledgement, so the retries setting does not take effect because the client generally will not know about failures
acks=1: the leader acknowledges after local write, before full follower replication
acks=all: the leader waits for the full in-sync replica set and gives the strongest guarantee

That means retries must be read against the chosen guarantee. A stricter acknowledgement path surfaces timing pressure more clearly, which is often what you want, but it also means the same broker slowdown can show up as more retries and higher send latency than it would under a weaker setting.

Idempotence and ordering are part of the retry story

Current Kafka producer docs say idempotence is enabled by default if no conflicting configurations are set.

They also say:

idempotence requires acks=all
idempotence requires retries > 0
idempotence requires max.in.flight.requests.per.connection <= 5

There is also a critical ordering warning in the docs: if retries are allowed while enable.idempotence=false and max.in.flight.requests.per.connection > 1, records can be reordered after a failed send.

So a retry incident is never only about latency. It is also about what correctness guarantees the producer is trying to preserve.
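The rules above can be turned into a quick config check. This is a sketch under stated assumptions: `guarantee_findings` is a hypothetical helper, the config is a plain dict of strings such as one read from a properties file, and missing keys fall back to what I understand to be the defaults in current clients (acks=all, idempotence on, 5 in-flight requests, retries at 2147483647).

```python
def guarantee_findings(config):
    """Flag idempotence conflicts and reordering risk in a producer config dict."""
    acks = str(config.get("acks", "all"))
    retries = int(config.get("retries", 2147483647))
    in_flight = int(config.get("max.in.flight.requests.per.connection", 5))
    idempotent = str(config.get("enable.idempotence", "true")).lower() == "true"

    findings = []
    if idempotent:
        if acks not in ("all", "-1"):
            findings.append("idempotence requires acks=all")
        if retries <= 0:
            findings.append("idempotence requires retries > 0")
        if in_flight > 5:
            findings.append("idempotence requires max.in.flight.requests.per.connection <= 5")
    elif retries > 0 and in_flight > 1:
        # Retries without idempotence and with several in-flight requests.
        findings.append("records can be reordered after a failed send")
    return findings


risky = {"enable.idempotence": "false", "max.in.flight.requests.per.connection": "5"}
print(guarantee_findings(risky))
# → ['records can be reordered after a failed send']
```

Running a check like this before and after a proposed config change makes it visible when a latency fix quietly trades away ordering or duplicate protection.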

A practical config block to read first

acks=all
enable.idempotence=true
delivery.timeout.ms=120000
request.timeout.ms=30000
max.in.flight.requests.per.connection=5

This is not a universal answer, but it shows the boundaries that most retry incidents revolve around.

Retry spikes usually fall into one of these patterns

| Pattern | What it usually means | Better next step |
| --- | --- | --- |
| retries rise with request latency | delivery path slowed down | inspect broker acknowledgement timing and ISR health |
| retries rise after leader movement | cluster topology changed first | inspect partition leaders and recent broker events |
| retries rise after timeout changes | the waiting budget changed | inspect `delivery.timeout.ms`, `request.timeout.ms`, `linger.ms` |
| retries rise with idempotence disabled and high in-flight requests | correctness risk increased | inspect ordering guarantees and duplicate tolerance |
| retries rise while the broker stays healthy | client-side budget or network timing may be too tight | inspect timeouts and network jitter |
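The same patterns can be kept as a small lookup for incident notes. The signal names (`latency_rose`, `leader_moved`, and so on) are hypothetical flags you would set by hand after looking at your own dashboards; the meanings and next steps are copied from the table above.

```python
# (signal flag, what it usually means, better next step)
PATTERNS = [
    ("latency_rose", "delivery path slowed down",
     "inspect broker acknowledgement timing and ISR health"),
    ("leader_moved", "cluster topology changed first",
     "inspect partition leaders and recent broker events"),
    ("timeouts_changed", "the waiting budget changed",
     "inspect delivery.timeout.ms, request.timeout.ms, linger.ms"),
    ("idempotence_off_high_in_flight", "correctness risk increased",
     "inspect ordering guarantees and duplicate tolerance"),
    ("broker_healthy", "client-side budget or network timing may be too tight",
     "inspect timeouts and network jitter"),
]


def triage(signals):
    """Return (meaning, next_step) for every pattern whose flag is set."""
    return [(meaning, step) for flag, meaning, step in PATTERNS if signals.get(flag)]


for meaning, step in triage({"latency_rose": True, "leader_moved": True}):
    print(f"{meaning}: {step}")
```

More than one pattern can match at once, which is common when a leader move and a latency rise arrive together.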

Do not confuse retries with failure rate

A retry spike can still mean the system is eventually succeeding.

That is why these comparisons matter more than the raw retry count:

retries versus request latency
retries versus send error types
retries versus broker restart or leadership events
retries versus partition health and ISR stability

If retries rose but end-to-end delivery still succeeds, the producer is often absorbing cluster stress rather than failing outright.
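One way to make that comparison concrete is to read retries against failed sends over the same window. The sketch assumes you have per-window deltas of the producer counters record-send-total, record-retry-total, and record-error-total; the helper `retry_vs_failure` and its wording are hypothetical.

```python
def retry_vs_failure(record_send_total, record_retry_total, record_error_total):
    """Compare retried sends with failed sends for one observation window."""
    if record_send_total <= 0:
        return {"retry_ratio": 0.0, "error_ratio": 0.0, "reading": "no sends in window"}
    retry_ratio = record_retry_total / record_send_total
    error_ratio = record_error_total / record_send_total
    if record_error_total > 0:
        reading = "failing: some sends ended in errors"
    elif record_retry_total > 0:
        reading = "absorbing stress: retries without failed sends"
    else:
        reading = "quiet: no retries and no errors"
    return {"retry_ratio": retry_ratio, "error_ratio": error_ratio, "reading": reading}


print(retry_vs_failure(record_send_total=1000, record_retry_total=200, record_error_total=0)["reading"])
# → absorbing stress: retries without failed sends
```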

Common causes

1. Broker acknowledgements got slower

The cluster is still functioning, but the producer is waiting longer than usual.

2. Request timeout is too tight for current broker conditions

The client retries work that might have completed with a better timeout budget.

3. Leader movement or broker restart changed the delivery path

Retries are the producer-side shadow of a cluster event.

4. Stronger guarantees made timing pressure more visible

acks=all and idempotence are often correct, but they surface real delay instead of hiding it. Weakening them to quiet the retries makes the incident partly a correctness problem, not only a performance one.

Common wrong starts

lowering retries before reading delivery.timeout.ms
treating retries as a client-only metric
changing acks without deciding whether durability guarantees may weaken
disabling idempotence casually to quiet a symptom
ignoring leader movement and ISR changes on the broker side

A practical debugging order

1. Compare retries with request latency

If they moved together, the delivery path is the first suspect.

2. Check `delivery.timeout.ms` and `request.timeout.ms`

This tells you whether the producer had enough time budget for the real path.

3. Check `acks`, idempotence, and `max.in.flight.requests.per.connection`

This tells you what guarantees the producer is protecting.

4. Inspect recent broker, leader, and ISR events

Retries often start on the client only after the cluster changed first.

5. Decide whether the fix belongs in timing, broker health, or guarantees

Do not collapse those into one knob.

Checklist

I compared retries with request latency
I checked delivery.timeout.ms
I checked request.timeout.ms
I checked acks, idempotence, and in-flight request limits
I checked recent leader or ISR changes

FAQ

Q. Should I lower `retries` when the count gets high?

Usually not first. Kafka docs suggest using delivery.timeout.ms to control retry behavior instead of treating retries as the main lever.

Q. Can retries rise even when the cluster is still basically healthy?

Yes. The producer may simply be absorbing slower acknowledgements or temporary path pressure.

Q. Why does `acks=0` change the story so much?

With acks=0 the producer does not wait for a broker acknowledgement, so it generally never learns that a send failed and the retries setting has nothing to act on.

Q. When do retries become an ordering risk?

When retries are enabled, idempotence is disabled, and max.in.flight.requests.per.connection is greater than 1.
