Kafka Producer Retries: Read Timing and Guarantees Before Lowering the Number


Kafka producer retries often get treated like a tuning problem, but in real incidents retries are usually a timing symptom first. The producer is telling you that the delivery path needed more time, more acknowledgements, or a different guarantee budget than it was getting.

The short version is simple: do not lower or raise retries first. Read retries together with delivery.timeout.ms, request.timeout.ms, acknowledgement expectations, and broker health.


When this guide is the right fit

Start here if one of these sounds familiar:

  • producer retries suddenly spike even though the app code did not change
  • send latency and retry count rise together
  • broker pressure, leader movement, or partition events happened before retries climbed
  • the team is debating whether to change retries, acks, or producer timeouts
  • ordering and duplicate risk became unclear after retry-related config changes

What to check in the first 10 minutes

Start with one cluster-side view and one producer-side view.

kafka-topics.sh --bootstrap-server <broker:9092> --describe --topic <topic>

Then compare that with the producer’s own retry and request-latency trend.

At this stage, answer only four questions:

  • did retries rise with request latency or acknowledgement delay?
  • did a leader move, broker restart, or ISR change happen first?
  • is the producer constrained by delivery.timeout.ms or request.timeout.ms?
  • are ordering and durability guarantees stronger than the current path can comfortably support?
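For the cluster-side view, a small script can flag partitions whose ISR has shrunk. The following is a minimal sketch, assuming the usual `Partition: … Leader: … Replicas: … Isr: …` line layout printed by `kafka-topics.sh --describe`; the function name `find_shrunken_isr` is hypothetical and the exact output columns can vary between Kafka versions.

```python
import re

# Matches partition lines from `kafka-topics.sh --describe`, for example:
# "Topic: orders  Partition: 0  Leader: 1  Replicas: 1,2,3  Isr: 1,2,3"
# The topic summary line ("PartitionCount: ...") does not match this pattern.
PARTITION_LINE = re.compile(
    r"Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>\S+)\s+"
    r"Replicas:\s*(?P<replicas>[\d,]+)\s+"
    r"Isr:\s*(?P<isr>[\d,]*)"
)


def find_shrunken_isr(describe_output):
    """Return partitions whose ISR is smaller than the replica list."""
    flagged = []
    for line in describe_output.splitlines():
        match = PARTITION_LINE.search(line)
        if match is None:
            continue  # summary line or unrelated output
        replicas = [r for r in match["replicas"].split(",") if r]
        isr = [r for r in match["isr"].split(",") if r]
        if len(isr) < len(replicas):
            flagged.append({
                "partition": int(match["partition"]),
                "leader": match["leader"],
                # Broker ids listed as replicas but absent from the ISR.
                "missing": sorted(set(replicas) - set(isr), key=int),
            })
    return flagged
```

Feed it the captured command output as a string. An empty result means every partition listed had a full ISR at the moment of the snapshot; it says nothing about what happened before the snapshot was taken.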

retries is not the main timer

Kafka producer docs make a subtle but important point: users should generally leave retries unset and instead use delivery.timeout.ms to control retry behavior.

That changes how to read incidents:

  • retries is not the full delivery budget
  • the producer may keep retrying until the send succeeds, a non-transient error appears, or the delivery deadline expires
  • a retry spike often means the path slowed down before it means the retry count is wrong

So “should we lower retries?” is often the wrong first question.

delivery.timeout.ms is the real delivery budget

Kafka docs define delivery.timeout.ms as the upper bound on the total time to report success or failure after send() returns.

That budget includes:

  • time the record is delayed before it is sent (for example, while batching under linger.ms)
  • time waiting for broker acknowledgement
  • time spent on retriable failures

Kafka docs also say this value should be greater than or equal to request.timeout.ms + linger.ms.

That means a retry storm can be caused by either of these:

  • the path got slower
  • the budget became too small for the real path
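The budget arithmetic can be made concrete. This is a rough sketch, not the producer's actual scheduling logic: it assumes every attempt waits the full request.timeout.ms and that retry.backoff.ms is a fixed pause between attempts. Both function names are hypothetical.

```python
def check_delivery_budget(delivery_timeout_ms, request_timeout_ms, linger_ms):
    """True when delivery.timeout.ms >= request.timeout.ms + linger.ms."""
    return delivery_timeout_ms >= request_timeout_ms + linger_ms


def worst_case_attempts(delivery_timeout_ms, request_timeout_ms,
                        linger_ms, retry_backoff_ms):
    """Estimate how many full request timeouts fit in the delivery budget.

    Simplifying assumptions (not the client's real algorithm): each attempt
    consumes the whole request timeout, and the backoff between attempts is
    a constant retry_backoff_ms.
    """
    if request_timeout_ms <= 0:
        raise ValueError("request_timeout_ms must be positive")
    remaining = delivery_timeout_ms - linger_ms
    attempts = 0
    while remaining >= request_timeout_ms:
        attempts += 1
        remaining -= request_timeout_ms + retry_backoff_ms
    return attempts
```

Under these assumptions, a 120000 ms delivery budget with a 30000 ms request timeout and a 100 ms backoff leaves room for only three fully timed-out attempts, which is why a slower path can exhaust the budget long before any large retries value matters.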

request.timeout.ms can create unnecessary retries

Kafka docs say request.timeout.ms is how long the client waits for a request response before retrying or failing, and it should be larger than the broker’s replica.lag.time.max.ms to reduce unnecessary producer retries.

This is one of the easiest producer misreads:

  • broker replication is slower than usual
  • the request timeout is too tight
  • the client retries a path that might have succeeded with a slightly larger budget

So not every retry spike is a broker failure. Some are timeout-budget mismatches.
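A quick comparison of the two settings can make this mismatch visible. The sketch below is illustrative only: `request_timeout_headroom` is a hypothetical helper, and replica.lag.time.max.ms has to be read from the broker configuration because the producer does not expose it.

```python
def request_timeout_headroom(request_timeout_ms, replica_lag_time_max_ms):
    """Classify the producer request timeout against the broker lag limit.

    "tight"      : request timeout is below replica.lag.time.max.ms
    "borderline" : the two values are equal
    "ok"         : request timeout is larger, as the docs recommend
    """
    if request_timeout_ms < replica_lag_time_max_ms:
        return "tight"
    if request_timeout_ms == replica_lag_time_max_ms:
        return "borderline"
    return "ok"
```

A "tight" result does not prove the timeout caused the retries; it only says the client may give up on a request while the broker is still waiting on a lagging follower.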

acks changes what “success” even means

Kafka producer docs are explicit here:

  • acks=0: the producer does not wait for any acknowledgement, so the retries setting does not take effect because the client generally never learns about failures
  • acks=1: the leader acknowledges after local write, before full follower replication
  • acks=all: the leader waits for the full in-sync replica set and gives the strongest guarantee

That means retries must be read against the chosen guarantee. A stricter acknowledgement path surfaces timing pressure more clearly, which is often what you want, but it changes what the incident looks like: slow followers now show up as producer latency and retries instead of staying hidden.

Idempotence and ordering are part of the retry story

Current Kafka producer docs say idempotence is enabled by default if no conflicting configurations are set.

They also say:

  • idempotence requires acks=all
  • idempotence requires retries > 0
  • idempotence requires max.in.flight.requests.per.connection <= 5

There is also a critical ordering warning in the docs: if retries are allowed while enable.idempotence=false and max.in.flight.requests.per.connection > 1, records can be reordered after a failed send.

So a retry incident is never only about latency. It is also about what correctness guarantees the producer is trying to preserve.
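The constraints above can be checked mechanically against a producer config. This is a minimal sketch: `check_idempotence` is a hypothetical helper, and the fallback defaults in the code (acks=all, idempotence on, a very large retries value, five in-flight requests) are assumptions that match recent client versions and may differ on older ones.

```python
def check_idempotence(config):
    """Return a list of guarantee problems found in a producer config dict.

    Keys are standard producer property names; values may be strings or ints.
    Missing keys fall back to assumed recent-client defaults.
    """
    acks = str(config.get("acks", "all"))
    idempotent = str(config.get("enable.idempotence", "true")).lower() == "true"
    retries = int(config.get("retries", 2147483647))
    in_flight = int(config.get("max.in.flight.requests.per.connection", 5))

    problems = []
    if idempotent:
        if acks not in ("all", "-1"):  # -1 is the numeric form of acks=all
            problems.append("idempotence requires acks=all")
        if retries <= 0:
            problems.append("idempotence requires retries > 0")
        if in_flight > 5:
            problems.append(
                "idempotence requires max.in.flight.requests.per.connection <= 5")
    elif retries > 0 and in_flight > 1:
        problems.append(
            "retries without idempotence and more than one in-flight request "
            "can reorder records after a failed send")
    return problems
```

An empty list means the config is consistent with the documented constraints; it does not say whether the guarantees chosen are the right ones for the workload.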

A practical config block to read first

acks=all
enable.idempotence=true
delivery.timeout.ms=120000
request.timeout.ms=30000
max.in.flight.requests.per.connection=5

This is not a universal answer, but it shows the boundaries that most retry incidents revolve around.

Retry spikes usually fall into one of these patterns

Pattern | What it usually means | Better next step
retries rise with request latency | delivery path slowed down | inspect broker acknowledgement timing and ISR health
retries rise after leader movement | cluster topology changed first | inspect partition leaders and recent broker events
retries rise after timeout changes | the waiting budget changed | inspect delivery.timeout.ms, request.timeout.ms, linger.ms
retries rise with idempotence disabled and high in-flight requests | correctness risk increased | inspect ordering guarantees and duplicate tolerance
retries rise while the broker stays healthy | client-side budget or network timing may be too tight | inspect timeouts and network jitter
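These patterns can be turned into a simple lookup for a runbook. The sketch below is one possible encoding, assuming an on-call engineer fills in the observations by hand; the `Observations` fields and the `next_steps` function are hypothetical names, and the step strings are the table's own.

```python
from dataclasses import dataclass


@dataclass
class Observations:
    """Hand-filled incident observations, one flag per pattern."""
    latency_rose_with_retries: bool = False
    leader_moved_first: bool = False
    timeouts_changed_recently: bool = False
    idempotence_off_high_in_flight: bool = False
    broker_looks_healthy: bool = False


def next_steps(obs):
    """Map each observed pattern to its suggested next step."""
    steps = []
    if obs.latency_rose_with_retries:
        steps.append("inspect broker acknowledgement timing and ISR health")
    if obs.leader_moved_first:
        steps.append("inspect partition leaders and recent broker events")
    if obs.timeouts_changed_recently:
        steps.append("inspect delivery.timeout.ms, request.timeout.ms, linger.ms")
    if obs.idempotence_off_high_in_flight:
        steps.append("inspect ordering guarantees and duplicate tolerance")
    if obs.broker_looks_healthy:
        steps.append("inspect timeouts and network jitter")
    return steps
```

Several patterns can hold at once, so the function returns every matching step rather than picking a single cause.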

Do not confuse retries with failure rate

A retry spike can still mean the system is eventually succeeding.

That is why these comparisons matter more than the raw retry count:

  • retries versus request latency
  • retries versus send error types
  • retries versus broker restart or leadership events
  • retries versus partition health and ISR stability

If retries rose but end-to-end delivery still succeeds, the producer is often absorbing cluster stress rather than failing outright.
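One way to keep retries and failures apart is to read the producer's record-retry-rate and record-error-rate against record-send-rate. The following is a hedged sketch: the function name and both thresholds are hypothetical and should be tuned against a known-good baseline for the workload.

```python
def read_retry_spike(record_send_rate, record_retry_rate, record_error_rate,
                     retry_threshold=0.05, error_threshold=0.01):
    """Classify a producer metrics sample.

    Inputs are per-second rates taken from the producer metrics of the same
    names. The thresholds are illustrative assumptions, not Kafka defaults.
    """
    if record_send_rate <= 0:
        return "no traffic"
    retry_ratio = record_retry_rate / record_send_rate
    error_ratio = record_error_rate / record_send_rate
    if error_ratio > error_threshold:
        return "failing"    # records are being dropped, not just delayed
    if retry_ratio > retry_threshold:
        return "absorbing"  # retries are high but sends still succeed
    return "normal"
```

An "absorbing" result points toward broker timing and ISR health, while "failing" means the delivery budget or a non-retriable error is already costing records.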

Common causes

1. Broker acknowledgements got slower

The cluster is still functioning, but the producer is waiting longer than usual.

2. Request timeout is too tight for current broker conditions

The client retries work that might have completed with a better timeout budget.

3. Leader movement or broker restart changed the delivery path

Retries are the producer-side shadow of a cluster event.

4. Stronger guarantees made timing pressure more visible

acks=all and idempotence are often correct, but they surface real delay instead of hiding it.

Weakening them to quiet retries turns the incident into partly a correctness problem, not only a performance one.

Common wrong starts

  • lowering retries before reading delivery.timeout.ms
  • treating retries as a client-only metric
  • changing acks without deciding whether durability guarantees may weaken
  • disabling idempotence casually to quiet a symptom
  • ignoring leader movement and ISR changes on the broker side

A practical debugging order

1. Compare retries with request latency

If they moved together, the delivery path is the first suspect.

2. Check delivery.timeout.ms and request.timeout.ms

This tells you whether the producer had enough time budget for the real path.

3. Check acks, idempotence, and max.in.flight.requests.per.connection

This tells you what guarantees the producer is protecting.

4. Inspect recent broker, leader, and ISR events

Retries often start on the client only after the cluster changed first.

5. Decide whether the fix belongs in timing, broker health, or guarantees

Do not collapse those into one knob.

Checklist

  • I compared retries with request latency
  • I checked delivery.timeout.ms
  • I checked request.timeout.ms
  • I checked acks, idempotence, and in-flight request limits
  • I checked recent leader or ISR changes

FAQ

Q. Should I lower retries when the count gets high?

Usually not first. Kafka docs suggest using delivery.timeout.ms to control retry behavior instead of treating retries as the main lever.

Q. Can retries rise even when the cluster is still basically healthy?

Yes. The producer may simply be absorbing slower acknowledgements or temporary path pressure.

Q. Why does acks=0 change the story so much?

Because the producer does not wait for a broker acknowledgement, it generally never learns about failures, so the retries setting has no effect.

Q. When do retries become an ordering risk?

When retries are enabled, idempotence is disabled, and max.in.flight.requests.per.connection is greater than 1.
