Kafka Leader Imbalance: Why Some Brokers Stay Hot

When a few Kafka brokers stay hotter than the rest after failures or restarts, the cause is often not random traffic growth but a leadership distribution that never settled back into balance. Teams sometimes tune clients or add capacity when the real issue is that some brokers are still carrying too much leader responsibility.

The short version: first separate genuine workload skew from uneven partition leadership, then compare hot brokers with restart history and preferred-replica behavior before tuning clients.


Quick Answer

If some Kafka brokers stay much hotter than others, first check whether they are carrying more partition leaders after a restart or failure event. Most incidents come from uneven leadership distribution, preferred-replica recovery that never normalized, or workload skew that operators only noticed after the cluster recovered.

What to Check First

  • did the hotspot pattern begin after broker restart, failure, or maintenance?
  • do the hottest brokers also hold more leaders?
  • is traffic uneven because of workload shape or because leadership stayed skewed?
  • are producer retries or latency concentrated on the same brokers?
  • is preferred-replica rebalancing expected and enabled for this cluster?

Start by separating workload skew from leadership skew

A hot broker does not always mean the workload itself is uneven.

Sometimes traffic really is concentrated on one topic or partition family. But just as often, the cluster recovered from disruption and leadership stayed uneven, so some brokers keep serving more client-facing work than others.

That distinction matters because one fix belongs in partitioning or workload design, while the other belongs in cluster leadership distribution.

What leader imbalance usually means

Kafka operations docs explain that after a broker restarts, it comes back as a follower for its partitions.

That means reads and writes still hit whichever brokers currently hold leadership. If leadership never settles back into a balanced shape, some brokers remain disproportionately hot while others do much less client-facing work.
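A quick way to see this is to count leaders per broker straight from `kafka-topics.sh --describe` output. The sketch below is a minimal first pass; the sample output and topic names are illustrative, and it assumes the usual tab-separated describe format.

```python
from collections import Counter

def leader_counts(describe_output: str) -> Counter:
    """Count partition leaders per broker from `kafka-topics.sh --describe` output."""
    counts = Counter()
    for line in describe_output.splitlines():
        if "Partition:" not in line:
            continue  # skip the topic summary line
        for field in line.split("\t"):
            field = field.strip()
            if field.startswith("Leader:"):
                counts[int(field.split(":")[1])] += 1
    return counts

# Illustrative describe output: broker 1 leads two of three partitions.
sample = (
    "Topic: orders\tPartition: 0\tLeader: 1\tReplicas: 1,2,3\tIsr: 1,2,3\n"
    "Topic: orders\tPartition: 1\tLeader: 1\tReplicas: 2,1,3\tIsr: 2,1,3\n"
    "Topic: orders\tPartition: 2\tLeader: 3\tReplicas: 3,1,2\tIsr: 3,1,2\n"
)
print(leader_counts(sample))  # → Counter({1: 2, 3: 1})
```

If the brokers with the highest leader counts are also your hottest brokers, leadership skew moves from suspicion to evidence.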

Preferred replicas are part of the story

Kafka uses the idea of preferred replicas to restore healthier leadership distribution.

Kafka docs note that clusters can automatically try to move leadership back toward preferred replicas with auto.leader.rebalance.enable=true.
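The preferred replica is the first broker listed in a partition's replica set, so describe output is enough to spot partitions whose leadership never moved back. This sketch reuses the tab-separated format above; the field parsing is an assumption to adapt if your Kafka version prints extra columns.

```python
def off_preferred(describe_output: str):
    """Return (topic, partition) pairs whose current leader is not the
    preferred (first-listed) replica, i.e. leadership never moved back."""
    skewed = []
    for line in describe_output.splitlines():
        if "Partition:" not in line:
            continue
        # Turn "Key: value" fields into a dict for this partition line.
        fields = dict(
            f.strip().split(": ", 1) for f in line.split("\t") if ": " in f
        )
        preferred = fields["Replicas"].split(",")[0]
        if fields["Leader"] != preferred:
            skewed.append((fields["Topic"], fields["Partition"]))
    return skewed

sample = (
    "Topic: orders\tPartition: 0\tLeader: 1\tReplicas: 1,2,3\tIsr: 1,2,3\n"
    "Topic: orders\tPartition: 1\tLeader: 1\tReplicas: 2,1,3\tIsr: 2,1,3\n"
    "Topic: orders\tPartition: 2\tLeader: 3\tReplicas: 3,1,2\tIsr: 3,1,2\n"
)
print(off_preferred(sample))  # → [('orders', '1')]: broker 2 is preferred but 1 leads
```

A long list from a check like this, on a cluster where automatic rebalancing is supposed to be on, is exactly the "recovered but never normalized" state described above.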

This is why leader imbalance is often not a case of "Kafka is just busy." It is a case of "the cluster recovered, but leadership never normalized."

Start with broker history and leadership distribution

Useful first questions:

  • did a broker recently restart or crash?
  • did leadership shift during an incident and never return?
  • are the hottest brokers simply the ones carrying more partition leaders now?
  • was traffic already uneven before the leadership change?

If the hotspot pattern appeared after disruption, leadership imbalance becomes a strong suspect very quickly.

Do not confuse leader imbalance with consumer lag

These problems can appear together, but they are not the same.

  • leader imbalance is about where partition leadership sits
  • consumer lag is about whether consumers keep up with records

One can influence the other, but the right debugging entry point is different.

Common causes

1. Broker restarts left leadership uneven

The cluster recovered enough to serve traffic, but leadership never returned to a balanced distribution.

2. Preferred-replica behavior is not restoring the expected shape

The cluster is healthy enough to operate, but not balanced the way operators expect.

3. Operators mistake hot leaders for generic traffic growth

The cluster looks busy, but only some brokers are disproportionately hot.

4. Producer pressure amplifies the imbalance

Retries, latency, and broker-side load become worse on brokers carrying too much leadership.

A quick triage table

| Symptom | Most likely branch | Check first |
| --- | --- | --- |
| Hot brokers appeared after restart | leadership skew after recovery | leader counts and broker event history |
| One topic family was already hotter before failures | workload skew | topic and partition traffic distribution |
| Producer retries rose on the same brokers | leader concentration under write load | leader placement and producer metrics |
| Cluster is healthy but heat stays uneven | preferred replicas never normalized | rebalance settings and leader distribution |
| Consumer lag and hot brokers appeared together | related but separate issues | whether lag follows leadership skew or consumer churn |

A practical debugging order

1. Compare hot brokers with leader placement

Do the busiest brokers also hold more leaders? If yes, leadership skew is no longer a vague suspicion.
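One way to make this comparison concrete is to put each broker's share of traffic next to its share of leaders. The sketch below assumes you can pull per-broker throughput (for example a BytesInPerSec-style metric) and leader counts from your monitoring stack; the numbers are illustrative.

```python
def skew_report(bytes_in_per_broker: dict, leaders_per_broker: dict) -> dict:
    """For each broker, return (traffic share, leader share).
    If the two shares track each other, suspect leadership skew;
    if traffic is hot where leader share is average, suspect workload skew."""
    total_bytes = sum(bytes_in_per_broker.values())
    total_leaders = sum(leaders_per_broker.values())
    return {
        broker: (
            bytes_in_per_broker[broker] / total_bytes,
            leaders_per_broker.get(broker, 0) / total_leaders,
        )
        for broker in bytes_in_per_broker
    }

# Illustrative numbers: broker 1 carries 60% of traffic AND 60% of leaders,
# so the heat follows leadership rather than workload shape.
report = skew_report({1: 600, 2: 200, 3: 200}, {1: 6, 2: 2, 3: 2})
print(report)  # → {1: (0.6, 0.6), 2: (0.2, 0.2), 3: (0.2, 0.2)}
```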

2. Inspect restart and failure history

If the hotspot pattern started after a broker event, that timing matters more than a lot of guesswork.

3. Confirm whether preferred-replica rebalancing is expected in this cluster

You need to know what normal should look like in your environment before calling the current state wrong.

4. Compare workload heat with leadership distribution

This is where you separate real traffic skew from cluster-state skew.

5. Decide whether the fix belongs in cluster leadership or workload design

Do not tune producers first if the cluster itself is holding leadership unevenly.

Quick commands to ground the investigation

kafka-topics.sh --bootstrap-server <broker:9092> --describe --topic <topic>
kafka-configs.sh --bootstrap-server <broker:9092> --entity-type brokers --describe
grep -i leader <broker-log-file>

Use these to inspect partition leadership, broker configuration context, and logs showing leadership movement or imbalance.
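If the grep turns up many leadership lines, counting how often each partition appears gives a rough churn picture. The log wording varies across Kafka versions, so the pattern and sample lines below are assumptions to adapt to your own broker logs.

```python
import re
from collections import Counter

# Assumed shape of leader-related log lines; adjust for your Kafka version.
LEADER_RE = re.compile(r"leader.*?partition\s+([\w.-]+)", re.IGNORECASE)

def leadership_changes(log_text: str) -> Counter:
    """Count how often each partition appears in leader-related log lines."""
    counts = Counter()
    for line in log_text.splitlines():
        m = LEADER_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts

# Hypothetical log excerpt for illustration only.
log = "\n".join([
    "[2024-05-01 10:00:01] Broker 2 became leader for partition orders-0",
    "[2024-05-01 10:02:11] Completed leader election for partition orders-0",
    "[2024-05-01 10:02:12] Broker 1 became leader for partition orders-1",
    "[2024-05-01 10:03:00] Starting log cleaner",
])
print(leadership_changes(log))  # → Counter({'orders-0': 2, 'orders-1': 1})
```

Partitions that show up repeatedly are the ones whose leadership moved during the incident and are worth checking against the preferred-replica layout.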

A quick branch for the first pass

When some brokers stay hot, ask which of these became visible first:

  • broker restart or failure
  • uneven leader counts
  • producer retries and latency on the same nodes
  • obvious traffic skew before any broker event

That branch usually tells you whether to investigate cluster recovery behavior or workload design first.

Bottom Line

Do not treat hot Kafka brokers as automatic proof that the cluster needs more capacity. First prove whether leadership distribution became uneven after recovery, then separate that from true workload skew. If leadership is the real issue, client tuning and capacity changes will only hide the imbalance for a while.

FAQ

Q. Does leader imbalance always mean Kafka is broken?

No. It often means the cluster recovered from disruption but leadership stayed uneven.

Q. What is the fastest first step?

Check whether the hotspot pattern appeared after broker restarts or failures.

Q. What should I compare this with next?

Usually "Kafka Producer Retries Too Much" or "Kafka Consumer Lag Increasing", depending on which symptom is visible first.

Q. When should I stop blaming workload skew?

When the hottest brokers clearly also hold a disproportionate number of leaders after a recovery event.
