Kafka Leader Imbalance: Count Leaders Before Adding Brokers


When a few Kafka brokers stay much hotter than the rest, teams often blame generic traffic growth or add capacity too early. In many incidents, the more useful first question is simpler: are those brokers carrying more partition leaders than the others?

The short version: separate workload skew from leadership skew first. Brokers that stay hot after a restart or recovery often mean leadership never moved back to the preferred replica layout.


When this guide is the right fit

Start here if one of these sounds familiar:

  • a few brokers stay hot after restart, failure, or maintenance
  • restarted brokers come back cool while other brokers remain overloaded
  • producer latency or retries are concentrated on the same brokers
  • the team is not sure whether the cluster has a traffic-shape problem or a leader-placement problem
  • preferred-replica behavior is supposed to help, but the cluster still feels uneven

What to check in the first 10 minutes

Begin with the topic description, which lists the current leader and replica list for each partition, and keep the broker config at hand:

kafka-topics.sh --bootstrap-server <broker:9092> --describe --topic <topic>

Then compare hot brokers with their leader counts and recent restart history.
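
To turn the describe output into numbers, a short script can tally how many partitions each broker currently leads. The sketch below is one way to do that in Python. It assumes the usual Leader: and Replicas: fields on each partition line; the count_leaders helper and the sample text are illustrative and not part of any Kafka tooling.

```python
import re
from collections import Counter

# Matches partition lines from `kafka-topics.sh --describe`.
# The topic summary line ("PartitionCount: ...") does not match.
PARTITION_LINE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>\S+)\s+Replicas:\s*(?P<replicas>[\d,]+)"
)


def count_leaders(describe_output: str) -> Counter:
    """Return a Counter mapping broker id -> partitions it currently leads."""
    leaders = Counter()
    for line in describe_output.splitlines():
        match = PARTITION_LINE.search(line)
        if match is None:
            continue  # summary line or unrelated text
        leader = match.group("leader")
        if leader.isdigit():  # ignore partitions without a live leader
            leaders[int(leader)] += 1
    return leaders


# Hypothetical sample: broker 1 was restarted and leads nothing.
SAMPLE = """\
Topic: orders\tPartitionCount: 4\tReplicationFactor: 3\tConfigs:
\tTopic: orders\tPartition: 0\tLeader: 2\tReplicas: 1,2,3\tIsr: 1,2,3
\tTopic: orders\tPartition: 1\tLeader: 2\tReplicas: 2,3,1\tIsr: 2,3,1
\tTopic: orders\tPartition: 2\tLeader: 3\tReplicas: 3,1,2\tIsr: 3,1,2
\tTopic: orders\tPartition: 3\tLeader: 2\tReplicas: 1,3,2\tIsr: 1,3,2
"""

if __name__ == "__main__":
    for broker, count in sorted(count_leaders(SAMPLE).items()):
        print(f"broker {broker}: {count} leaders")
```

Running the same function over real describe output gives a per-broker leader count to set beside CPU, network, and request metrics.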

At this stage, answer only four questions:

  • did the hotspot start after restart, failure, or maintenance?
  • do the hottest brokers hold noticeably more leaders?
  • is the traffic itself skewed, or is leadership skewed?
  • is automatic leader rebalancing expected and enabled in this cluster?

Start by separating workload skew from leadership skew

This split saves a lot of wasted tuning:

What you see | What it usually means | Better next step
one topic family was hot even before broker events | workload skew | inspect partitioning, key distribution, traffic shape
brokers became uneven after restart or failure | leadership skew | inspect leader placement and recovery behavior
hot brokers also show higher producer latency or retries | client traffic is following leader concentration | inspect leaders before tuning producers
restarted brokers are cool while others stay hot | leadership did not shift back | inspect preferred replicas and rebalance settings

If you do not know which branch you are on, client tuning and capacity decisions become guesswork.

After restart, a broker comes back as follower first

Kafka operations docs describe an important recovery detail: when a broker stops or crashes, leadership for its partitions moves to other replicas. When the broker restarts, it comes back only as a follower for all of its partitions, so it is not used for client reads and writes.

That means:

  • the broker can come back healthy
  • but it still does not handle the same client-facing leader load immediately
  • other brokers may keep serving reads and writes through their leader partitions

So “the broker is back” does not mean leadership is balanced again.

Preferred replicas are the cluster’s idea of normal

Kafka docs define the preferred replica as the first replica in a partition's replica list. With automatic rebalancing enabled, the controller tries to move leadership back to preferred replicas so leader load is spread more evenly.

This is why hot-broker incidents often have a second phase:

  • the cluster recovered enough to serve traffic
  • but leadership never returned to the preferred shape
  • some brokers stayed hot while others stayed cool
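
The gap between current and preferred leadership can be made explicit. The following sketch assumes partition state has already been collected into simple records (the PartitionState type and the sample values are hypothetical) and reports which brokers are missing leadership they would normally hold.

```python
from collections import Counter
from typing import NamedTuple, Tuple


class PartitionState(NamedTuple):
    topic: str
    partition: int
    leader: int                # broker id of the current leader
    replicas: Tuple[int, ...]  # replica list; first entry is the preferred replica


def non_preferred(partitions):
    """Partitions whose current leader is not the preferred replica."""
    return [p for p in partitions if p.replicas and p.leader != p.replicas[0]]


def missing_leaders_by_broker(partitions):
    """Per preferred broker, count partitions it should lead but does not."""
    return Counter(p.replicas[0] for p in non_preferred(partitions))


# Hypothetical state after broker 1 restarted and stayed a follower.
STATE = [
    PartitionState("orders", 0, 2, (1, 2, 3)),
    PartitionState("orders", 1, 2, (2, 3, 1)),
    PartitionState("orders", 2, 3, (3, 1, 2)),
    PartitionState("orders", 3, 2, (1, 3, 2)),
]

if __name__ == "__main__":
    for broker, count in missing_leaders_by_broker(STATE).items():
        print(f"broker {broker} is preferred but not leading {count} partitions")
```

A broker that appears here with a high count while running cool is a strong hint that the cluster is still in the second phase described above.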

auto.leader.rebalance.enable is the main background control

Kafka broker configs define auto.leader.rebalance.enable, and the default is true.

If enabled, the controller periodically checks leadership distribution. Two related broker configs shape that behavior:

  • leader.imbalance.check.interval.seconds
  • leader.imbalance.per.broker.percentage

Together they answer:

  • how often the controller checks for leader imbalance
  • what share of imbalance per broker is tolerated before the controller triggers a leader rebalance

If operators expect the cluster to heal leadership on its own, these are the first settings to confirm.
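
As a rough model of what the threshold measures, the sketch below computes, for each broker, the share of partitions where it is the preferred replica but not the current leader, and compares that share with leader.imbalance.per.broker.percentage. This follows the documented meaning of the setting as far as can be told; the function names and input shape are assumptions, and the controller's actual implementation may differ in detail.

```python
from collections import defaultdict


def imbalance_by_broker(partitions):
    """partitions: iterable of (leader, replicas) pairs.

    Returns {broker: ratio}, where ratio is the fraction of partitions
    preferring that broker which are currently led by another broker.
    """
    preferred_total = defaultdict(int)
    not_leading = defaultdict(int)
    for leader, replicas in partitions:
        preferred = replicas[0]  # first replica is the preferred one
        preferred_total[preferred] += 1
        if leader != preferred:
            not_leading[preferred] += 1
    return {b: not_leading[b] / preferred_total[b] for b in preferred_total}


def brokers_over_threshold(partitions, percentage=10):
    """Brokers whose imbalance ratio exceeds the configured percentage."""
    limit = percentage / 100
    ratios = imbalance_by_broker(partitions)
    return sorted(b for b, ratio in ratios.items() if ratio > limit)


# Hypothetical state: broker 1 is preferred for two partitions, leads none.
PARTITIONS = [
    (2, (1, 2, 3)),
    (2, (2, 3, 1)),
    (3, (3, 1, 2)),
    (2, (1, 3, 2)),
]

if __name__ == "__main__":
    print(imbalance_by_broker(PARTITIONS))
    print(brokers_over_threshold(PARTITIONS, percentage=10))
```

If a broker sits above the threshold for longer than the check interval and nothing changes, that is a reason to confirm the rebalance settings on the running brokers rather than assume them.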

Leader count often explains broker heat better than CPU charts alone

A broker with more leaders usually handles more client-facing work:

  • producer requests go to leaders
  • consumers typically fetch from leaders
  • request handling, disk work, and network load concentrate there first

That is why broker heat and leader concentration often travel together. If the same nodes are hot and also own more leaders, leadership skew is no longer just a vague suspicion.
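
One simple way to test that suspicion is to rank brokers by load and by leader count and see whether the same ids come out on top. The sketch below assumes a load figure per broker has been pulled from monitoring (the metric, the values, and the function name are hypothetical).

```python
def hot_and_leader_heavy(load, leaders, top_n=2):
    """Return broker ids that are both among the hottest and the most leader-heavy.

    load:    {broker_id: load metric, higher means hotter}
    leaders: {broker_id: number of partitions led}
    """
    hottest = sorted(load, key=load.get, reverse=True)[:top_n]
    most_leaders = sorted(
        load, key=lambda broker: leaders.get(broker, 0), reverse=True
    )[:top_n]
    return sorted(set(hottest) & set(most_leaders))


# Hypothetical numbers: broker 2 is hot and also leads most partitions.
LOAD = {1: 0.20, 2: 0.90, 3: 0.50}
LEADERS = {2: 3, 3: 1}

if __name__ == "__main__":
    print(hot_and_leader_heavy(LOAD, LEADERS, top_n=1))
```

An empty result points back toward workload skew; a full overlap makes leader placement the first thing to inspect.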

Producer symptoms often follow leader imbalance

Leader imbalance is not only a broker operations problem. It often leaks upward into application symptoms:

  • higher producer request latency on the hot brokers
  • retries concentrated around the same partitions or brokers
  • more visible pressure on topics led by the busiest brokers

That is why the Kafka Producer Retries guide is often the next useful read.

Controlled shutdown matters for planned maintenance

Kafka operations docs describe controlled shutdown as a way to move leadership away from a broker before it goes down, when replication conditions allow it.

That means planned maintenance and unclean failure produce different cluster shapes:

  • controlled shutdown tends to reduce abrupt leadership skew
  • crash or forced stop often leaves the cluster with a rougher leader distribution

If the hotspot started after planned work was done without a clean handoff, this branch is worth revisiting.

A practical broker config block to review

auto.leader.rebalance.enable=true
leader.imbalance.check.interval.seconds=300
leader.imbalance.per.broker.percentage=10

These are the Kafka defaults rather than magic values, but they frame what the cluster is expected to do when leader load drifts.
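
When reviewing a broker's properties file, it helps to see which of the three settings are set explicitly and which fall back to defaults. The sketch below is a minimal parser for key=value lines; the FALLBACKS table reuses the values from the block above, the function names are made up for illustration, and the parsing ignores less common properties syntax such as ":" separators and line continuations.

```python
# Fallback values taken from the config block above.
FALLBACKS = {
    "auto.leader.rebalance.enable": "true",
    "leader.imbalance.check.interval.seconds": "300",
    "leader.imbalance.per.broker.percentage": "10",
}


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def rebalance_settings(text):
    """Return {key: (effective value, explicitly set?)} for the three settings."""
    props = parse_properties(text)
    return {
        key: (props.get(key, default), key in props)
        for key, default in FALLBACKS.items()
    }


# Hypothetical broker config with auto rebalancing switched off.
EXAMPLE = """\
# leader balancing
auto.leader.rebalance.enable=false
leader.imbalance.check.interval.seconds = 600
"""

if __name__ == "__main__":
    for key, (value, explicit) in rebalance_settings(EXAMPLE).items():
        source = "explicit" if explicit else "fallback"
        print(f"{key}={value} ({source})")
```

A file only shows what was configured at startup, so it is worth confirming the values against the running broker as well.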

Common causes

1. Restarted brokers returned as followers and stayed that way

The cluster recovered, but leader load never drifted back into balance.

2. Preferred leader recovery did not normalize the cluster

Operators expected automatic rebalancing, but the leader shape stayed skewed.

3. Real workload skew is being mistaken for leadership skew

The traffic pattern itself may still be uneven.

4. Hot leaders are amplifying producer-side symptoms

Latency and retries rise where the leader concentration sits.

5. Maintenance was not as graceful as operators assumed

Leader movement during restart or shutdown left the cluster with an uneven post-recovery state.

Common wrong starts

  • adding brokers before counting leaders
  • tuning producers before checking which brokers hold leadership
  • assuming “broker restarted successfully” means “leader balance recovered”
  • blaming consumer lag alone for a broker heat pattern
  • expecting preferred-replica healing without checking rebalance config

A practical debugging order

1. Compare broker heat with leader counts

If the hottest brokers also hold more leaders, the investigation gets much sharper immediately.

2. Check restart, failure, and maintenance history

Timing matters a lot in leadership incidents.

3. Check preferred-replica expectations and auto rebalance settings

You need to know what the cluster is supposed to do on its own.

4. Compare with workload shape

This is where you separate true traffic skew from recovery skew.

5. Decide whether the fix belongs in cluster recovery or workload design

Do not solve the wrong problem with more clients or more brokers.

Checklist

  • I compared hot brokers with leader counts
  • I checked whether the pattern started after restart or failure
  • I checked preferred-replica expectations
  • I checked auto.leader.rebalance.enable and related thresholds
  • I separated workload skew from leadership skew

FAQ

Q. Why does a restarted broker come back cool while others stay hot?

Because restarted brokers return as followers first and may not immediately regain leadership.

Q. Does hot-broker imbalance always mean traffic itself is uneven?

No. It often means leader placement stayed uneven after recovery.

Q. What broker settings matter most for automatic leader balancing?

auto.leader.rebalance.enable, leader.imbalance.check.interval.seconds, and leader.imbalance.per.broker.percentage.

Q. Why should producer retries be considered together with leader imbalance?

Because producer requests go to leaders, so hot leader concentration often shows up as higher latency and retries on the same brokers.
