When a few Kafka brokers stay much hotter than the rest, teams often blame generic traffic growth or add capacity too early. In many incidents, the more useful first question is simpler: are those brokers carrying more partition leaders than the others?
The short version: separate workload skew from leadership skew first. Hot brokers after a restart or recovery often mean that leadership never moved back toward the preferred replica layout.
When this guide is the right fit
Start here if one of these sounds familiar:
- a few brokers stay hot after restart, failure, or maintenance
- restarted brokers come back cool while other brokers remain overloaded
- producer latency or retries are concentrated on the same brokers
- the team is not sure whether the cluster has a traffic-shape problem or a leader-placement problem
- preferred-replica behavior is supposed to help, but the cluster still feels uneven
What to check in the first 10 minutes
Begin with topic description and broker config context:
kafka-topics.sh --bootstrap-server <broker:9092> --describe --topic <topic>
Then compare hot brokers with their leader counts and recent restart history.
At this stage, answer only four questions:
- did the hotspot start after restart, failure, or maintenance?
- do the hottest brokers hold noticeably more leaders?
- is the traffic itself skewed, or is leadership skewed?
- is automatic leader rebalancing expected and enabled in this cluster?
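To answer the leader-count question quickly, the describe output can be tallied per broker. The following is a minimal sketch, assuming the partition rows contain `Partition:` and `Leader:` fields in that order; the `SAMPLE` text and the function name `count_leaders` are hypothetical and only illustrate the shape of the data.

```python
import re
from collections import Counter

# Partition rows in `kafka-topics.sh --describe` output look roughly like:
#   Topic: orders  Partition: 0  Leader: 1  Replicas: 1,2,3  Isr: 1,2,3
# The topic summary row uses "PartitionCount:", so it does not match.
PARTITION_ROW = re.compile(r"Partition:\s*(\d+)\s+Leader:\s*(\S+)")


def count_leaders(describe_output: str) -> Counter:
    """Count how many partitions each broker id currently leads."""
    leaders = Counter()
    for line in describe_output.splitlines():
        match = PARTITION_ROW.search(line)
        # Rows without a numeric leader (for example "none") are skipped.
        if match and match.group(2).isdigit():
            leaders[int(match.group(2))] += 1
    return leaders


# Hypothetical sample; real output is tab-separated and may carry more columns.
SAMPLE = """\
Topic: orders  TopicId: abc123  PartitionCount: 4  ReplicationFactor: 3  Configs:
  Topic: orders  Partition: 0  Leader: 1  Replicas: 1,2,3  Isr: 1,2,3
  Topic: orders  Partition: 1  Leader: 1  Replicas: 2,3,1  Isr: 3,1,2
  Topic: orders  Partition: 2  Leader: 3  Replicas: 3,1,2  Isr: 3,1,2
  Topic: orders  Partition: 3  Leader: 1  Replicas: 2,1,3  Isr: 1,3,2
"""

if __name__ == "__main__":
    for broker, count in sorted(count_leaders(SAMPLE).items()):
        print(f"broker {broker}: {count} leaders")
```

In this sample broker 2 leads nothing even though it is in sync for every partition, which is the "came back cool" pattern described above. In practice the same function could be fed the describe output for all topics rather than a single one.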
Start by separating workload skew from leadership skew
This split saves a lot of wasted tuning:
| What you see | What it usually means | Better next step |
|---|---|---|
| one topic family was hot even before broker events | workload skew | inspect partitioning, key distribution, traffic shape |
| brokers became uneven after restart or failure | leadership skew | inspect leader placement and recovery behavior |
| hot brokers also show higher producer latency or retries | client traffic is following leader concentration | inspect leaders before tuning producers |
| restarted brokers are cool while others stay hot | leadership did not shift back | inspect preferred replicas and rebalance settings |
If you do not know which branch you are on, client tuning and capacity decisions become guesswork.
After restart, a broker comes back as follower first
Kafka operations docs describe an important recovery detail: when a broker stops or crashes, leadership for its partitions moves to other replicas, and when the broker restarts it begins only as a follower for all of its partitions.
That means:
- the broker can come back healthy
- but it still does not handle the same client-facing leader load immediately
- other brokers may keep serving reads and writes through their leader partitions
So “the broker is back” does not mean leadership is balanced again.
Preferred replicas are the cluster’s idea of normal
Kafka docs describe the preferred replica as the replica listed first for a partition. Clusters try to restore leadership to preferred replicas so leader load is distributed more evenly.
This is why hot-broker incidents often have a second phase:
- the cluster recovered enough to serve traffic
- but leadership never returned to the preferred shape
- some brokers stayed hot while others stayed cool
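Because the preferred replica is the first entry in a partition's replica list, the "never returned to the preferred shape" condition can be checked directly from describe output. This is a rough sketch under the assumption that each partition row carries `Topic:`, `Partition:`, `Leader:`, and `Replicas:` fields in that order; the sample text and the helper name `non_preferred_partitions` are made up for illustration.

```python
import re

ROW = re.compile(
    r"Topic:\s*(\S+)\s+Partition:\s*(\d+)\s+Leader:\s*(\S+)\s+Replicas:\s*([\d,]+)"
)


def non_preferred_partitions(describe_output: str):
    """Return (topic, partition, leader, preferred) for every partition
    whose current leader is not the first replica in its replica list."""
    drifted = []
    for line in describe_output.splitlines():
        match = ROW.search(line)
        if not match:
            continue  # summary rows and blank lines
        topic, partition, leader, replicas = match.groups()
        preferred = int(replicas.split(",")[0])
        current = int(leader) if leader.isdigit() else None  # None: no leader
        if current != preferred:
            drifted.append((topic, int(partition), current, preferred))
    return drifted


# Hypothetical sample: broker 2 restarted and rejoined as a follower.
SAMPLE = """\
  Topic: orders  Partition: 0  Leader: 1  Replicas: 1,2,3  Isr: 1,2,3
  Topic: orders  Partition: 1  Leader: 1  Replicas: 2,3,1  Isr: 3,1,2
  Topic: orders  Partition: 2  Leader: 3  Replicas: 3,1,2  Isr: 3,1,2
  Topic: orders  Partition: 3  Leader: 1  Replicas: 2,1,3  Isr: 1,3,2
"""

if __name__ == "__main__":
    for topic, partition, leader, preferred in non_preferred_partitions(SAMPLE):
        print(f"{topic}-{partition}: led by {leader}, preferred {preferred}")
```

A long list here after a restart suggests the second phase described above: the cluster is serving traffic, but leadership has not yet moved back.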
auto.leader.rebalance.enable is the main background control
Kafka broker configs define auto.leader.rebalance.enable, and the default is true.
If enabled, the controller periodically checks leadership distribution. Two related broker configs shape that behavior:
- leader.imbalance.check.interval.seconds
- leader.imbalance.per.broker.percentage
Together they answer:
- how often the cluster checks for imbalance
- how much deviation is tolerated before rebalancing is considered necessary
If operators expect the cluster to heal leadership on its own, these are the first settings to confirm.
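To make the threshold concrete, here is a small sketch of a per-broker imbalance ratio. It assumes the ratio is "partitions where this broker is the preferred replica but not the current leader, divided by all partitions where it is the preferred replica". That is one reading of how the percentage is applied, not something the sources above spell out, and the function names are hypothetical.

```python
from collections import defaultdict


def imbalance_by_broker(partitions):
    """partitions: iterable of (leader, replicas), with replicas[0] preferred.
    Returns {broker: share of its preferred partitions that it does not lead}."""
    preferred_total = defaultdict(int)
    not_led = defaultdict(int)
    for leader, replicas in partitions:
        preferred = replicas[0]
        preferred_total[preferred] += 1
        if leader != preferred:
            not_led[preferred] += 1
    return {b: not_led[b] / preferred_total[b] for b in preferred_total}


def brokers_over_threshold(partitions, percentage=10):
    """Broker ids whose ratio exceeds leader.imbalance.per.broker.percentage."""
    ratios = imbalance_by_broker(partitions)
    return sorted(b for b, ratio in ratios.items() if ratio > percentage / 100)


# Hypothetical cluster state: broker 2 is preferred for two partitions
# but currently leads neither of them.
SAMPLE = [
    (1, [1, 2, 3]),
    (1, [2, 3, 1]),
    (3, [3, 1, 2]),
    (1, [2, 1, 3]),
]

if __name__ == "__main__":
    print(imbalance_by_broker(SAMPLE))
    print(brokers_over_threshold(SAMPLE, percentage=10))
```

Under this reading, the broker that exceeds the threshold is the cool one that lost its leaders, not the hot one that picked them up, which is worth keeping in mind when reading the numbers.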
Leader count often explains broker heat better than CPU charts alone
A broker with more leaders usually handles more client-facing work:
- producer requests go to leaders
- consumers typically fetch from leaders
- request handling, disk work, and network load concentrate there first
That is why broker heat and leader concentration often travel together. If the same nodes are hot and also own more leaders, leadership skew is no longer just a vague suspicion.
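One way to put a number on leader concentration is to compare each broker's leader count with an even share. The sketch below is illustrative only: the function name `leader_skew`, the broker ids, and the counts are all assumptions. It takes an explicit broker list so that brokers leading zero partitions still show up.

```python
def leader_skew(leader_counts, broker_ids):
    """Return {broker: leaders / even_share}.

    1.0 means the broker holds exactly its fair share of leaders,
    0.0 means it leads nothing, and values above 1.0 mean it carries extra.
    """
    total = sum(leader_counts.get(b, 0) for b in broker_ids)
    if total == 0 or not broker_ids:
        return {b: 0.0 for b in broker_ids}
    even_share = total / len(broker_ids)
    return {b: leader_counts.get(b, 0) / even_share for b in broker_ids}


if __name__ == "__main__":
    # Hypothetical counts: broker 2 restarted and currently leads nothing.
    counts = {1: 30, 3: 30}
    for broker, skew in leader_skew(counts, [1, 2, 3]).items():
        print(f"broker {broker}: {skew:.2f}x even share")
```

If the brokers with the highest skew are also the ones running hot on CPU, network, or request latency, leadership skew becomes the leading explanation rather than a guess.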
Producer symptoms often follow leader imbalance
Leader imbalance is not only a broker operations problem. It often leaks upward into application symptoms:
- higher producer request latency on the hot brokers
- retries concentrated around the same partitions or brokers
- more visible pressure on topics led by the busiest brokers
That is why Kafka Producer Retries is often the next useful guide.
Controlled shutdown matters for planned maintenance
Kafka operations docs describe controlled shutdown as a way to move leadership away from a broker before it goes down, when replication conditions allow it.
That means planned maintenance and unclean failure produce different cluster shapes:
- controlled shutdown tends to reduce abrupt leadership skew
- crash or forced stop often leaves the cluster with a rougher leader distribution
If the hotspot started after planned work was done without a clean handoff, this branch is worth revisiting.
A practical broker config block to review
auto.leader.rebalance.enable=true
leader.imbalance.check.interval.seconds=300
leader.imbalance.per.broker.percentage=10
These are not magic values, but they frame what the cluster is expected to do when leader load drifts.
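When reviewing these settings across several brokers, it can help to print the effective values, falling back to the defaults shown above when a key is absent. This sketch reads a properties-style text only; it does not see dynamic overrides applied at runtime, and the sample file content and function name are hypothetical.

```python
# Values from the config block above, used when a key is not set in the file.
DEFAULTS = {
    "auto.leader.rebalance.enable": "true",
    "leader.imbalance.check.interval.seconds": "300",
    "leader.imbalance.per.broker.percentage": "10",
}


def effective_rebalance_settings(properties_text: str) -> dict:
    """Return the three leader-rebalance settings, with file values
    overriding the defaults. Comments and unrelated keys are ignored."""
    settings = dict(DEFAULTS)
    for raw in properties_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")) or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip()
        if key in settings:
            settings[key] = value.strip()
    return settings


# Hypothetical broker properties content.
SAMPLE = """\
# leader balancing was turned off during a past migration
auto.leader.rebalance.enable=false
num.network.threads=8
"""

if __name__ == "__main__":
    for key, value in effective_rebalance_settings(SAMPLE).items():
        print(f"{key}={value}")
```

A broker that shows `auto.leader.rebalance.enable=false` here would explain why leadership stayed skewed long after recovery.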
Common causes
1. Restarted brokers returned as followers and stayed that way
The cluster recovered, but leader load never drifted back into balance.
2. Preferred leader recovery did not normalize the cluster
Operators expected automatic rebalancing, but the leader shape stayed skewed.
3. Real workload skew is being mistaken for leadership skew
The traffic pattern itself may still be uneven.
4. Hot leaders are amplifying producer-side symptoms
Latency and retries rise where the leader concentration sits.
5. Maintenance was not as graceful as operators assumed
Leader movement during restart or shutdown left the cluster with an uneven post-recovery state.
Common wrong starts
- adding brokers before counting leaders
- tuning producers before checking which brokers hold leadership
- assuming “broker restarted successfully” means “leader balance recovered”
- blaming consumer lag alone for a broker heat pattern
- expecting preferred-replica healing without checking rebalance config
A practical debugging order
1. Compare broker heat with leader counts
If the hottest brokers also hold more leaders, the investigation gets much sharper immediately.
2. Check restart, failure, and maintenance history
Timing matters a lot in leadership incidents.
3. Check preferred-replica expectations and auto rebalance settings
You need to know what the cluster is supposed to do on its own.
4. Compare with workload shape
This is where you separate true traffic skew from recovery skew.
5. Decide whether the fix belongs in cluster recovery or workload design
Do not solve the wrong problem with more clients or more brokers.
Checklist
- I compared hot brokers with leader counts
- I checked whether the pattern started after restart or failure
- I checked preferred-replica expectations
- I checked auto.leader.rebalance.enable and related thresholds
- I separated workload skew from leadership skew
FAQ
Q. Why does a restarted broker come back cool while others stay hot?
Because restarted brokers return as followers first and may not immediately regain leadership.
Q. Does hot-broker imbalance always mean traffic itself is uneven?
No. It often means leader placement stayed uneven after recovery.
Q. What broker settings matter most for automatic leader balancing?
auto.leader.rebalance.enable, leader.imbalance.check.interval.seconds, and leader.imbalance.per.broker.percentage.
Q. Why should producer retries be considered together with leader imbalance?
Because producer requests go to leaders, so hot leader concentration often shows up as higher latency and retries on the same brokers.
Read Next
- If producer latency and retry counts rose on the same hot brokers, continue with Kafka Producer Retries.
- If consumer backlog is the bigger symptom, continue with Kafka Consumer Lag Increasing.
- If group instability is part of the same recovery story, continue with Kafka Rebalancing Too Often.
- If consumers stopped making progress after the broker event, continue with Kafka Messages Not Consumed.
Sources:
- https://kafka.apache.org/42/operations/basic-kafka-operations/
- https://kafka.apache.org/42/configuration/broker-configs