When a few Kafka brokers stay hotter than the rest after failures or restarts, the cause is usually not random traffic growth but a leadership distribution that never settled back into balance. Teams sometimes tune clients or add capacity when the real issue is that some brokers are still carrying too much leader responsibility.
The short version: first separate genuine workload skew from uneven partition leadership, then compare hot brokers with restart history and preferred-replica behavior before tuning clients.
Quick Answer
If some Kafka brokers stay much hotter than others, first check whether they are carrying more partition leaders after a restart or failure event. Most incidents come from uneven leadership distribution, preferred-replica recovery that never normalized, or workload skew that operators only noticed after the cluster recovered.
What to Check First
- did the hotspot pattern begin after broker restart, failure, or maintenance?
- do the hottest brokers also hold more leaders?
- is traffic uneven because of workload shape or because leadership stayed skewed?
- are producer retries or latency concentrated on the same brokers?
- is preferred-replica rebalancing expected and enabled for this cluster?
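The leader-count check above is the fastest one to ground in data. A minimal offline sketch, assuming the usual `kafka-topics.sh --describe` line format (the sample output is inlined here so the pipeline runs without a cluster; against a live cluster you would pipe the describe command into the same awk):

```shell
# Count how many partition leaders each broker holds, based on the
# "Leader: <id>" field of kafka-topics.sh --describe output.
# Inlined sample output; replace with:
#   kafka-topics.sh --bootstrap-server <broker:9092> --describe | awk ...
describe_output='
Topic: orders Partition: 0 Leader: 1 Replicas: 1,2,3 Isr: 1,2,3
Topic: orders Partition: 1 Leader: 1 Replicas: 2,3,1 Isr: 2,3,1
Topic: orders Partition: 2 Leader: 1 Replicas: 3,1,2 Isr: 3,1,2
Topic: payments Partition: 0 Leader: 2 Replicas: 2,1,3 Isr: 2,1,3
'
printf '%s\n' "$describe_output" |
  awk '/Leader:/ { for (i = 1; i <= NF; i++) if ($i == "Leader:") leaders[$(i+1)]++ }
       END { for (b in leaders) print "broker " b ": " leaders[b] " leader(s)" }' |
  sort
```

If one broker's count is far above the others, leadership skew is already a concrete finding rather than a hunch.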
Start by separating workload skew from leadership skew
A hot broker does not always mean the workload itself is uneven.
Sometimes traffic really is concentrated on one topic or partition family. But just as often, the cluster recovered from disruption and leadership stayed uneven, so some brokers keep serving more client-facing work than others.
That distinction matters because one fix belongs in partitioning or workload design, while the other belongs in cluster leadership distribution.
What leader imbalance usually means
Kafka operations docs explain that after a broker restarts, it comes back as a follower for its partitions.
That means reads and writes still hit whichever brokers currently hold leadership. If leadership never settles back into a balanced shape, some brokers remain disproportionately hot while others do much less client-facing work.
Preferred replicas are part of the story
Kafka uses the idea of preferred replicas to restore healthier leadership distribution.
Kafka docs note that clusters can automatically try to move leadership back toward preferred replicas with auto.leader.rebalance.enable=true.
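The relevant broker settings look like this (the values shown are the Kafka defaults; treat this as an illustrative fragment, not a recommendation for your cluster):

```properties
# Let the controller move leadership back toward preferred replicas.
auto.leader.rebalance.enable=true
# Trigger a rebalance when more than this percentage of a broker's
# leaders are not the preferred replica.
leader.imbalance.per.broker.percentage=10
# How often the controller checks for leader imbalance.
leader.imbalance.check.interval.seconds=300
```

Note that the check interval and imbalance threshold mean leadership can stay skewed for a while even with auto-rebalancing enabled, which is worth knowing before calling the cluster broken.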
This is why leader imbalance is often not "Kafka is just busy." It is "the cluster recovered, but leadership never normalized."
Start with broker history and leadership distribution
Useful first questions:
- did a broker recently restart or crash?
- did leadership shift during an incident and never return?
- are the hottest brokers simply the ones carrying more partition leaders now?
- was traffic already uneven before the leadership change?
If the hotspot pattern appeared after disruption, leadership imbalance becomes a strong suspect very quickly.
Do not confuse leader imbalance with consumer lag
These problems can appear together, but they are not the same.
- leader imbalance is about where partition leadership sits
- consumer lag is about whether consumers keep up with records
One can influence the other, but the right debugging entry point is different.
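Because the entry points differ, the checks differ too. Lag is inspected per consumer group rather than per broker; the `LAG` column below tells you whether consumers keep up, independent of where partition leaders sit (broker address and group name are placeholders):

```shell
# Per-partition lag for one consumer group: compare the LAG column
# with leader placement before deciding which problem you have.
kafka-consumer-groups.sh --bootstrap-server <broker:9092> \
  --describe --group <group>
```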
Common causes
1. Broker restarts left leadership uneven
The cluster recovered enough to serve traffic, but leadership never returned to a balanced distribution.
2. Preferred-replica behavior is not restoring the expected shape
The cluster is healthy enough to operate, but not balanced the way operators expect.
3. Operators mistake hot leaders for generic traffic growth
The cluster looks busy, but only some brokers are disproportionately hot.
4. Producer pressure amplifies the imbalance
Retries, latency, and broker-side load become worse on brokers carrying too much leadership.
A quick triage table
| Symptom | Most likely branch | Check first |
|---|---|---|
| Hot brokers appeared after restart | leadership skew after recovery | leader counts and broker event history |
| One topic family was already hotter before failures | workload skew | topic and partition traffic distribution |
| Producer retries rose on the same brokers | leader concentration under write load | leader placement and producer metrics |
| Cluster is healthy but heat stays uneven | preferred replicas never normalized | rebalance settings and leader distribution |
| Consumer lag and hot brokers appeared together | related but separate issues | whether lag follows leadership skew or consumer churn |
A practical debugging order
1. Compare hot brokers with leader placement
Do the busiest brokers also hold more leaders? If yes, leadership skew is no longer a vague suspicion.
2. Inspect restart and failure history
If the hotspot pattern started after a broker event, that timing is stronger evidence than any amount of guesswork.
3. Confirm whether preferred-replica rebalancing is expected in this cluster
You need to know what normal should look like in your environment before calling the current state wrong.
4. Compare workload heat with leadership distribution
This is where you separate real traffic skew from cluster-state skew.
5. Decide whether the fix belongs in cluster leadership or workload design
Do not tune producers first if the cluster itself is holding leadership unevenly.
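Step 3 of this order can be made concrete: the preferred replica for a partition is the first broker id in its `Replicas` list, so any partition whose current leader differs from that first id has not normalized. A sketch under that assumption, again with sample describe output inlined so it runs offline:

```shell
# Flag partitions whose current leader is not the preferred replica
# (the first broker id in the Replicas list). Replace the sample with:
#   kafka-topics.sh --bootstrap-server <broker:9092> --describe | awk ...
describe_output='
Topic: orders Partition: 0 Leader: 1 Replicas: 1,2,3 Isr: 1,2,3
Topic: orders Partition: 1 Leader: 1 Replicas: 2,3,1 Isr: 2,3,1
Topic: orders Partition: 2 Leader: 3 Replicas: 3,1,2 Isr: 3,1,2
'
printf '%s\n' "$describe_output" | awk '
  /Leader:/ {
    leader = ""; preferred = ""
    for (i = 1; i <= NF; i++) {
      if ($i == "Leader:")   leader = $(i+1)
      if ($i == "Replicas:") { split($(i+1), r, ","); preferred = r[1] }
    }
    if (leader != preferred)
      print $2, "partition", $4, "leader=" leader, "preferred=" preferred
  }'
```

An empty result means leadership already matches the preferred layout, which pushes the investigation back toward workload skew.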
Quick commands to ground the investigation
kafka-topics.sh --bootstrap-server <broker:9092> --describe --topic <topic>
kafka-configs.sh --bootstrap-server <broker:9092> --entity-type brokers --describe
grep -i leader <broker-log-file>
Use these to inspect partition leadership, broker configuration context, and logs showing leadership movement or imbalance.
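If the inspection confirms that leadership never returned to the preferred replicas and auto-rebalancing is disabled or delayed, a manual preferred-replica election can move leaders back. A hedged sketch, assuming Kafka 2.4 or later, with the broker address as a placeholder; run it only once you understand why leadership drifted in the first place:

```shell
# Ask the controller to move leadership back to preferred replicas
# for every partition where it currently differs.
kafka-leader-election.sh --bootstrap-server <broker:9092> \
  --election-type PREFERRED \
  --all-topic-partitions
```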
A quick branch for the first pass
When some brokers stay hot, ask which of these became visible first:
- broker restart or failure
- uneven leader counts
- producer retries and latency on the same nodes
- obvious traffic skew before any broker event
That branch usually tells you whether to investigate cluster recovery behavior or workload design first.
Bottom Line
Do not treat hot Kafka brokers as automatic proof that the cluster needs more capacity. First prove whether leadership distribution became uneven after recovery, then separate that from true workload skew. If leadership is the real issue, client tuning and capacity changes will only hide the imbalance for a while.
FAQ
Q. Does leader imbalance always mean Kafka is broken?
No. It often means the cluster recovered from disruption but leadership stayed uneven.
Q. What is the fastest first step?
Check whether the hotspot pattern appeared after broker restarts or failures.
Q. What should I compare this with next?
Usually Kafka Producer Retries Too Much or Kafka Consumer Lag Increasing, depending on which symptom is visible first.
Q. When should I stop blaming workload skew?
When the hottest brokers clearly also hold a disproportionate number of leaders after a recovery event.
Read Next
- If hot brokers also show producer-side retry pressure, continue with Kafka Producer Retries Too Much.
- If the visible symptom is consumer backlog, continue with Kafka Consumer Lag Increasing.
- If broker instability turned into consumer-group churn, continue with Kafka Rebalancing Too Often.
Sources:
- https://kafka.apache.org/40/operations/basic-kafka-operations/
- https://kafka.apache.org/42/operations/consumer-rebalance-protocol/