When a Go service seems stuck on a WaitGroup, the real problem is usually not the WaitGroup itself. It is more often a worker that never reaches Done, a blocking path that never returns, or an Add and Done balance that breaks under error, panic, or cancellation paths.
The short version: find which goroutine never completes and whether its Done path is still reachable. A stuck WaitGroup almost always means the counter never reached zero.
If you want the wider Go troubleshooting view first, step back to the Golang Troubleshooting Guide.
Quick Answer
If a WaitGroup is stuck, first assume the counter never returned to zero because one goroutine never completed or one Done path was skipped.
In many incidents, the WaitGroup itself is fine. The real issue is missing accounting, a blocked worker, or a cancellation path that never lets the goroutine return.
What to Check First
Use this order first:
- confirm which wait path is blocked and which goroutines are still alive
- inspect every Add and Done pair on the stuck path
- check for blocked channel operations or downstream waits
- verify panic, timeout, and cancellation branches still reach Done
- compare the incident with recent concurrency changes
If you do not know which goroutine is keeping the counter above zero, the rest of the debugging is mostly guesswork.
Start with Add and Done balance
WaitGroup bugs are usually accounting bugs or lifecycle bugs.
That means these questions matter first:
- where does Add happen?
- where does Done happen?
- can any path skip Done?
- can any worker block forever before returning?
Until those are answered, the WaitGroup itself is rarely the interesting part.
What a stuck WaitGroup usually looks like
In production, this often appears as:
- shutdown hanging forever
- one request path waiting for workers that never complete
- background jobs appearing done except one hidden goroutine
- operators suspecting deadlock when the real issue is a missing Done
The visible symptom is “Wait never ends,” but the real cause is usually much more concrete.
Missing Done versus blocked worker
| Pattern | What it usually means | Better next step |
|---|---|---|
| Done is skipped on an error branch | Accounting bug | Guarantee Done on every exit path |
| defer wg.Done() exists but wait still hangs | Worker never returns | Find the blocking condition |
| New fan-out logic made waits sticky | Add and lifecycle changed together | Recheck worker launch and accounting order |
| Shutdown hangs only under pressure | Cancellation does not drain workers | Inspect stop conditions and blocked calls |
Common causes
1. Done is never called
One worker exits through an error branch, timeout branch, or panic path without decrementing the counter.
This is the most common cause.
2. Add happens in the wrong place
Calling Add inside the worker goroutine, or after Wait has already started, races with the counter: Wait can observe zero before the worker is even accounted for.
In Go, Add must happen before the worker launches, typically in the same goroutine that later calls Wait.
3. A worker is blocked forever
Channel waits, network calls, DB calls, or dependency stalls can keep one goroutine alive long enough to block the whole wait.
4. Shutdown and cancellation are incomplete
Workers that should stop on context cancellation may continue waiting forever.
That makes the WaitGroup look broken when the real issue is lifecycle coordination.
5. Panic or early-return behavior bypasses the expected path
Even when the happy path is correct, unusual exit branches can break the balance if Done is not guaranteed.
A practical debugging order
1. Confirm which wait path is blocked and which workers are still alive
The first job is to identify the goroutine that is preventing the counter from reaching zero.
2. Inspect every Add / Done pair around the stuck path
Do not trust memory here. Trace the real control flow.
3. Check for blocked channel operations or downstream calls
If a worker never returns, defer wg.Done() still will not help until the blocking condition changes.
4. Verify panic, timeout, and cancellation branches still call Done
This is where many real-world WaitGroup bugs hide.
5. Compare the issue with recent concurrency changes
New fan-out, new shutdown logic, or changed cancellation rules often explain why a WaitGroup suddenly became sticky.
Example: correct defer, wrong lifecycle
```go
var wg sync.WaitGroup
wg.Add(1)
go func() {
	defer wg.Done()
	if err := doWork(ctx); err != nil {
		return
	}
}()
wg.Wait()
```
This looks safe, and often it is. But if doWork(ctx) never returns because it waits forever, wg.Done() is still unreachable in practice.
That is why defer wg.Done() is the safest pattern, but not the whole answer.
What to change after you find the stuck path
If Done can be skipped
Restructure the worker so Done is guaranteed on every exit path.
If Add is unsafe
Move it so accounting happens before worker launch and not from fragile concurrent timing.
If a worker blocks forever
Fix the blocking condition, timeout path, or dependency behavior first.
If cancellation is incomplete
Make worker shutdown explicit so Wait reflects real lifecycle completion.
If the issue is really leaked work
Treat it as a goroutine lifecycle problem, not only a WaitGroup bug.
A useful incident question
Ask this:
Which exact goroutine is keeping the WaitGroup counter above zero, and what concrete condition would let that goroutine exit?
That question almost always points toward the real bug quickly.
Bottom Line
Stuck WaitGroup incidents are usually lifecycle or accounting bugs before they are synchronization mysteries.
In practice, find the goroutine that never finishes, then check whether Done is reachable on every path. Once that is clear, the WaitGroup symptom usually collapses into a much smaller bug.
FAQ
Q. Is defer wg.Done() always enough?
It is usually the safest pattern, but it still does not fix workers that never return.
Q. What is the fastest first step?
Find which goroutine is still alive and whether its Done path is reachable.
Q. Is a stuck WaitGroup always a deadlock?
No. It is often missing accounting or one worker waiting forever on something else.
Q. Can context cancellation help?
Yes, but only if workers actually respect it and exit cleanly.
Read Next
- If blocked workers are piling up instead of finishing, continue with Golang Goroutine Leak.
- If cancellation closes work too early or unpredictably, compare with Golang Context Cancelled Too Early.
- If the wider issue is queueing and worker coordination, compare with Golang Worker Pool Backpressure.
- For the broader Go debugging map, browse the Golang Troubleshooting Guide.