Golang WaitGroup Stuck: What to Check First

When a Go service seems stuck on a WaitGroup, the real problem is usually not the WaitGroup itself. It is more often a worker that never reaches Done, a blocking path that never returns, or an Add and Done balance that breaks under error, panic, or cancellation paths.

The short version: find which goroutine never completes and whether its Done path is still reachable. A stuck WaitGroup almost always means the counter never reached zero.

If you want the wider Go troubleshooting view first, step back to the Golang Troubleshooting Guide.


Quick Answer

If a WaitGroup is stuck, first assume the counter never returned to zero because one goroutine never completed or one Done path was skipped.

In many incidents, the WaitGroup itself is fine. The real issue is missing accounting, a blocked worker, or a cancellation path that never lets the goroutine return.

What to Check First

Use this order first:

  1. confirm which wait path is blocked and which goroutines are still alive
  2. inspect every Add and Done pair on the stuck path
  3. check for blocked channel operations or downstream waits
  4. verify panic, timeout, and cancellation branches still reach Done
  5. compare the incident with recent concurrency changes

If you do not know which goroutine is keeping the counter above zero, the rest of the debugging is mostly guesswork.

Start with Add and Done balance

WaitGroup bugs are usually accounting bugs or lifecycle bugs.

That means these questions matter first:

  • where does Add happen?
  • where does Done happen?
  • can any path skip Done?
  • can any worker block forever before returning?

Until those are answered, the WaitGroup itself is rarely the interesting part.

What a stuck WaitGroup usually looks like

In production, this often appears as:

  • shutdown hanging forever
  • one request path waiting for workers that never complete
  • background jobs appearing done except one hidden goroutine
  • operators suspecting deadlock when the real issue is a missing Done

The visible symptom is “Wait never ends,” but the real cause is usually much more concrete.

Missing Done versus blocked worker

| Pattern | What it usually means | Better next step |
| --- | --- | --- |
| Done is skipped on an error branch | Accounting bug | Guarantee Done on every exit path |
| defer wg.Done() exists but Wait still hangs | Worker never returns | Find the blocking condition |
| New fan-out logic made waits sticky | Add and lifecycle changed together | Recheck worker launch and accounting order |
| Shutdown hangs only under pressure | Cancellation does not drain workers | Inspect stop conditions and blocked calls |

Common causes

1. Done is never called

One worker exits through an error branch, timeout branch, or panic path without decrementing the counter.

This is the most common cause.

2. Add happens in the wrong place

Calling Add too late, after Wait may already be running, or concurrently with Wait can let Wait return early or even panic. The sync package requires that an Add with a positive delta happen before the corresponding Wait starts.

In Go, where Add happens relative to worker launch matters a lot.

3. A worker is blocked forever

Channel waits, network calls, DB calls, or dependency stalls can keep one goroutine alive long enough to block the whole wait.

4. Shutdown and cancellation are incomplete

Workers that should stop on context cancellation may continue waiting forever.

That makes the WaitGroup look broken when the real issue is lifecycle coordination.

5. Panic or early-return behavior bypasses the expected path

Even when the happy path is correct, unusual exit branches can break the balance if Done is not guaranteed.

A practical debugging order

1. Confirm which wait path is blocked and which workers are still alive

The first job is to identify the goroutine that is preventing the counter from reaching zero.

2. Inspect every Add / Done pair around the stuck path

Do not trust memory here. Trace the real control flow.

3. Check for blocked channel operations or downstream calls

If a worker never returns, defer wg.Done() still will not help until the blocking condition changes.

4. Verify panic, timeout, and cancellation branches still call Done

This is where many real-world WaitGroup bugs hide.

5. Compare the issue with recent concurrency changes

New fan-out, new shutdown logic, or changed cancellation rules often explain why a WaitGroup suddenly became sticky.

Example: correct defer, wrong lifecycle

var wg sync.WaitGroup

wg.Add(1)
go func() {
	defer wg.Done() // guaranteed on return, error, or panic
	if err := doWork(ctx); err != nil {
		return // Done still runs via the defer
	}
}()

wg.Wait() // hangs only if doWork(ctx) itself never returns

This looks safe, and often it is. But if doWork(ctx) never returns because it waits forever, wg.Done() is still unreachable in practice.

That is why defer wg.Done() is the safest pattern, but not the whole answer.

What to change after you find the stuck path

If Done can be skipped

Restructure the worker so Done is guaranteed on every exit path.

If Add is unsafe

Move it so accounting happens before worker launch and not from fragile concurrent timing.

If a worker blocks forever

Fix the blocking condition, timeout path, or dependency behavior first.

If cancellation is incomplete

Make worker shutdown explicit so Wait reflects real lifecycle completion.

If the issue is really leaked work

Treat it as a goroutine lifecycle problem, not only a WaitGroup bug.

A useful incident question

Ask this:

Which exact goroutine is keeping the WaitGroup counter above zero, and what concrete condition would let that goroutine exit?

That question almost always points toward the real bug quickly.

Bottom Line

Stuck WaitGroup incidents are usually lifecycle or accounting bugs before they are synchronization mysteries.

In practice, find the goroutine that never finishes, then check whether Done is reachable on every path. Once that is clear, the WaitGroup symptom usually collapses into a much smaller bug.

FAQ

Q. Is defer wg.Done() always enough?

It is usually the safest pattern, but it still does not fix workers that never return.

Q. What is the fastest first step?

Find which goroutine is still alive and whether its Done path is reachable.

Q. Is a stuck WaitGroup always a deadlock?

No. It is often missing accounting or one worker waiting forever on something else.

Q. Can context cancellation help?

Yes, but only if workers actually respect it and exit cleanly.
