Java Heap Dump: What to Check First


When you capture a Java heap dump, the hardest part is often deciding whether you are looking at a true leak, a workload burst, or a backlog that temporarily retains too many objects. A heap dump gives you a real object graph, but it still needs interpretation.

The short version: start with retained memory, not raw object count. The biggest class by instance count is not always the real problem. The more useful question is which objects dominate retained heap and why those objects are still reachable.

If you want the broader Java troubleshooting overview first, step back to the Java Troubleshooting Guide.


Start with retained memory, not raw count

A class with many instances is not automatically the culprit.

What matters more is:

  • retained size
  • dominator relationships
  • reference chains
  • whether the objects should still be alive at all

That is why dominators and paths to GC roots are usually more valuable than a simple histogram.
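The gap between instance count and retained size can be sketched in a few lines. This is an illustrative example (the `Session` class and the 8 MB buffer are invented for demonstration): a histogram sorted by count would rank many small objects first, while a handful of holders retain most of the heap.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: instance count vs retained size.
public class RetainedVsCount {
    static class Session {
        // One small object whose field retains a large buffer:
        // the shallow size is a few bytes, the retained size is ~8 MB.
        final byte[] uploadBuffer = new byte[8 * 1024 * 1024];
    }

    // Approximate retained bytes held through the sessions' buffers.
    public static long retainedBytes(List<Session> sessions) {
        long total = 0;
        for (Session s : sessions) total += s.uploadBuffer.length;
        return total;
    }

    public static void main(String[] args) {
        List<Session> live = new ArrayList<>();
        live.add(new Session()); // one instance, large retained graph
        System.out.println("retained ~" + retainedBytes(live) + " bytes");
    }
}
```

A class histogram would barely notice one `Session` instance; a dominator view would put it near the top.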


What heap dumps are most useful for

Heap dumps help most when you need to answer questions like:

  • what is retaining the most memory right now?
  • is this growth coming from cache, backlog, or leaked references?
  • are objects staying alive longer than expected?
  • is this consistent with OOM or long GC pauses?

A dump taken at the right moment can replace guessing with a concrete retention path.


Common causes

1. Large collections keep growing

This is one of the most common findings.

Maps, lists, queues, and caches can dominate heap when entries are:

  • never evicted
  • consumed too slowly
  • duplicated across requests
  • larger than expected

The memory issue may not be a leak in the classic sense. It may simply be unbounded retention.
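One common fix for unbounded retention is to make the bound explicit. A minimal sketch using `LinkedHashMap`'s access-order mode and its `removeEldestEntry` hook (the `BoundedCache` name and the capacity of 2 are illustrative; production code would more likely use a cache library with eviction policies):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: a size-bounded LRU cache, so entries are evicted
// instead of accumulating without limit.
public class BoundedCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public BoundedCache(int maxEntries) {
        super(16, 0.75f, true); // true = access order (LRU)
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries; // evict once over the bound
    }

    public static void main(String[] args) {
        BoundedCache<Integer, String> cache = new BoundedCache<>(2);
        cache.put(1, "a");
        cache.put(2, "b");
        cache.put(3, "c"); // evicts key 1, the least recently used
        System.out.println(cache.keySet()); // [2, 3]
    }
}
```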

2. Backlog retains request data

Queued work can keep payloads, contexts, responses, and closures alive much longer than intended.

If the system is falling behind, the heap dump may show the queue symptom more clearly than the original performance bottleneck.
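A bounded queue makes that backlog visible instead of silently retaining it. A sketch under assumed sizes (the 100-slot capacity and 1 KB payloads are invented for illustration): with an unbounded `LinkedBlockingQueue`, all 1000 payloads would stay on the heap; with a bounded queue, `offer` fails fast once the queue is full.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch: bounding the backlog so slow consumers cause visible
// rejection rather than invisible heap growth.
public class BoundedBacklog {
    public static int enqueue(BlockingQueue<byte[]> queue, int tasks) {
        int accepted = 0;
        for (int i = 0; i < tasks; i++) {
            // offer() returns false when the queue is full, instead of
            // retaining an ever-growing backlog of payloads.
            if (queue.offer(new byte[1024])) accepted++;
        }
        return accepted;
    }

    public static void main(String[] args) {
        BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(100);
        int accepted = enqueue(queue, 1000);
        System.out.println("accepted " + accepted + " of 1000"); // accepted 100 of 1000
    }
}
```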

3. Reference chains prevent cleanup

Objects that should be collectible may still be reachable through:

  • singletons
  • static holders
  • thread locals
  • listener registries
  • caches with no real expiration

This is where paths to GC roots become especially valuable.
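The thread-local case deserves a concrete sketch, because it is easy to miss: in a thread pool, threads outlive requests, so a `ThreadLocal` that is never cleared retains its value (and everything it references) for the life of the thread. The `RequestContext` class here is illustrative:

```java
// Sketch: thread-local request state kept alive by pooled threads.
public class RequestContext {
    private static final ThreadLocal<byte[]> CONTEXT = new ThreadLocal<>();

    public static void handleRequest(byte[] payload) {
        CONTEXT.set(payload);
        try {
            // ... request processing that reads CONTEXT.get() ...
        } finally {
            CONTEXT.remove(); // without this, the pooled thread retains payload
        }
    }

    static boolean isCleared() {
        return CONTEXT.get() == null;
    }

    public static void main(String[] args) {
        handleRequest(new byte[1024]);
        System.out.println("cleared after request: " + isCleared());
    }
}
```

In a heap dump, this pattern shows up as a GC-root path running from a worker thread through its thread-local map to the retained payload.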

4. The snapshot was taken at the wrong moment

A dump captured during a short burst may show temporary pressure rather than a long-term leak.

That is why a single dump is useful, but comparing multiple dumps against traffic timing is often better.

5. Large retained graphs are only part of the story

Sometimes the dump shows large retained structures, but the real incident started elsewhere:

  • queue buildup
  • slow downstream dependencies
  • traffic spikes
  • retry storms

The dump still helps, but only if you read it in operational context.


A practical debugging order

1. Identify the largest retained objects and dominators

Start with what dominates retained heap, not what merely has many instances.

This tells you where to spend your attention first.

2. Inspect the reference path that keeps them alive

Ask:

  • which object owns this graph?
  • should that owner still be reachable?
  • is the reference intentional or accidental?

This step is often where the incident shifts from “memory is high” to a real code path.
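An accidental ownership chain often looks like this sketch (the `EventBus` singleton and listener shapes are invented for illustration): a long-lived registry holds a listener, the listener captures request state, and the GC-root path in the dump reads roughly `EventBus.INSTANCE (static) -> listeners -> lambda -> payload`.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: a static singleton registry as the accidental owner
// of per-request data.
public class EventBus {
    public static final EventBus INSTANCE = new EventBus();
    private final List<Runnable> listeners = new ArrayList<>();

    public void register(Runnable l)   { listeners.add(l); }
    public void unregister(Runnable l) { listeners.remove(l); }
    public int listenerCount()         { return listeners.size(); }

    public static void main(String[] args) {
        byte[] payload = new byte[1024];
        Runnable l = () -> System.out.println(payload.length); // captures payload
        INSTANCE.register(l);
        // ... handle request ...
        INSTANCE.unregister(l); // the accidental owner releases the graph
        System.out.println("listeners: " + INSTANCE.listenerCount());
    }
}
```

Whether the fix is deregistration, a weak reference, or a shorter-lived registry depends on whether that ownership was intentional.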

3. Compare caches, queues, and large collections with expected size

Do not just ask whether they are large. Ask whether they are large relative to design expectations.

For example:

  • is the queue size normal for current load?
  • is the cache bounded?
  • did a collection grow after a deployment?

4. Compare heap dump timing with traffic or deployment changes

The same retained graph can mean very different things depending on when the dump was taken.

A snapshot captured during a brief burst should not be interpreted the same way as one taken after hours of steady growth.

5. Move back to pauses or OOM when the dump confirms retention

If the same retained graph is stretching GC pauses or pushing the service toward failure, connect the evidence back to the operational symptom.


Example: queue retention disguised as a leak

First capture the dump, for example with jcmd:

jcmd <pid> GC.heap_dump heap.hprof

Suppose the dump shows many request payload objects retained by tasks in a thread pool queue. That can look like a classic memory leak at first, but the real issue may be that workers slowed down and backlog kept those objects alive too long.

This is why a heap dump should be read together with queue and throughput signals.
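When timing matters, the dump can also be triggered from inside the JVM with the `HotSpotDiagnosticMXBean`, which is what `jcmd GC.heap_dump` uses under the hood. A sketch (the `HeapDumper` wrapper is illustrative; the MXBean itself is a real `com.sun.management` API on HotSpot JVMs), useful for dumping at a precise moment, such as when a queue-depth threshold is crossed:

```java
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;
import com.sun.management.HotSpotDiagnosticMXBean;

// Sketch: capturing a heap dump programmatically (HotSpot JVMs).
public class HeapDumper {
    public static void dump(String path, boolean liveOnly) throws IOException {
        HotSpotDiagnosticMXBean bean =
            ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // liveOnly = true dumps only reachable objects, which is usually
        // what you want for retention analysis. The target file must not
        // already exist, and recent JDKs require a .hprof extension.
        bean.dumpHeap(path, liveOnly);
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempDirectory("dumps").resolve("snapshot.hprof");
        dump(out.toString(), true);
        System.out.println("wrote " + Files.size(out) + " bytes to " + out);
    }
}
```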


What to change after you find the retained graph

If a cache or collection is unbounded

Add real limits, eviction, or lifecycle control.

If backlog retains too much data

Reduce queue buildup and fix the throughput bottleneck that keeps tasks waiting.

If thread locals or static references hold data

Tighten cleanup and ownership boundaries.

If the dump reflects a short burst

Confirm with later snapshots before labeling it a true leak.

If the same graph keeps growing across dumps

Treat it as a strong leak or retention signal and trace that owner path directly.


A useful incident question

Ask this:

Which object graph owns the most retained memory, and should that ownership still exist at this point in the request or task lifecycle?

That question usually leads to a real fix much faster than staring at class counts.


FAQ

Q. Does a large object count mean a leak?

Not always. Retained size and reachability matter more than count alone.

Q. Should I take several dumps?

Yes, if you need to compare whether the same retained graph keeps growing over time.

Q. What is the fastest first step?

Find the biggest dominators and the reference path that keeps them alive.

Q. Can a heap dump show backlog rather than a true leak?

Yes. Queued work can create real memory pressure by retaining objects, without being a classic unbounded leak.

