Java GC Pauses Too Long: What to Check First

When Java GC pauses get long, the easiest mistake is to treat the incident as a generic JVM tuning problem. Long pauses usually reflect allocation pressure, retention, or a heap layout that no longer matches the workload. Changing collector flags too early often hides the real cause for a while without solving it.

The short version: start with pause shape, allocation rate, and old generation growth together. A single long stop-the-world pause and a steady stream of shorter pauses do not point to the same bottleneck.

If you want the wider Java routing view first, step back to the Java Troubleshooting Guide.


Start with pause shape, not only average latency

Average latency can hide the real GC pattern.

You need to distinguish between:

  • rare but very large pauses
  • frequent moderate pauses
  • pauses that rise only during traffic bursts
  • pauses that line up with old generation pressure

That pattern tells you whether the bigger issue is churn, retention, sizing, or a workload shift.


What long GC pauses often look like

In production, this problem often appears with:

  • request latency spikes
  • CPU rising during collections
  • old generation staying high after GC
  • recovery taking longer after traffic bursts
  • heap that looks “large enough” on paper but still pauses badly

The visible pause is often the symptom that finally forces investigation, not the first thing that went wrong.


Common causes

1. Allocation churn is too high

Short-lived objects can force frequent collections and make pause spikes more visible under load.

This often happens with:

  • high request fan-out
  • repeated object transformations
  • large temporary buffers
  • serialization-heavy paths

If allocation rate climbs sharply, even objects that die quickly can still create costly pause behavior.
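As a hedged illustration (the class and method names here are hypothetical, not from any specific codebase), a hot path that repeatedly transforms and re-wraps data can allocate several short-lived objects per request:

```java
import java.util.ArrayList;
import java.util.List;

public class ChurnExample {
    // Each step below allocates a new intermediate object that dies almost
    // immediately -- cheap individually, costly at a high request rate.
    static String handle(List<String> parts) {
        List<String> upper = new ArrayList<>();
        for (String p : parts) {
            upper.add(p.toUpperCase());          // new String per element
        }
        String joined = String.join(",", upper); // new String plus a builder
        return "[" + joined + "]";               // yet another new String
    }

    public static void main(String[] args) {
        System.out.println(handle(List.of("a", "b", "c"))); // [A,B,C]
    }
}
```

Each object here is garbage within microseconds, but at thousands of requests per second the young generation fills fast enough to make pause spikes visible.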

2. Old generation retention is growing

If large structures stay live too long, major collections become slower and recovery gets harder.

Typical sources include:

  • caches
  • in-memory queues
  • response aggregation
  • long-lived collections

This is one of the clearest signs that the issue is not only churn but retention.

3. Heap sizing no longer matches traffic

A heap that was acceptable at lower traffic may start pausing badly when:

  • request volume grows
  • payloads become larger
  • concurrency increases
  • object lifetime shifts

This does not always mean “increase heap.” It means the old sizing assumption may no longer match production reality.

4. The real bottleneck is outside GC

CPU saturation, blocked threads, queue buildup, or downstream slowdown can make GC look like the main issue when it is only amplifying pressure.

If the service is already falling behind, GC may become the place where that pressure becomes visible.

5. Large retained objects make pauses expensive

Even before an OOM, oversized retained graphs can stretch pause time enough to hurt latency significantly.

That is why GC incidents often connect directly to heap-dump analysis.


A practical debugging order

1. Inspect GC logs or pause metrics for frequency and worst-case spikes

Do not just look at one average number.

Ask:

  • how often are pauses happening?
  • what is the worst-case pause?
  • did the pattern change recently?
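If GC logs are not enabled, a minimal way to sample frequency and cumulative collection time from inside the process is the standard `GarbageCollectorMXBean`. This is a sketch, not a monitoring setup; polling these counters periodically and diffing them gives pause frequency and a rough time trend:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcStats {
    public static void main(String[] args) {
        // One bean per collector (e.g. young and old for G1).
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: collections=%d, totalTimeMs=%d%n",
                    gc.getName(),
                    gc.getCollectionCount(),   // cumulative collection count
                    gc.getCollectionTime());   // cumulative elapsed time, ms
        }
    }
}
```

Note that `getCollectionTime()` is total elapsed collection time, not per-pause worst case; for worst-case spikes you still want GC logs.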

2. Compare allocation rate with traffic changes

If allocation rate rises sharply after a deployment or workload change, churn may be the dominant cause.

This helps separate collector behavior from application behavior.
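On HotSpot-based JVMs (an assumption: this relies on the non-standard `com.sun.management.ThreadMXBean`, available on OpenJDK and Oracle JDK), you can sample per-thread allocated bytes around a window to estimate allocation rate directly:

```java
import java.lang.management.ManagementFactory;

public class AllocRate {
    // Returns bytes allocated by the current thread across a deliberately
    // allocation-heavy window. HotSpot-specific cast; not portable.
    static long allocateAndMeasure() {
        com.sun.management.ThreadMXBean tmx =
                (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
        long id = Thread.currentThread().getId();
        long before = tmx.getThreadAllocatedBytes(id);
        long touched = 0;
        for (int i = 0; i < 100; i++) {
            byte[] buf = new byte[64 * 1024]; // ~6.4 MB of short-lived buffers
            touched += buf.length;            // keep the buffers from being elided
        }
        long after = tmx.getThreadAllocatedBytes(id);
        System.out.println("touched=" + touched);
        return after - before;
    }

    public static void main(String[] args) {
        System.out.println("allocated bytes in window: " + allocateAndMeasure());
    }
}
```

Sampling this around the same code path before and after a deployment is one way to tell whether a latency regression arrived with an allocation-rate jump.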

3. Check old generation growth and long-lived retention

If old generation stays elevated after GC, retention deserves more attention than collector tuning.
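A quick in-process check of post-collection old generation usage can be sketched with `MemoryPoolMXBean`. Pool names vary by collector (for example "G1 Old Gen" or "Tenured Gen"), so this matches loosely rather than assuming one name:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class OldGenCheck {
    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            if (name.contains("Old") || name.contains("Tenured")) {
                // Usage as of the last collection; null if unsupported or
                // no collection has happened yet.
                MemoryUsage afterGc = pool.getCollectionUsage();
                System.out.printf("%s: usedAfterGc=%d, max=%d%n",
                        name,
                        afterGc != null ? afterGc.getUsed() : -1,
                        afterGc != null ? afterGc.getMax() : -1);
            }
        }
    }
}
```

If `usedAfterGc` stays near `max` across collections, old generation data is being retained rather than reclaimed, which points at retention rather than collector tuning.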

4. Compare heap sizing with the current workload

The heap may have been fine for an earlier traffic profile.

Still, avoid jumping straight to heap enlargement until you understand whether the memory is being used productively.

5. Move to heap analysis when retention still looks suspicious

If pauses line up with retained structures, the next useful artifact is usually a heap dump, not another round of random JVM flags.


Example: pause growth from retained objects

Map<String, byte[]> cache = new HashMap<>();  // long-lived, nothing ever evicts
cache.put(key, new byte[10_000_000]);         // each key pins ~10 MB until removed

Large retained objects or rapidly repeated allocations can stretch GC pauses even before the heap is technically full.

That is why long pauses do not require a full OOM to become severe.


What to change after you find the pattern

If churn is the main issue

Reduce needless allocation and object copying in hot paths.
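As a hedged before/after sketch (hypothetical method names), string concatenation in a loop is a classic source of per-iteration garbage, and a reused `StringBuilder` removes it without changing behavior:

```java
public class HotPath {
    // Churny version: `+=` allocates a new String (and a hidden builder)
    // on every iteration -- O(n^2) copying and n temporaries.
    static String joinChurny(String[] parts) {
        String out = "";
        for (String p : parts) out += p + ";";
        return out;
    }

    // Lean version: one builder, one final String allocation.
    static String joinLean(String[] parts) {
        StringBuilder sb = new StringBuilder(parts.length * 8);
        for (String p : parts) sb.append(p).append(';');
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] parts = {"a", "b", "c"};
        System.out.println(joinLean(parts)); // a;b;c;
    }
}
```

The same idea generalizes: reuse buffers, avoid intermediate collections, and stream transformations instead of materializing each step.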

If retention is the main issue

Trace the retained graph and shrink the structures that stay live too long.
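For unbounded caches specifically, one minimal fix is to cap retained size. A sketch using `LinkedHashMap` in access order with `removeEldestEntry` gives a simple LRU bound (a real service might prefer a dedicated caching library, but the idea is the same):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BoundedCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public BoundedCache(int maxEntries) {
        super(16, 0.75f, true);       // true = access-order iteration (LRU)
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;   // evict least-recently-used entry
    }

    public static void main(String[] args) {
        BoundedCache<String, byte[]> cache = new BoundedCache<>(2);
        cache.put("a", new byte[1024]);
        cache.put("b", new byte[1024]);
        cache.put("c", new byte[1024]); // evicts "a"
        System.out.println(cache.keySet()); // [b, c]
    }
}
```

Bounding the cache turns an open-ended retention problem into a fixed, predictable slice of old generation.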

If heap sizing is simply outdated

Resize intentionally, but only after confirming the workload and object lifetime story.

If queue or backlog is the deeper issue

Fix throughput collapse before blaming GC alone.

If CPU spikes rise with GC

Treat runtime pressure and memory pressure as one incident, not two separate mysteries.


A useful incident question

Ask this:

Are pauses long because the JVM is collecting too often, because too much old data remains live, or because the workload changed beyond the heap design?

That question is more actionable than “Should we change collectors?”


FAQ

Q. Should I switch collectors first?

Not before you know whether the real problem is allocation churn, retention, or simple undersizing.

Q. Do long pauses always mean a memory leak?

No. They can also come from bursty allocation, large heaps, or workload changes.

Q. What is the fastest first step?

Look at pause shape, allocation rate, and old generation growth together.

Q. When should I take a heap dump?

When retained heap still looks suspicious after you compare pause behavior with traffic and old generation growth.

