In our core payment gateway, GC pauses suddenly spiked to 3 seconds during peak hours. The team initially tried increasing the heap size, but pauses only grew longer. The real issue was a newly introduced response caching layer that held onto large 20MB objects for too long, filling the old generation. Re-scoping the cache lifespan dropped our max pause time from 3000ms to 80ms overnight.
When Java GC pauses get long, the easiest mistake is to treat the incident as a generic JVM tuning problem. Long pauses usually reflect allocation pressure, retention, or a heap layout that no longer matches the workload. Changing collector flags too early often masks the real cause for a while without fixing it.
The short version: start with pause shape, allocation rate, and old generation growth together. A single long stop-the-world pause and a steady stream of shorter pauses do not point to the same bottleneck.
If you want a broader map of Java symptoms first, step back to the Java Troubleshooting Guide.
Start with pause shape, not only average latency
Average latency can hide the real GC pattern.
You need to distinguish between:
- rare but very large pauses
- frequent moderate pauses
- pauses that rise only during traffic bursts
- pauses that line up with old generation pressure
That pattern tells you whether the bigger issue is churn, retention, sizing, or a workload shift.
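To make the distinction concrete, here is a minimal sketch that summarizes one window of pause durations. It assumes you already have the pauses in milliseconds (from GC logs or a metrics system); the class name `PauseShape`, the labels, and the thresholds are hypothetical placeholders, not recommended values.

```java
import java.util.Arrays;

/** Summarizes a window of GC pause durations so the pattern is visible, not just the mean. */
final class PauseShape {
    final int count;
    final long maxMs;
    final double meanMs;
    final long p99Ms;

    PauseShape(long[] pausesMs) {
        long[] sorted = pausesMs.clone();
        Arrays.sort(sorted);
        count = sorted.length;
        maxMs = count == 0 ? 0 : sorted[count - 1];
        meanMs = Arrays.stream(sorted).average().orElse(0.0);
        // Nearest-rank percentile: smallest value with at least 99% of samples at or below it.
        int rank = (int) Math.ceil(0.99 * count);
        p99Ms = count == 0 ? 0 : sorted[Math.max(rank - 1, 0)];
    }

    /** Hypothetical labels; both thresholds are placeholders to tune per service. */
    String label(long largePauseMs, int frequentCount) {
        if (count == 0) return "no pauses";
        if (maxMs >= largePauseMs && count < frequentCount) return "rare but very large pauses";
        if (maxMs >= largePauseMs) return "frequent pauses with large outliers";
        if (count >= frequentCount) return "frequent moderate pauses";
        return "unremarkable";
    }

    public static void main(String[] args) {
        long[] window = {12, 15, 11, 3000, 14}; // made-up sample values
        PauseShape shape = new PauseShape(window);
        System.out.println(shape.label(500, 100)
                + " (mean=" + shape.meanMs + " ms, max=" + shape.maxMs + " ms)");
    }
}
```

In the sample window the mean is about 610 ms, which describes none of the five pauses; the maximum and the count together tell the real story.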
What long GC pauses often look like
In production, this problem often appears with:
- request latency spikes
- CPU rising during collections
- old generation staying high after GC
- recovery taking longer after traffic bursts
- a heap that looks “large enough” on paper while the service still pauses badly
The visible pause is often the symptom that finally forces investigation, not the first thing that went wrong.
Common causes
1. Allocation churn is too high
Short-lived objects can force frequent collections and make pause spikes more visible under load.
This often happens with:
- high request fan-out
- repeated object transformations
- large temporary buffers
- serialization-heavy paths
If allocation rate climbs sharply, even objects that die quickly can still produce costly pauses.
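One way to check whether a specific code path is a churn source is to measure how many bytes a thread allocates while running it. The sketch below assumes a HotSpot-based JVM, where the platform `ThreadMXBean` can be cast to `com.sun.management.ThreadMXBean`; on other JVMs the helper returns -1. The class name `AllocationProbe` is hypothetical.

```java
import java.lang.management.ManagementFactory;

/** Measures bytes allocated by the current thread while a task runs (HotSpot-specific). */
final class AllocationProbe {
    // Keeps the sample allocation reachable so it cannot be optimized away.
    static volatile Object sink;

    /** Returns allocated bytes, or -1 if this JVM does not expose per-thread counters. */
    static long allocatedBytes(Runnable task) {
        java.lang.management.ThreadMXBean base = ManagementFactory.getThreadMXBean();
        if (!(base instanceof com.sun.management.ThreadMXBean)) return -1;
        com.sun.management.ThreadMXBean bean = (com.sun.management.ThreadMXBean) base;
        if (!bean.isThreadAllocatedMemorySupported() || !bean.isThreadAllocatedMemoryEnabled()) {
            return -1;
        }
        long id = Thread.currentThread().getId();
        long before = bean.getThreadAllocatedBytes(id);
        task.run();
        return bean.getThreadAllocatedBytes(id) - before;
    }

    public static void main(String[] args) {
        long bytes = allocatedBytes(() -> sink = new byte[1_000_000]);
        System.out.println("allocated ~" + bytes + " bytes");
    }
}
```

The number covers only the calling thread, so work handed off to other threads inside the task is not counted.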
2. Old generation retention is growing
If large structures stay live too long, major collections become slower and recovery gets harder.
Typical sources include:
- caches
- in-memory queues
- response aggregation
- long-lived collections
Old generation occupancy that stays high after a major collection is one of the clearest signs that the issue is retention, not only churn.
3. Heap sizing no longer matches traffic
A heap that was acceptable at lower traffic may start pausing badly when:
- request volume grows
- payloads become larger
- concurrency increases
- object lifetime shifts
This does not always mean “increase heap.” It means the old sizing assumption may no longer match production reality.
4. The real bottleneck is outside GC
CPU saturation, blocked threads, queue buildup, or downstream slowdown can make GC look like the main issue when it is only amplifying pressure.
If the service is already falling behind, GC may become the place where that pressure becomes visible.
5. Large retained objects make pauses expensive
Even before an OOM, oversized retained graphs can stretch pause time enough to hurt latency significantly.
That is why GC incidents often connect directly to heap-dump analysis.
A practical debugging order
1. Inspect GC logs or pause metrics for frequency and worst-case spikes
Do not just look at one average number.
Ask:
- how often are pauses happening?
- what is the worst-case pause?
- did the pattern change recently?
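GC logs (for example from unified logging with `-Xlog:gc*` on JDK 9 and later) remain the most detailed source for these questions. When only in-process metrics are available, a rough sketch like the one below can diff the standard `GarbageCollectorMXBean` counters over an interval. The class name `GcSnapshot` is hypothetical, and note that `getCollectionTime()` is an approximate accumulated time that, for concurrent collectors, is not the same thing as stop-the-world pause time.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.Map;

/** Captures cumulative GC counts and times so two snapshots can be diffed over an interval. */
final class GcSnapshot {
    // Collector name -> {collection count, accumulated collection time in ms}.
    final Map<String, long[]> totals = new LinkedHashMap<>();

    static GcSnapshot take() {
        GcSnapshot snapshot = new GcSnapshot();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both getters return -1 when the collector does not report the value.
            snapshot.totals.put(gc.getName(),
                    new long[] {gc.getCollectionCount(), gc.getCollectionTime()});
        }
        return snapshot;
    }

    /** Describes what changed since an earlier snapshot, one line per collector. */
    String deltaSince(GcSnapshot earlier) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, long[]> e : totals.entrySet()) {
            long[] now = e.getValue();
            long[] then = earlier.totals.getOrDefault(e.getKey(), new long[] {0, 0});
            out.append(e.getKey()).append(": ")
               .append(now[0] - then[0]).append(" collections, ")
               .append(now[1] - then[1]).append(" ms")
               .append(System.lineSeparator());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        GcSnapshot before = GcSnapshot.take();
        System.gc(); // only a hint to the JVM; used here so the delta is usually non-zero
        System.out.print(GcSnapshot.take().deltaSince(before));
    }
}
```

Taking a snapshot on a fixed schedule and logging the delta gives frequency per collector; it does not give the worst-case single pause, which still has to come from logs or pause-level metrics.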
2. Compare allocation rate with traffic changes
If allocation rate rises sharply after a deployment or workload change, churn may be the dominant cause.
This helps separate collector behavior from application behavior.
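A simple way to make that comparison is to normalize allocation by request count, so that more traffic is not mistaken for more churn per request. The sketch below is plain arithmetic; the class name `AllocationPerRequest` is hypothetical and the numbers in `main` are made up for illustration.

```java
/** Normalizes allocation by traffic so growth in volume is not mistaken for growth in churn. */
final class AllocationPerRequest {
    /** Average bytes allocated per request in one measurement window. */
    static double bytesPerRequest(long allocatedBytes, long requests) {
        if (requests <= 0) throw new IllegalArgumentException("requests must be positive");
        return (double) allocatedBytes / requests;
    }

    /** A ratio above 1.0 means each request now allocates more than it did in the baseline. */
    static double churnRatio(long baseBytes, long baseRequests, long curBytes, long curRequests) {
        return bytesPerRequest(curBytes, curRequests) / bytesPerRequest(baseBytes, baseRequests);
    }

    public static void main(String[] args) {
        // Made-up numbers: traffic doubled, but total allocation grew sixfold.
        double ratio = churnRatio(4_000_000_000L, 100_000, 24_000_000_000L, 200_000);
        System.out.println("per-request allocation changed by a factor of " + ratio);
    }
}
```

A ratio near 1.0 with higher total allocation points at traffic growth; a ratio well above 1.0 points at a change in the application itself.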
3. Check old generation growth and long-lived retention
If old generation stays elevated after GC, retention deserves more attention than collector tuning.
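The after-GC occupancy can be read in-process through the standard `MemoryPoolMXBean.getCollectionUsage()` call. The sketch below is a rough probe under one loud assumption: old-generation pool names vary by collector, so it matches on the substrings "Old" and "Tenured", which may not cover every JVM. The class name `OldGenAfterGc` is hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

/** Reads how much of the old generation was still in use after the most recent collection. */
final class OldGenAfterGc {
    /** Returns used bytes after the last GC of an old-generation pool, or -1 if none reports it. */
    static long usedAfterLastGc() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            // Assumption: pool names differ by collector, so match on common substrings.
            boolean oldGen = name.contains("Old") || name.contains("Tenured");
            if (pool.getType() != MemoryType.HEAP || !oldGen) continue;
            MemoryUsage afterGc = pool.getCollectionUsage(); // null if the pool does not support it
            if (afterGc != null) return afterGc.getUsed();
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println("old generation used after last GC: " + usedAfterLastGc() + " bytes");
    }
}
```

Sampling this value over time is more telling than a single reading: a floor that keeps rising across collections is the retention signal.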
4. Compare heap sizing with the current workload
The heap may have been fine for an earlier traffic profile.
Still, avoid jumping straight to heap enlargement until you understand whether the heap is holding data the service actually needs or data that is simply being retained too long.
5. Move to heap analysis when retention still looks suspicious
If pauses line up with retained structures, the next useful artifact is usually a heap dump, not another round of random JVM flags.
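If attaching external tooling is awkward, a dump can also be triggered from inside the process. The sketch below assumes a HotSpot-based JVM, where `com.sun.management.HotSpotDiagnosticMXBean` is available; the class name `HeapDumper` and the file names are hypothetical. Writing a dump stops the application while it runs and the file can be as large as the heap, so treat this as something to run deliberately, not on every alert.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

/** Triggers a heap dump from inside the running JVM (HotSpot-specific). */
final class HeapDumper {
    /** Writes a dump of reachable objects to the given .hprof file, which must not exist yet. */
    static void dumpLiveObjects(Path target) throws IOException {
        HotSpotDiagnosticMXBean diagnostics =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // live = true dumps only reachable objects, which is what retention analysis needs.
        diagnostics.dumpHeap(target.toString(), true);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("heap-dumps");
        Path file = dir.resolve("incident.hprof");
        dumpLiveObjects(file);
        System.out.println("wrote " + Files.size(file) + " bytes to " + file);
    }
}
```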
Example: pause growth from retained objects
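// Fragment (uses java.util.Map and java.util.HashMap); key is any String id.
Map<String, byte[]> cache = new HashMap<>();
// With no eviction, every ~10 MB array stays reachable for as long as the map holds it.
cache.put(key, new byte[10_000_000]);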
Large retained objects or rapidly repeated allocations can stretch GC pauses even before the heap is technically full.
That is why long pauses do not require a full OOM to become severe.
What to change after you find the pattern
If churn is the main issue
Reduce needless allocation and object copying in hot paths.
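As one small illustration of that idea, the sketch below checksums a stream with a single reusable buffer instead of allocating a new array for each chunk. It uses only standard library classes; the class name `ChunkChecksum` and the 8 KB buffer size are arbitrary choices for the example, not tuned values.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.CRC32;

/** Checksums a stream with one reusable buffer instead of a fresh array per chunk. */
final class ChunkChecksum {
    static long crc32(InputStream in) throws IOException {
        CRC32 crc = new CRC32();
        byte[] buffer = new byte[8192]; // allocated once, reused for every read
        int read;
        while ((read = in.read(buffer)) != -1) {
            crc.update(buffer, 0, read); // consumes the bytes in place, no per-chunk copy
        }
        return crc.getValue();
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = new byte[100_000];
        System.out.println("crc32=" + crc32(new ByteArrayInputStream(payload)));
    }
}
```

The same pattern applies to serialization and transformation paths: the win comes from not materializing intermediate copies, which should be confirmed by measuring allocation before and after the change.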
If retention is the main issue
Trace the retained graph and shrink the structures that stay live too long.
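For caches specifically, one common way to shrink what stays live is to put a hard bound on the structure. The sketch below builds a size-bounded LRU map on top of `LinkedHashMap` and its `removeEldestEntry` hook. It bounds by entry count, which is only one possible policy (time-based expiry is another); the class name `BoundedCache` is hypothetical, and the class is not thread-safe as written.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** A size-bounded LRU map: the least recently used entry is evicted once the limit is exceeded. */
final class BoundedCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    BoundedCache(int maxEntries) {
        super(16, 0.75f, true); // accessOrder = true gives least-recently-used ordering
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        BoundedCache<String, byte[]> cache = new BoundedCache<>(2);
        cache.put("a", new byte[1024]);
        cache.put("b", new byte[1024]);
        cache.get("a");                     // touch "a" so "b" becomes the eldest entry
        cache.put("c", new byte[1024]);     // exceeds the limit, so "b" is evicted
        System.out.println(cache.keySet()); // prints [a, c]
    }
}
```

With large values, the entry limit effectively caps retained bytes at roughly the limit times the largest value size, which makes the worst case easy to reason about.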
If heap sizing is simply outdated
Resize intentionally, but only after confirming the workload and object lifetime story.
If queue or backlog is the deeper issue
Fix throughput collapse before blaming GC alone.
If CPU spikes rise with GC
Treat runtime pressure and memory pressure as one incident, not two separate mysteries.
A useful incident question
Ask this:
Are pauses long because the JVM is collecting too often, because too much old data remains live, or because the workload changed beyond the heap design?
That question is more actionable than “Should we change collectors?”
FAQ
Q. Should I switch collectors first?
Not before you know whether the real problem is allocation churn, retention, or simple undersizing.
Q. Do long pauses always mean a memory leak?
No. They can also come from bursty allocation, large heaps, or workload changes.
Q. What is the fastest first step?
Look at pause shape, allocation rate, and old generation growth together.
Q. When should I take a heap dump?
When retained heap still looks suspicious after you compare pause behavior with traffic and old generation growth.
Read Next
- If the pause spike still looks like retained heap rather than allocation churn, use a heap dump to analyze object lifetimes and reference paths.
- If the same memory pressure is now pushing the service toward failure, identify the memory state at the exact point of OOM.
- If CPU rises together with GC pressure, use a thread dump to verify whether the actual hot threads are GC threads.
- If you need the wider symptom map again, return to the Java Troubleshooting Guide.
Sources:
- https://docs.oracle.com/en/java/javase/21/troubleshoot/
- https://docs.oracle.com/en/java/javase/21/gctuning/