Java Troubleshooting Guide: Where to Start With OOM, Thread Pools, and Runtime Pressure
The hardest part of a Java incident is usually not remembering one more JVM flag. It is deciding whether the visible bottleneck is memory pressure, queue backlog, hot-thread contention, or a runtime that was already saturating before the crash appeared.
This guide is the routing page for Java incidents on this blog. It helps you decide:
- whether the visible problem belongs mostly to the OOM branch, the executor-backlog branch, or the GC and CPU branches
- what to inspect in the first few minutes so you can choose the next article quickly
- which false starts often waste time during Java troubleshooting
The short version: route by the strongest visible bottleneck first, not by the most familiar tuning habit.
When this hub is the right starting point
Use this page when:
- OutOfMemoryError is visible
- a thread-pool queue keeps growing
- GC pauses suddenly become much longer
- CPU stays hot while throughput falls
- thread dumps show blocked progress or waiting cycles
- the service looks saturated before it fully crashes
At that point, the most valuable thing is usually choosing the right branch, not applying a fix immediately.
A five-minute JVM triage pass
jcmd <pid> GC.heap_info
jcmd <pid> Thread.print
jcmd <pid> VM.native_memory summary
VM.native_memory summary only returns data when Native Memory Tracking (NMT) was enabled at JVM startup, for example with -XX:NativeMemoryTracking=summary. The point of this command set is not to solve the incident from one screen. The point is to decide which branch deserves the next thirty minutes.
- strong heap or memory-area pressure: start with memory
- blocked threads and waiting cycles: start with deadlock or contention
- hot threads and saturated executors: start with CPU or backlog
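When attaching jcmd to the process is awkward, a similar first look can be taken in-process through the standard java.lang.management beans. The following is a minimal sketch rather than a replacement for the commands above; the class name TriageSnapshot and its method names are illustrative assumptions.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper: a rough in-process view of heap pressure and thread states.
final class TriageSnapshot {

    /** Counts live threads by state, similar to skimming Thread.print output. */
    static Map<Thread.State, Integer> threadStateCounts() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (ThreadInfo info : threads.getThreadInfo(threads.getAllThreadIds())) {
            if (info == null) continue; // thread ended between the two calls
            counts.merge(info.getThreadState(), 1, Integer::sum);
        }
        return counts;
    }

    /** Heap used as a fraction of the maximum, or -1 if the maximum is undefined. */
    static double heapUsedFraction() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long max = heap.getMax();
        return max > 0 ? (double) heap.getUsed() / max : -1.0;
    }

    public static void main(String[] args) {
        System.out.printf("heap used fraction: %.2f%n", heapUsedFraction());
        System.out.println("thread states: " + threadStateCounts());
    }
}
```

Many BLOCKED or WAITING threads point toward the deadlock or contention branch, while a heap fraction that stays near 1.0 points toward the memory branch. Neither number is conclusive on its own.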
Three questions to ask early
1. Is the system crashing, stalling, or saturating?
An OOM, a deadlock, and a saturated executor can all look like “the service is bad,” but the first guide should not be the same.
2. Is the dominant shape memory growth or queue growth?
Large memory use does not automatically mean memory is the root cause. Queue backlog often pushes memory upward as a secondary symptom.
3. Is the visible pain mostly pause time or contention?
Long GC pauses and hot CPU can both hurt throughput. The right first branch depends on which one leads the incident.
When the memory branch should lead
Start by taking a heap dump to identify where the OOM originated when:
- heap, metaspace, or native memory pressure is visible
- the exact OOM variant matters
- large collections, caches, or retained payloads look suspicious
- class-metadata growth looks abnormal
If the retained-object story is still unclear, continue immediately by analyzing a heap dump to inspect the reference tree.
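Before taking a heap dump, it can help to see which memory area is actually under pressure, since an OutOfMemoryError with the message `Java heap space` and one with `Metaspace` lead to different next steps. The sketch below lists the JVM's memory pools through the standard MemoryPoolMXBean API; the class name MemoryAreaReport is an illustrative assumption, and pool names vary by collector and JVM version.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

// Hypothetical helper: one line per memory pool, split into heap and non-heap.
final class MemoryAreaReport {

    /** Formats type, pool name, used size, and maximum size (in MiB) for every pool. */
    static String report() {
        StringBuilder out = new StringBuilder();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage == null) continue; // pool is no longer valid
            String type = pool.getType() == MemoryType.HEAP ? "heap" : "non-heap";
            String max = usage.getMax() < 0
                ? "unbounded"                       // -1 means no defined maximum
                : (usage.getMax() >> 20) + " MiB";
            out.append(String.format("%-8s %-32s used=%d MiB max=%s%n",
                type, pool.getName(), usage.getUsed() >> 20, max));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(report());
    }
}
```

This view covers heap and JVM-managed non-heap pools only. Native allocations outside those pools do not appear here, which is where NMT output is the better source.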
When the backlog or executor branch should lead
Start by checking your thread pool and queue settings when:
- executor queues keep growing
- workers stay busy or blocked
- backlog stands out before memory becomes the loudest signal
This branch is really about explaining why accepted work is not becoming completed work quickly enough.
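The gap between accepted and completed work can be read directly from a ThreadPoolExecutor. The following is a minimal sketch using only the executor's own getters; the class name ExecutorBacklogProbe and the output format are illustrative assumptions, and the task counts are documented by the JDK as approximations.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: a one-line view of accepted versus completed work.
final class ExecutorBacklogProbe {

    static String describe(ThreadPoolExecutor pool) {
        long accepted = pool.getTaskCount();           // approximate: scheduled so far
        long completed = pool.getCompletedTaskCount(); // approximate: finished so far
        return String.format(
            "active=%d/%d queued=%d accepted=%d completed=%d pending=%d",
            pool.getActiveCount(), pool.getMaximumPoolSize(),
            pool.getQueue().size(), accepted, completed, accepted - completed);
    }

    public static void main(String[] args) throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            2, 2, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        for (int i = 0; i < 10; i++) {
            pool.execute(() -> {
                try {
                    Thread.sleep(200); // simulated slow task
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        System.out.println(describe(pool)); // queue is non-empty while workers sleep
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println(describe(pool)); // queue has drained
    }
}
```

Sampling this line every few seconds shows whether the queue is draining or only growing. A queue that grows while active stays pinned at the maximum pool size suggests slow or blocked workers rather than too little heap.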
When the GC or retained-heap branch should lead
Start with Java GC Pauses Too Long when:
- pause spikes are more visible than a clean crash
- traffic or payload changes increased allocation churn
- old-generation growth looks suspicious
If you need retained-object evidence rather than pause symptoms, move next to analyzing a heap dump to spot long-lived large objects.
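To check whether collection cost is rising before reaching for GC logs, the cumulative counters on GarbageCollectorMXBean can be sampled twice and compared. This is a rough sketch under the assumption that reported collection time is an acceptable proxy for pause cost; for concurrent collectors that assumption is weak, and GC logs remain the more reliable source. The class name GcPauseSample is hypothetical.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: cumulative GC counters summed across all collectors.
final class GcPauseSample {

    /** Total collection time in milliseconds; undefined (-1) values are skipped. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime();
            if (millis > 0) total += millis;
        }
        return total;
    }

    /** Total number of collections; undefined (-1) values are skipped. */
    static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) total += count;
        }
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        long millisBefore = totalGcMillis();
        long countBefore = totalGcCount();
        Thread.sleep(1000); // sampling interval; widen it for a calmer signal
        System.out.printf("collections=%d gcMillis=%d in the last second%n",
            totalGcCount() - countBefore, totalGcMillis() - millisBefore);
    }
}
```

A GC time delta that takes up a growing share of each interval is a reason to stay on this branch. A flat delta alongside falling throughput suggests looking at CPU or backlog instead.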
When the CPU or contention branch should lead
Start by getting a thread dump to see where hot threads are spending their time when:
- CPU remains hot while throughput gets worse
- hot threads matter more than raw host-level metrics
- retries, spinning, contention, or wasted work look likely
If lock contention is especially obvious, find the exact lock acquisition bottleneck and review your concurrency design.
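Hot threads can also be ranked from inside the JVM with ThreadMXBean, provided the JVM supports per-thread CPU time measurement. The sketch below ranks threads by cumulative CPU time since each thread started, which is an assumption worth noting: a thread that was busy an hour ago still ranks high, so comparing two samples taken a few seconds apart gives a better picture of what is hot right now. The class name HotThreads is hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: threads ranked by cumulative CPU time.
final class HotThreads {

    /** Returns up to {@code limit} lines describing the threads with the most CPU time. */
    static List<String> top(int limit) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        List<String> lines = new ArrayList<>();
        if (!threads.isThreadCpuTimeSupported() || !threads.isThreadCpuTimeEnabled()) {
            return lines; // CPU time measurement is unavailable on this JVM
        }
        List<long[]> samples = new ArrayList<>(); // each entry: {threadId, cpuNanos}
        for (long id : threads.getAllThreadIds()) {
            long cpuNanos = threads.getThreadCpuTime(id); // -1 if the thread has ended
            if (cpuNanos >= 0) samples.add(new long[] {id, cpuNanos});
        }
        samples.sort(Comparator.comparingLong((long[] s) -> s[1]).reversed());
        for (long[] s : samples.subList(0, Math.min(limit, samples.size()))) {
            ThreadInfo info = threads.getThreadInfo(s[0]);
            if (info == null) continue; // thread ended after sampling
            lines.add(String.format("%s state=%s cpuMs=%d",
                info.getThreadName(), info.getThreadState(), s[1] / 1_000_000));
        }
        return lines;
    }

    public static void main(String[] args) {
        top(5).forEach(System.out::println);
    }
}
```

The thread names from this list can then be matched against a thread dump to see which stack frames the hot threads are actually executing.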
When the deadlock or stalled-progress branch should lead
Start by looking for lock ordering inversions or deadlocks when the service looks stuck rather than simply slow.
This branch fits when:
- thread dumps show waiting cycles
- lock ownership matters more than throughput metrics
- the service stops making real forward progress
In that situation, unblocking progress matters more than tuning throughput.
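Waiting cycles do not have to be found by reading a thread dump by eye. ThreadMXBean.findDeadlockedThreads() reports threads that are deadlocked on object monitors or ownable synchronizers. The following is a minimal sketch of such a check; the class name DeadlockCheck and the report format are illustrative assumptions.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper: a textual report of deadlocked threads, if any.
final class DeadlockCheck {

    /** Returns one line per deadlocked thread, or an empty string if none are found. */
    static String report() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long[] ids = threads.findDeadlockedThreads(); // null when no deadlock is detected
        if (ids == null) return "";
        StringBuilder out = new StringBuilder();
        for (ThreadInfo info : threads.getThreadInfo(ids, true, true)) {
            if (info == null) continue; // thread ended after detection
            out.append(info.getThreadName())
               .append(" waits for ").append(info.getLockName())
               .append(" held by ").append(info.getLockOwnerName())
               .append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String report = report();
        System.out.println(report.isEmpty() ? "no deadlock detected" : report);
    }
}
```

An empty report does not rule out stalled progress. Threads waiting on an external resource, or on a condition that is never signalled, stop forward progress without forming a lock cycle.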
Common wrong starts
Raising heap when backlog is the real driver
If queue growth is leading the incident, more heap may only hide the symptom longer.
Treating every throughput drop as a CPU problem
Long GC pauses can feel like hot-CPU incidents from the outside, even when the first branch should be memory and allocation behavior.
Tweaking executors while deadlock signals are already visible
If thread dumps show waiting cycles, pool-size changes can make the picture noisier instead of clearer.
A very short routing map
- OOM or memory-area pressure is clearest: start with memory
- queue growth is clearest: start with thread-pool backlog
- pause spikes dominate: start with GC
- hot threads or contention dominate: start with CPU
- forward progress has nearly stopped: start with deadlock
Then compare one adjacent branch immediately after. Java incidents often overlap across backlog, memory, and lock behavior.
A practical incident order
- Write down the first bottleneck that became obvious.
- Decide whether the system is crashing, stalling, or saturating.
- Separate memory-area pressure from backlog pressure.
- Separate hot-thread contention from blocked-progress deadlock.
- Read the narrowest guide that matches the strongest symptom, then compare one neighboring branch.
That order usually reduces the chance of tuning the wrong layer first.
FAQ
Q. Is this a JVM tuning guide?
No. It is a symptom-first troubleshooting hub.
Q. What if queue growth and memory pressure appear together?
Start with the symptom that became visible first, then compare the adjacent guide right after.
Q. What if GC pauses got worse but OOM never happened?
That is still a valid memory-pressure branch. Start with GC pauses, then move to heap-dump analysis if needed.
Q. When should I check deadlock before CPU?
When forward progress has mostly stopped and thread dumps point to waiting cycles rather than just hot loops.
Read Next
- If memory pressure is the clearest symptom, prioritize finding the OOM root cause via heap dumps.
- If queue backlog is easiest to see or thread starvation is suspected, review the Java ForkJoinPool Starvation case study.
- If pause spikes are more visible than crashes, continue with Java GC Pauses Too Long.
- If hot threads or contention dominate, use thread dumps to isolate the exact bottleneck.