Our Java service kept getting OOMKilled at a 512Mi limit even though the JVM heap was capped at 256MB. The issue was that -Xmx only controls the heap: thread stacks, metaspace, and off-heap buffers pushed total RSS well past the container limit. Switching to -XX:MaxRAMPercentage=70 and raising the limit to 768Mi stopped the kills entirely.
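As a rough check on the numbers in that account, the sketch below works out how a container limit splits between heap and everything else. It assumes a container-aware JVM that sizes its maximum heap from the container limit when -XX:MaxRAMPercentage is set. The function name, and the idea of treating the remainder as the non-heap budget, are illustrative and not taken from any tool.

```python
def heap_split_mib(limit_mib, max_ram_percentage):
    """Return (max_heap_mib, non_heap_budget_mib) for a container limit.

    Assumption: the JVM caps heap at max_ram_percentage of the container
    limit, and everything left must cover thread stacks, metaspace,
    off-heap buffers, and other native memory.
    """
    max_heap = limit_mib * max_ram_percentage / 100
    return round(max_heap, 1), round(limit_mib - max_heap, 1)


# The setup from the account above: a 768Mi limit with MaxRAMPercentage=70.
heap, non_heap = heap_split_mib(768, 70)
print(f"heap up to {heap} MiB, about {non_heap} MiB left for non-heap memory")
```

The remainder is only a worst-case figure for when the heap is completely full, so measured RSS is still the number to trust when sizing the limit.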
When a pod is OOMKilled, Kubernetes is telling you the container hit its memory limit and the kernel killed it. The useful question is not just “should I raise the limit?” It is whether the workload has a leak, bursty memory behavior, or simply bad request and limit sizing for real production usage.
The short version: look at usage pattern and limits together. A higher limit alone does not fix a leak, and a limit that is too low can make an otherwise healthy workload look broken.
Quick Answer
If a pod is OOMKilled, first determine whether memory climbed steadily, spiked briefly, or simply hit a limit that is too small for real production usage. Most incidents fall into three buckets: undersized limits, leak-like retention inside the app, or bursty workload spikes that cross the limit even though average usage looks normal.
What to Check First
- did memory grow steadily or spike sharply before restart?
- how close was usage to the configured limit?
- did the issue begin after a deploy, traffic jump, or payload-size change?
- are requests and limits far away from observed runtime usage?
- is backlog or dependency slowness causing more objects to stay in memory?
Start with usage pattern and limits together
Memory incidents are easy to oversimplify.
You need to compare:
- actual memory usage
- steady growth versus short spikes
- requests versus limits
- restart timing
Without that, it is hard to tell whether the incident is undersizing, retention, or burst pressure.
What OOMKilled usually means
In practice, this often points to one of these:
- the memory limit is too low
- the application retains memory too long
- short spikes cross the limit even without a classic leak
- requests and limits are sized so far from real usage that behavior is hard to interpret
The restart reason is clear, but the fix path still depends on the usage shape.
Common causes
1. The memory limit is too low
The workload may be normal, but the configured limit sits below realistic peak usage.
This is common after:
- higher traffic
- larger payloads
- more concurrent work
2. The application leaks or retains memory
Memory can grow steadily until the container hits its limit and gets killed.
This does not always mean a classic forever leak. Unbounded caches, queues, or large retained graphs can produce the same result operationally.
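One way to keep a cache from turning into leak-like retention is to cap its size. The following is a minimal sketch of a least-recently-used cache with a fixed entry count. BoundedCache and max_entries are hypothetical names, and a real service would more likely bound by bytes or use an existing caching library.

```python
from collections import OrderedDict


class BoundedCache:
    """LRU cache that never holds more than max_entries items."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._data = OrderedDict()

    def get(self, key, default=None):
        if key in self._data:
            # Mark the entry as most recently used.
            self._data.move_to_end(key)
            return self._data[key]
        return default

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        # Evict the least recently used entries once the cap is exceeded.
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)

    def __len__(self):
        return len(self._data)


cache = BoundedCache(max_entries=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")      # "a" is now the most recently used entry
cache.put("c", 3)   # evicts "b", the least recently used entry
```

With a cap in place, memory held by the cache stays roughly proportional to max_entries instead of growing with traffic.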
3. Bursty workloads create short spikes
Even without a leak, caches, requests, parsing, or batch work can create spikes that cross the limit briefly.
These incidents often look random if you only inspect average usage.
4. Requests and limits are misaligned
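The scheduler places a pod by its memory request, while the limit is what gets enforced at runtime. When the two are far apart, or far from real usage, scheduling and runtime behavior become hard to reason about.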
A pod may schedule under one assumption and then behave very differently at runtime under the actual memory limit.
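To put a number on that mismatch, the sketch below converts memory quantities to bytes and compares request, limit, and observed peak. It is a simplified illustration: parse_memory handles only a subset of the suffixes Kubernetes accepts, and the function names and sample values are made up for the example.

```python
# Multipliers for a subset of Kubernetes memory suffixes (assumption: the
# values being parsed use only these; other valid forms are not handled).
_UNITS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_memory(quantity):
    """Convert a quantity such as '512Mi' or '1Gi' to bytes."""
    # Try two-letter binary suffixes before one-letter decimal ones.
    for suffix in sorted(_UNITS, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * _UNITS[suffix])
    return int(quantity)  # plain byte count


def sizing_gap(request, limit, observed_peak):
    """Express the limit and the observed peak as ratios of the request and limit."""
    req, lim, peak = (parse_memory(q) for q in (request, limit, observed_peak))
    return {
        "limit_over_request": round(lim / req, 2),
        "peak_over_request": round(peak / req, 2),
        "peak_over_limit": round(peak / lim, 2),
    }


# Hypothetical pod: scheduled as if it needs 128Mi, but peaks near 700Mi.
print(sizing_gap("128Mi", "1Gi", "700Mi"))
```

A peak several times larger than the request suggests the pod is being placed on nodes under an assumption that does not match what it actually uses.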
5. The real issue started elsewhere
Queue backlog, retries, or downstream slowdowns can indirectly create higher memory retention inside the pod.
That is why OOM incidents often connect to workload behavior, not only memory settings.
A quick triage table
| Symptom | Most likely cause | Check first |
|---|---|---|
| Memory climbs slowly before every restart | leak or retention | memory trend over time |
| Restarts happen only during bursts | spikes crossing limit | peak usage versus limit |
| Pod dies after traffic or payload increase | limit too low for current load | recent usage change and configured ceiling |
| Requests are tiny but runtime usage is large | misleading sizing | requests, limits, and actual peak usage |
| OOM started with queue backlog or retries | indirect retention from upstream pressure | in-flight work and dependency slowness |
A practical debugging order
1. Confirm the pod was killed by memory pressure
Do not assume every restart is an OOM kill, even if the pod is unstable.
Check the container's last terminated state: an OOM kill shows the reason OOMKilled, typically with exit code 137.
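If you want to script that check, here is a minimal sketch that scans a Pod object for containers whose last termination was an OOM kill. It assumes the input is the parsed JSON of a pod, such as the output of kubectl get pod <pod> -n <ns> -o json. The function name and the sample data are hypothetical.

```python
def oom_killed_containers(pod):
    """Return (name, restart_count, exit_code) for containers last killed by OOM.

    `pod` is the parsed JSON of a Pod object, for example the output of
    `kubectl get pod <pod> -n <ns> -o json` loaded with json.loads.
    """
    hits = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        # lastState.terminated is absent for containers that never restarted.
        terminated = status.get("lastState", {}).get("terminated") or {}
        if terminated.get("reason") == "OOMKilled":
            hits.append(
                (status["name"], status.get("restartCount", 0), terminated.get("exitCode"))
            )
    return hits


# Trimmed, made-up pod status used only to exercise the function.
sample_pod = {
    "status": {
        "containerStatuses": [
            {"name": "app", "restartCount": 4,
             "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}},
            {"name": "sidecar", "restartCount": 1,
             "lastState": {"terminated": {"reason": "Error", "exitCode": 1}}},
        ]
    }
}
print(oom_killed_containers(sample_pod))
```

Checking per container matters in multi-container pods, where a sidecar may be the one restarting for an unrelated reason.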
2. Compare memory usage trend with the configured limit
You want to know:
- does memory climb steadily?
- does it spike sharply?
- how close does it get to the limit before restart?
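Those questions can be answered roughly from exported samples. The sketch below assumes you have a list of memory readings in MiB from your monitoring system and reports average and peak usage as a percentage of the limit. The function name and the sample numbers are illustrative only.

```python
from statistics import mean


def limit_headroom(samples_mib, limit_mib):
    """Report mean and peak usage as a percentage of the memory limit."""
    return {
        "mean_pct_of_limit": round(100 * mean(samples_mib) / limit_mib, 1),
        "peak_pct_of_limit": round(100 * max(samples_mib) / limit_mib, 1),
    }


# Made-up readings: mostly around 210 MiB, with one burst close to a 512 MiB limit.
samples = [210, 205, 220, 215, 498, 212]
print(limit_headroom(samples, 512))
```

A large gap between the two percentages is a hint that average usage is hiding short bursts.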
3. Separate steady growth from short spikes
That distinction is critical.
Steady growth suggests retention or leak-like behavior.
Short spikes suggest bursty workload or badly sized limits.
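As a rough illustration of that distinction, the sketch below labels a series of memory samples as steady growth, a short spike, or flat. It is a simple heuristic, not a standard method: the thresholds growth_ratio and spike_ratio are arbitrary assumptions and would need tuning against real data.

```python
from statistics import median


def classify_memory_shape(samples_mib, growth_ratio=1.3, spike_ratio=1.5):
    """Label memory samples (oldest first) as steady growth, short spike, or flat."""
    if len(samples_mib) < 6:
        raise ValueError("need at least 6 samples")
    third = len(samples_mib) // 3
    early = median(samples_mib[:third])    # typical level at the start
    late = median(samples_mib[-third:])    # typical level at the end
    overall = median(samples_mib)
    # Growth: the end of the window sits clearly above the start.
    if late >= early * growth_ratio:
        return "steady growth"
    # Spike: a single reading far above the typical level.
    if max(samples_mib) >= overall * spike_ratio:
        return "short spike"
    return "flat"


growing = [100, 110, 120, 130, 140, 150, 160, 170, 180]
bursty = [200, 205, 198, 202, 480, 201, 199, 203, 200]
print(classify_memory_shape(growing), "/", classify_memory_shape(bursty))
```

Medians are used instead of means so that one burst does not make a flat series look like growth.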
4. Check whether workload or dependency changes altered memory behavior
Recent changes in:
- request size
- traffic level
- concurrency
- response aggregation
- retries
may explain why the pod now crosses limits.
5. Change requests and limits only after the usage pattern is clear
More memory may buy time, but it should not replace diagnosis if retention is the real problem.
Quick commands
kubectl describe pod <pod> -n <ns>
kubectl top pod <pod> -n <ns>
kubectl get pod <pod> -n <ns> -o yaml
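These help you compare restart reason, live memory usage, and configured requests and limits in one pass. Note that kubectl top needs the Metrics API (usually metrics-server) and shows only a current snapshot, so use your monitoring history to see the trend.
Look for usage close to the limit, repeated OOM kill events, and requests and limits that are clearly mis-sized for the workload.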
What to change after you find the memory pattern
If the limit is just too low
Raise it intentionally based on observed peaks.
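One way to make "based on observed peaks" concrete is to take the highest observed peak, add headroom, and round up to a tidy step. The sketch below does that. The 25% headroom and 64 MiB step are assumptions chosen for illustration, not recommendations from any Kubernetes documentation.

```python
import math


def suggest_limit_mib(observed_peaks_mib, headroom=0.25, step_mib=64):
    """Suggest a memory limit: highest peak plus headroom, rounded up to a step."""
    target = max(observed_peaks_mib) * (1 + headroom)
    return math.ceil(target / step_mib) * step_mib


# Made-up peaks from three busy periods, in MiB.
print(suggest_limit_mib([410, 455, 498]))
```

The point of the calculation is that the new limit is tied to measured peaks, so you can revisit it when the peaks change.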
If retention is the issue
Fix cache, queue, or object-lifetime behavior before relying on bigger limits.
If spikes are the issue
Smooth the workload or size limits for realistic burst behavior.
If requests are misleading
Bring requests and limits closer to real workload needs so scheduling and runtime behavior make sense together.
If the incident started with backlog
Fix the backlog cause too, not just the pod memory ceiling.
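If backlog is what filled memory, one option is to bound the in-process queue so that excess work is rejected or deferred instead of accumulating. This is a minimal sketch using the standard library queue module. try_enqueue and the maxsize value are illustrative, and what to do with rejected work (retry later, shed, or push back on the caller) depends on the service.

```python
import queue


def try_enqueue(work_queue, item):
    """Add item if there is room; return False instead of growing without bound."""
    try:
        work_queue.put_nowait(item)
        return True
    except queue.Full:
        return False


# A queue that holds at most 2 pending items.
pending = queue.Queue(maxsize=2)
accepted = [try_enqueue(pending, job) for job in range(5)]
print(accepted, pending.qsize())
```

Bounding the queue turns a slow downstream dependency into visible rejections rather than silent memory growth.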
A useful incident question
Ask this:
Did the pod die because memory kept growing, or because a workload burst briefly crossed a limit that was too tight for normal operation?
That distinction usually determines the right fix.
Bottom Line
An OOMKilled event tells you the container crossed its memory limit, not why it got there. Start by separating steady growth from short spikes, then compare that pattern with requests, limits, and recent workload changes. Raise limits only after you know whether the incident is simple undersizing or a behavior problem that will eventually consume the extra memory too.
FAQ
Q. Should I just increase the limit first?
Not before you know whether the issue is undersizing, bursty usage, or retention.
Q. What is the fastest first step?
Confirm the OOM kill, then compare real memory usage with the current limit.
Q. Is every OOMKilled incident a memory leak?
No. Many come from spikes, backlog, or limits that are simply too low.
Q. Do requests matter too, or only limits?
Requests matter because they drive scheduling and placement, while limits set the kill threshold.
Read Next
- If the restart symptom is more general than memory pressure, compare with Kubernetes CrashLoopBackOff.
- If readiness never stabilizes even when the container stays alive, continue with Kubernetes Readiness Probe Failed.