Mar 14, 2026

Last updated on Mar 31, 2026

Kubernetes OOMKilled: What to Check First

Our Java service kept getting OOMKilled at a 512Mi limit even though the JVM heap was capped at 256MB. The cause was that -Xmx only caps the heap: thread stacks, metaspace, and off-heap buffers pushed total RSS up to the container limit. Switching to -XX:MaxRAMPercentage=70 and raising the limit to 768Mi stopped the kills entirely.
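As a rough sketch of the arithmetic behind that change, assuming the JVM derives its heap cap from the container memory limit: the helper names below are made up for illustration, and the math is integer-only.

```shell
# Heap cap in MiB implied by a MaxRAMPercentage value for a given container limit.
# Arguments: container limit in MiB, percentage as a whole number.
heap_cap_mib() {
  echo $(( $1 * $2 / 100 ))
}

# Memory left in the container for everything outside the heap
# (thread stacks, metaspace, off-heap buffers).
non_heap_budget_mib() {
  echo $(( $1 - $1 * $2 / 100 ))
}

# Example with the numbers from the note above:
#   heap_cap_mib 768 70         prints 537
#   non_heap_budget_mib 768 70  prints 231
```

The point is that the heap cap is only part of the container's footprint, so the limit has to cover both numbers.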

When a pod is OOMKilled, Kubernetes is telling you the container hit its memory limit and the kernel killed it. The useful question is not just “should I raise the limit?” It is whether the workload has a leak, bursty memory behavior, or simply bad request and limit sizing for real production usage.

The short version: look at usage pattern and limits together. A high limit alone does not fix a leak, and a low limit can make an otherwise healthy workload look broken.

Quick Answer

If a pod is OOMKilled, first determine whether memory climbed steadily, spiked briefly, or simply hit a limit that is too small for real production usage. Most incidents fall into three buckets: undersized limits, leak-like retention inside the app, or bursty workload spikes that cross the limit even though average usage looks normal.

What to Check First

did memory grow steadily or spike sharply before restart?
how close was usage to the configured limit?
did the issue begin after a deploy, traffic jump, or payload-size change?
are requests and limits far away from observed runtime usage?
is backlog or dependency slowness causing more objects to stay in memory?

Start with usage pattern and limits together

Memory incidents are easy to oversimplify.

You need to compare:

actual memory usage
steady growth versus short spikes
requests versus limits
restart timing

Without that, it is hard to tell whether the incident is undersizing, retention, or burst pressure.

What `OOMKilled` usually means

In practice, this often points to one of these:

the memory limit is too low
the application retains memory too long
short spikes cross the limit even without a classic leak
requests and limits are so far apart that scheduling and runtime behavior diverge

The restart reason is clear, but the fix path still depends on the usage shape.

Common causes

1. The memory limit is too low

The workload may be normal, but the configured limit sits below realistic peak usage.

This is common after:

higher traffic
larger payloads
more concurrent work

2. The application leaks or retains memory

Memory can grow steadily until the container hits its limit and gets killed.

This does not always mean a classic leak that grows forever. Unbounded caches, queues, or large retained object graphs can produce the same result operationally.

3. Bursty workloads create short spikes

Even without a leak, caches, requests, parsing, or batch work can create spikes that cross the limit briefly.

These incidents often look random if you only inspect average usage.

4. Requests and limits are misaligned

Poor request sizing makes scheduling and runtime behavior harder to reason about.

A pod may schedule under one assumption and then behave very differently at runtime under the actual memory limit.

5. The real issue started elsewhere

Queue backlog, retries, or downstream slowdowns can indirectly create higher memory retention inside the pod.

That is why OOM incidents often connect to workload behavior, not only memory settings.

A quick triage table

Symptom	Most likely cause	Check first
Memory climbs slowly before every restart	leak or retention	memory trend over time
Restarts happen only during bursts	spikes crossing limit	peak usage versus limit
Pod dies after traffic or payload increase	limit too low for current load	recent usage change and configured ceiling
Requests are tiny but runtime usage is large	misleading sizing	requests, limits, and actual peak usage
OOM started with queue backlog or retries	indirect retention from upstream pressure	in-flight work and dependency slowness

A practical debugging order

1. Confirm the pod was killed by memory pressure

Do not assume every restart is an OOM even if the pod is unstable.

Check that the container's last termination reason is actually OOMKilled (typically with exit code 137) rather than a failed liveness probe or an application crash.
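One way to check this from the command line is to read the last termination state straight from the pod status. This is a minimal sketch: the function names are invented for illustration, while the status fields are the ones Kubernetes reports under `containerStatuses`.

```shell
# Print one line per container: name, last termination reason, exit code (tab-separated).
# Arguments: pod name, namespace.
last_termination() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}

# Exit status 0 only if at least one container's last termination was an OOM kill.
was_oom_killed() {
  last_termination "$1" "$2" |
    awk -F '\t' '$2 == "OOMKilled" { found = 1 } END { exit !found }'
}

# Usage (pod and namespace names are placeholders):
#   was_oom_killed my-pod my-namespace && echo "OOM kill confirmed"
```

A container that has never restarted has no last termination state, so its reason and exit code fields come back empty.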

2. Compare memory usage trend with the configured limit

You want to know:

does memory climb steadily?
does it spike sharply?
how close does it get to the limit before restart?
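To put a number on "how close", here is a small sketch that turns a usage reading and a limit into a percentage. The helper names are hypothetical, and it only understands the Ki, Mi, and Gi suffixes.

```shell
# Convert a memory quantity with a Ki/Mi/Gi suffix to whole MiB.
# Any other form (plain bytes, decimal suffixes such as M or G) is rejected.
to_mib() {
  case "$1" in
    *Ki) echo $(( ${1%Ki} / 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *) echo "unsupported quantity: $1" >&2; return 1 ;;
  esac
}

# Percentage of the limit in use, rounded down.
# Arguments: observed usage, configured limit.
pct_of_limit() {
  usage=$(to_mib "$1") || return 1
  limit=$(to_mib "$2") || return 1
  echo $(( usage * 100 / limit ))
}

# Example: pct_of_limit 460Mi 512Mi  prints 89
```

A reading that sits near 100 shortly before each restart fits an OOM kill. A low reading may only mean the sample missed the peak.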

3. Separate steady growth from short spikes

That distinction is critical.

Steady growth suggests retention or leak-like behavior.

Short spikes suggest a bursty workload or badly sized limits.
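If no metrics dashboard is available, a crude way to see the shape is to sample `kubectl top` at a fixed interval. This is only a sketch: it assumes the metrics API is available in the cluster, `sample_memory` is a made-up name, and sampling like this can miss very short spikes.

```shell
# Print "<unix-seconds> <memory>" once per interval for a single pod.
# Arguments: pod, namespace, number of samples, seconds between samples.
sample_memory() {
  i=0
  while [ "$i" -lt "$3" ]; do
    # With --no-headers the columns are: name, CPU, memory.
    mem=$(kubectl top pod "$1" -n "$2" --no-headers | awk '{ print $3 }')
    echo "$(date +%s) $mem"
    i=$(( i + 1 ))
    if [ "$i" -lt "$3" ]; then
      sleep "$4"
    fi
  done
}

# Usage (placeholder names): 20 samples, 30 seconds apart.
#   sample_memory my-pod my-namespace 20 30
```

Values that rise from sample to sample point toward retention. A flat line with occasional jumps points toward bursts.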

4. Check whether workload or dependency changes altered memory behavior

Recent changes in:

request size
traffic level
concurrency
response aggregation
retries

may explain why the pod now crosses limits.

5. Change requests and limits only after the usage pattern is clear

More memory may buy time, but it should not replace diagnosis if retention is the real problem.

Quick commands

kubectl describe pod <pod> -n <ns>
kubectl top pod <pod> -n <ns>
kubectl get pod <pod> -n <ns> -o yaml

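Together, these show the restart reason, current memory usage, and configured requests and limits. Note that kubectl top needs the metrics API (typically metrics-server) and reports only current usage, not history.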

Look for memory spikes near the limit, repeated OOM kill events, and whether requests and limits are clearly mis-sized for the workload.
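When the full YAML is too noisy, a narrower query can pull out just the memory settings. This is a sketch with a hypothetical function name, using the standard `resources` fields of the pod spec.

```shell
# Print one line per container: name, memory request, memory limit (tab-separated).
# An unset request or limit shows up as an empty field.
# Arguments: pod name, namespace.
memory_settings() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources.requests.memory}{"\t"}{.resources.limits.memory}{"\n"}{end}'
}

# Usage (placeholder names):
#   memory_settings my-pod my-namespace
```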

What to change after you find the memory pattern

If the limit is just too low

Raise it intentionally based on observed peaks.
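A minimal sketch of what "based on observed peaks" can look like: take the peak, add headroom, and apply the result. The function names and the headroom percentage are assumptions for illustration, not Kubernetes defaults.

```shell
# Suggest a memory limit in MiB: observed peak plus a headroom percentage.
# Arguments: observed peak in MiB, headroom as a whole-number percentage.
suggest_limit_mib() {
  echo $(( $1 * (100 + $2) / 100 ))
}

# Apply a memory limit (in MiB) to every container in a Deployment.
# Arguments: deployment name, namespace, limit in MiB.
apply_memory_limit() {
  kubectl set resources deployment "$1" -n "$2" --limits="memory=${3}Mi"
}

# Example: suggest_limit_mib 600 25  prints 750
# Usage (placeholder names):
#   apply_memory_limit my-deployment my-namespace "$(suggest_limit_mib 600 25)"
```

Changing resources this way updates the pod template, so it triggers a rollout. If the manifests live in version control, the same change belongs there too.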

If retention is the issue

Fix cache, queue, or object-lifetime behavior before relying on bigger limits.

If spikes are the issue

Smooth the workload or size limits for realistic burst behavior.

If requests are misleading

Bring requests and limits closer to real workload needs so scheduling and runtime behavior make sense together.

If the incident started with backlog

Fix the backlog cause too, not just the pod memory ceiling.

A useful incident question

Ask this:

Did the pod die because memory kept growing, or because a workload burst briefly crossed a limit that was too tight for normal operation?

That distinction usually determines the right fix.

Bottom Line

An OOMKilled event tells you the container crossed its memory limit, not why it got there. Start by separating steady growth from short spikes, then compare that pattern with requests, limits, and recent workload changes. Raise limits only after you know whether the incident is simple undersizing or a behavior problem that will eventually consume the extra memory too.

FAQ

Q. Should I just increase the limit first?

Not before you know whether the issue is undersizing, bursty usage, or retention.

Q. What is the fastest first step?

Confirm the OOM kill, then compare real memory usage with the current limit.

Q. Is every `OOMKilled` incident a memory leak?

No. Many come from spikes, backlog, or limits that are simply too low.

Q. Do requests matter too, or only limits?

Requests matter because the scheduler uses them to place the pod and they influence eviction order under node memory pressure, while limits set the threshold at which the container is killed.
