Kubernetes OOMKilled: What to Check First


When a pod is OOMKilled, Kubernetes is telling you the container hit its memory limit and the kernel killed it. The useful question is not just "should I raise the limit?" but whether the workload has a leak, bursty memory behavior, or request and limit sizing that simply does not match real production usage.

The short version: look at usage pattern and limits together. A high limit alone does not fix a leak, and a low limit can hide an otherwise healthy workload.


Quick Answer

If a pod is OOMKilled, first determine whether memory climbed steadily, spiked briefly, or simply hit a limit that is too small for real production usage. Most incidents fall into three buckets: undersized limits, leak-like retention inside the app, or bursty workload spikes that cross the limit even though average usage looks normal.

What to Check First

  • did memory grow steadily or spike sharply before restart?
  • how close was usage to the configured limit?
  • did the issue begin after a deploy, traffic jump, or payload-size change?
  • are requests and limits far from observed runtime usage?
  • is backlog or dependency slowness causing more objects to stay in memory?

Start with usage pattern and limits together

Memory incidents are easy to oversimplify.

You need to compare:

  • actual memory usage
  • steady growth versus short spikes
  • requests versus limits
  • restart timing

Without that, it is hard to tell whether the incident is undersizing, retention, or burst pressure.


What OOMKilled usually means

In practice, this often points to one of these:

  • the memory limit is too low
  • the application retains memory too long
  • short spikes cross the limit even without a classic leak
  • request and limit sizing makes the behavior misleading

The restart reason is clear, but the fix path still depends on the usage shape.


Common causes

1. The memory limit is too low

The workload may be normal, but the configured limit sits below realistic peak usage.

This is common after:

  • higher traffic
  • larger payloads
  • more concurrent work

2. The application leaks or retains memory

Memory can grow steadily until the container hits its limit and gets killed.

This does not always mean a classic, ever-growing leak. Unbounded caches, queues, or large retained object graphs can produce the same result operationally.

3. Bursty workloads create short spikes

Even without a leak, caches, requests, parsing, or batch work can create spikes that cross the limit briefly.

These incidents often look random if you only inspect average usage.

4. Requests and limits are misaligned

Poor request sizing makes scheduling and runtime behavior harder to reason about.

A pod may schedule under one assumption and then behave very differently at runtime under the actual memory limit.
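One quick sanity check is whether the limit sits far above the request. The sketch below uses hypothetical numbers and an illustrative 4x threshold; in practice you would read the real values from the pod spec.

```shell
# Hypothetical observed settings for one container
request_mi=128
limit_mi=1024

# Flag a large gap between request and limit (4x is an illustrative threshold)
ratio=$(( limit_mi / request_mi ))
if [ "$ratio" -ge 4 ]; then
  echo "requests and limits are ${ratio}x apart: ${request_mi}Mi requested, ${limit_mi}Mi limit"
fi
```

A pod scheduled against the 128Mi request can legitimately grow to eight times that size before the kill threshold, which is exactly the "schedules one way, behaves another" mismatch described above.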

5. The real issue started elsewhere

Queue backlog, retries, or downstream slowdowns can indirectly create higher memory retention inside the pod.

That is why OOM incidents often connect to workload behavior, not only memory settings.

A quick triage table

Symptom | Most likely cause | Check first
Memory climbs steadily before every restart | leak or retention | memory trend over time
Restarts happen only during bursts | spikes crossing the limit | peak usage versus limit
Pod dies after traffic or payload increase | limit too low for current load | recent usage change and configured ceiling
Requests are tiny but runtime usage is large | misleading sizing | requests, limits, and actual peak usage
OOM started with queue backlog or retries | indirect retention from upstream pressure | in-flight work and dependency slowness

A practical debugging order

1. Confirm the pod was killed by memory pressure

Do not assume every restart is an OOM kill just because the pod is unstable.

Make sure the restart reason actually points to memory kill behavior.
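To confirm, check the container's last termination state. The snippet below parses a saved `kubectl describe pod` excerpt (the sample text is made up); against a live cluster you could instead query `kubectl get pod <pod> -n <ns> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'`.

```shell
# Sample excerpt from `kubectl describe pod` (hypothetical values)
describe_excerpt='    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137'

# Pull the termination reason; OOMKilled with exit code 137 confirms a memory kill
reason=$(echo "$describe_excerpt" | awk '/Reason:/ {print $2}')
echo "last restart reason: $reason"
```

If the reason is something else, such as Error or Completed, you are looking at a different failure mode and the memory work below will not help.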

2. Compare memory usage trend with the configured limit

You want to know:

  • does memory climb steadily?
  • does it spike sharply?
  • how close does it get to the limit before restart?
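`kubectl top pod` reports live usage, and comparing it against the configured limit gives a rough headroom figure. A minimal sketch with made-up numbers:

```shell
# Hypothetical sample line from `kubectl top pod <pod> -n <ns>` (NAME CPU MEMORY)
top_line="api-7f9c4b 250m 412Mi"
limit_mi=512  # configured memory limit, also hypothetical

# Strip the Mi suffix and compute usage as a percentage of the limit
used_mi=$(echo "$top_line" | awk '{gsub(/Mi/, "", $3); print $3}')
pct=$(( used_mi * 100 / limit_mi ))
echo "memory at ${pct}% of limit"
```

A single reading close to the limit is a warning sign, but the trend over repeated samples matters more than any one number.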

3. Separate steady growth from short spikes

That distinction is critical.

Steady growth suggests retention or leak-like behavior.

Short spikes suggest bursty workload or badly sized limits.
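With a handful of usage samples, for example collected by looping `kubectl top pod` every minute, even a crude heuristic separates the two shapes. A toy sketch with hypothetical samples and thresholds:

```shell
# Memory samples in Mi over time (hypothetical, oldest first)
samples="300 340 385 420 470 510"

# Classify: a monotonic rise suggests retention; an isolated peak suggests a burst
pattern=$(echo "$samples" | awk '{
  up = 1
  for (i = 2; i <= NF; i++) if ($i <= $(i-1)) up = 0
  max = $1
  for (i = 2; i <= NF; i++) if ($i > max) max = $i
  if (up) print "steady growth"
  else if (max > 1.5 * $1) print "spike"
  else print "flat"
}')
echo "pattern: $pattern"
```

Real dashboards do this better, but the point stands: the shape of the curve, not the last value, tells you which bucket the incident falls into.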

4. Check whether workload or dependency changes altered memory behavior

Recent changes in:

  • request size
  • traffic level
  • concurrency
  • response aggregation
  • retries

may explain why the pod now crosses limits.

5. Change requests and limits only after the usage pattern is clear

More memory may buy time, but it should not replace diagnosis if retention is the real problem.


Quick commands

kubectl describe pod <pod> -n <ns>      # restart reason, last state, OOM events
kubectl top pod <pod> -n <ns>           # live memory usage (requires metrics-server)
kubectl get pod <pod> -n <ns> -o yaml   # configured requests and limits

These help you compare restart reason, live memory usage, and configured requests and limits in one pass.

Look for memory spikes near the limit, repeated OOM kill events, and whether requests and limits are clearly mis-sized for the workload.
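Repeated kills show up as multiple OOM entries in the pod's history. This sketch counts them in a saved events excerpt (the sample lines are made up); on a live cluster, `kubectl get events -n <ns> --field-selector involvedObject.name=<pod>` lists the events for one pod.

```shell
# Hypothetical event excerpt for one pod
events='Warning  OOMKilling  memory cgroup out of memory
Warning  BackOff     restarting failed container
Warning  OOMKilling  memory cgroup out of memory'

# Count OOM kill occurrences; more than one suggests a recurring pattern, not a one-off
kills=$(echo "$events" | grep -c OOMKilling)
echo "${kills} OOM kill events"
```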


What to change after you find the memory pattern

If the limit is just too low

Raise it intentionally based on observed peaks.
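A simple way to do that is to take the observed peak and add a margin; the 30% headroom below is only an illustrative choice, not a rule.

```shell
# Observed peak memory (hypothetical) plus headroom
peak_mi=700
new_limit_mi=$(( peak_mi * 130 / 100 ))  # 30% margin over the observed peak
echo "suggested limit: ${new_limit_mi}Mi"

# To apply (example deployment name is a placeholder):
#   kubectl set resources deployment/<name> -n <ns> --limits=memory=${new_limit_mi}Mi
```

Sizing from the peak rather than the average is the point: average-based limits are exactly what makes bursty workloads look "randomly" OOMKilled.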

If retention is the issue

Fix cache, queue, or object-lifetime behavior before relying on bigger limits.

If spikes are the issue

Smooth the workload or size limits for realistic burst behavior.

If requests are misleading

Bring requests and limits closer to real workload needs so scheduling and runtime behavior make sense together.

If the incident started with backlog

Fix the backlog cause too, not just the pod memory ceiling.


A useful incident question

Ask this:

Did the pod die because memory kept growing, or because a workload burst briefly crossed a limit that was too tight for normal operation?

That distinction usually determines the right fix.

Bottom Line

An OOMKilled event tells you the container crossed its memory limit, not why it got there. Start by separating steady growth from short spikes, then compare that pattern with requests, limits, and recent workload changes. Raise limits only after you know whether the incident is simple undersizing or a behavior problem that will eventually consume the extra memory too.


FAQ

Q. Should I just increase the limit first?

Not before you know whether the issue is undersizing, bursty usage, or retention.

Q. What is the fastest first step?

Confirm OOM kill, then compare real memory usage with the current limit.

Q. Is every OOMKilled incident a memory leak?

No. Many come from spikes, backlog, or limits that are simply too low.

Q. Do requests matter too, or only limits?

Requests matter because they shape placement and cluster behavior, while limits shape the kill threshold.

