
You run kubectl get pods and there it is: a pod stuck in CrashLoopBackOff, restart count climbing — 5, 12, 30. The container starts, dies, starts again, dies again, and Kubernetes keeps trying with longer and longer pauses between attempts. It's one of the most common states you'll hit running workloads on Kubernetes, and also one of the most misunderstood: CrashLoopBackOff isn't an error in itself — it's Kubernetes telling you that your container keeps exiting and it's backing off before the next retry.
This guide explains exactly what the state means, the handful of causes behind almost every crash loop, and a repeatable way to diagnose and fix it.
What CrashLoopBackOff Actually Means
Break the name in two. CrashLoop: your container starts and then crashes (exits), repeatedly. BackOff: Kubernetes is deliberately waiting longer between each restart attempt so it doesn't hammer a broken container — an exponential backoff that grows 10s, 20s, 40s, up to a 5-minute cap.
So the status describes the kubelet's behavior, not a specific fault. With the pod's default restartPolicy (Always), the kubelet keeps bringing the container back; when the container keeps dying soon after it starts, the pod shows CrashLoopBackOff. The crucial implication: the real cause is whatever made your container exit, and Kubernetes is only reporting the symptom. Your job is to find out why the process inside died.
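To make the backoff concrete, here is a minimal sketch of the schedule described above, assuming a 10-second first delay that doubles on each restart and is capped at 5 minutes. The function name and parameters are illustrative only; this is not a kubelet API.

```python
from typing import List


def backoff_delays(restarts: int, initial: int = 10, factor: int = 2, cap: int = 300) -> List[int]:
    """Return the wait in seconds before each of the first `restarts` restarts."""
    delays = []
    delay = initial
    for _ in range(restarts):
        delays.append(delay)
        # Double the delay for the next restart, but never exceed the cap.
        delay = min(delay * factor, cap)
    return delays


if __name__ == "__main__":
    print(backoff_delays(8))  # [10, 20, 40, 80, 160, 300, 300, 300]
```

Under these assumptions the delay reaches the cap by the sixth restart, which is why a pod with a high restart count appears to sit idle for minutes at a time.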
The Common Causes
Nearly every crash loop traces to one of these:
- The application errors out on startup. An unhandled exception, a stack trace, a panic — the process starts, throws, and exits non-zero. By far the most common cause.
- Missing or wrong configuration. A required environment variable, config file, or secret isn't there, so the app refuses to start. Database URLs and API keys are classic culprits.
- A failed dependency. The app can't reach its database, cache, or an upstream API at boot and exits instead of waiting — closely related to connection errors like "connection refused."
- OOMKilled. The container exceeds its memory limit, the kernel kills it (exit code 137), and it loops. A crash loop on exit 137 usually points to memory (a limit set too low, or a leak) rather than a startup bug.
- Misconfigured liveness probe. If a liveness probe is too aggressive — too short a timeout, or pointing at an endpoint that isn't ready yet — Kubernetes kills a healthy container before it finishes booting, creating an artificial loop.
- The command or entrypoint is wrong. A bad command/args, a missing binary, or a script that exits immediately (a container with nothing long-running to do will exit 0 and loop too).
Notice the pattern: the container almost always dies at or near startup. That's what makes crash loops both frustrating and, once you know where to look, fast to diagnose.
How to Diagnose It Step by Step
Work the problem in this order — each step narrows it down:
- Describe the pod. kubectl describe pod <name> is your first stop. Look at the State and Last State of the container, the exit code, and the Events at the bottom. The exit code alone often tells the story (more below).
- Read the logs. kubectl logs <name> shows only the current attempt, which may not have logged anything useful yet. Use kubectl logs <name> --previous to see the output of the instance that just died. This is where the actual stack trace or error usually lives.
- Interpret the exit code. It's a precise clue:
  - 0: exited "successfully"; usually means there's no long-running process (bad command, or a one-shot script). A loop on exit 0 is a design problem.
  - 1 / 2: a generic application error; check the --previous logs for the exception.
  - 137: killed by SIGKILL, almost always OOMKilled (memory limit); confirm in describe.
  - 139: segfault (SIGSEGV); a native/binary crash.
  - 143: SIGTERM; killed during shutdown, often probe- or eviction-related.
- Check the events. The Events section flags probe failures, image issues, and OOM kills explicitly.
- Inspect config and probes. If logs look clean, suspect a too-aggressive liveness probe or a missing env var / secret the app needs.
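The exit-code step can be partly automated. The sketch below assumes you have saved the pod object with kubectl get pod <name> -o json > pod.json; it reads each container's last terminated state from status.containerStatuses and attaches the hint from the list above. The helper names and hint wording are illustrative, not part of kubectl.

```python
import json
import sys
from typing import Any, Dict, List

# Signals behind the exit codes listed above (exit code = 128 + signal number).
SIGNALS = {9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}


def explain_exit_code(code: int, reason: str = "") -> str:
    """Map a container exit code (and optional reason) to a short hint."""
    if reason == "OOMKilled":
        return "OOMKilled: the container exceeded its memory limit"
    if code == 0:
        return "exited cleanly: likely no long-running process"
    if code > 128:
        signal_name = SIGNALS.get(code - 128, f"signal {code - 128}")
        return f"killed by {signal_name}"
    return "application error: read the --previous logs"


def last_terminations(pod: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Collect the last terminated state of each container in a Pod object."""
    results = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        terminated = status.get("lastState", {}).get("terminated")
        if not terminated:
            continue  # no previous terminated instance recorded for this container
        code = terminated.get("exitCode", -1)
        results.append({
            "container": status.get("name", ""),
            "restarts": status.get("restartCount", 0),
            "exit_code": code,
            "hint": explain_exit_code(code, terminated.get("reason", "")),
        })
    return results


# Only runs when a JSON file path is passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as f:
        for entry in last_terminations(json.load(f)):
            print(entry)
```

Run it as python explain_crash.py pod.json (the script name is arbitrary). Treat the hints as a starting point: a 137 without the OOMKilled reason is still a SIGKILL, but it may have come from something other than the memory limit.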
How to Fix the Common Cases
Once you've found the cause, the fix usually follows directly:
- App crashes on startup → fix the code path or the bad input the --previous logs revealed; reproduce locally with the same image and config.
- Missing config/secret → add the env var, ConfigMap, or Secret the app expects, and verify it's mounted/referenced correctly.
- Failed dependency → make the app wait and retry for dependencies at startup rather than exiting (an init container or readiness gating helps), so a slow database doesn't trigger a loop.
- OOMKilled (137) → raise the memory limit or fix the leak; see the OOMKilled guide.
- Aggressive liveness probe → increase initialDelaySeconds/timeoutSeconds, or use a startupProbe so slow-booting apps aren't killed before they're ready. Point liveness at a genuine health endpoint, not a heavy route.
- Wrong command → correct the command/args or Dockerfile entrypoint; ensure the main process runs in the foreground.
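For the failed-dependency case, "wait and retry" might look like the following sketch. It assumes the dependency is reachable over TCP and that a bounded number of attempts with a doubling delay is acceptable; the function names, defaults, and the injectable sleep parameter are illustrative choices, not a standard API.

```python
import socket
import time
from typing import Callable


def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_until_ready(check: Callable[[], bool], attempts: int = 10,
                     base_delay: float = 1.0, max_delay: float = 30.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call `check` up to `attempts` times, doubling the pause between tries."""
    delay = base_delay
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:  # no point sleeping after the final attempt
            sleep(delay)
            delay = min(delay * 2, max_delay)
    return False
```

At startup the app could call wait_until_ready(lambda: tcp_reachable(db_host, db_port)), where db_host and db_port are placeholders for your own configuration, and exit non-zero only if it returns False. The process then stays alive through a slow database start instead of crashing immediately and entering the backoff loop.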
A useful trick for stubborn cases: temporarily override the entrypoint with a sleep (command: ["sleep", "3600"]) so the pod stays up, then kubectl exec in and run the real command by hand to watch it fail interactively.
How Webalert Helps
Fixing a crash loop happens inside the cluster — but knowing your service is degraded, and confirming it's healthy again after the fix, is where outside-in monitoring earns its place:
- Health-check and uptime monitoring that tells you when crashing pods have actually taken your service down for users — the impact that decides how urgent the page is.
- External verification that your endpoints respond correctly once pods recover, independent of what the cluster's own dashboards claim.
- Multi-region checks so you know whether a rollout-induced crash loop is affecting real traffic everywhere or just failing internally.
- Sustained-failure alerting that distinguishes a brief restart from a service-down crash loop, without alert noise.
Kubernetes restarts the pod; Webalert tells you whether your users could reach it the whole time.
Summary
CrashLoopBackOff means your container keeps starting and exiting, and Kubernetes is backing off between restarts — it's the symptom, not the cause. The cause is almost always something at startup: an app that errors out, missing config or secrets, an unreachable dependency, an out-of-memory kill, or a liveness probe that's too aggressive.
Diagnose it methodically: kubectl describe pod for the exit code and events, kubectl logs --previous for the dying instance's output, and the exit code itself as a shortcut (137 = memory, 1/2 = app error, 0 = no long-running process). Fix the root cause, loosen overly strict probes, and make startup resilient to slow dependencies. Then confirm with outside-in monitoring that users can actually reach the recovered service.