LEARN · DEBUGGING GUIDE

Debugging CrashLoopBackOff Failures in Kubernetes Pods

CrashLoopBackOff means your pod container keeps failing and restarting. Skip the basics—here’s how to actually find, root-cause, and fix these failures in production.

Advanced · Kubernetes · 5 min read

What this usually means

CrashLoopBackOff means a container in the pod is crashing repeatedly, and the kubelet's exponential back-off logic is delaying restarts. The root cause can be anything that makes the container process exit or get killed: misconfigurations, missing secrets, application-level panics, failing liveness probes, or system-level issues like memory limits. It's not just about the app failing: Kubernetes restarts a container on liveness probe failures even if the process is technically running but not passing health checks.

( 01 )Fast diagnosis

The first ten minutes — establish facts before touching code.

  1. Run: kubectl describe pod <pod> to check Events for reasons: look for probe failures, failed mounts, OOMKilled, or permission issues.
  2. Run: kubectl logs <pod> --previous to get logs from the last terminated container; in a multi-container pod, repeat per container with -c <container>.
  3. Run: kubectl get rs,deploy,sts -o wide --selector=app=<label> to see if this is a cluster-wide config/deploy pattern.
  4. Check for liveness and readiness probe definitions in the pod spec: kubectl get pod <pod> -o yaml | grep -A10 'livenessProbe\|readinessProbe'
  5. Exec into a pod in init or running state (if possible): kubectl exec -it <pod> -- /bin/sh and check config file existence, secret mount, or service endpoint reachability.
  6. Check for recent config or secret rotations in git or your config management system in the past 24h.
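
The first checks above can be bundled into one helper so nothing gets skipped under pressure. This is a minimal sketch: the function name triage_pod and its optional namespace argument are illustrative assumptions, and the kubectl calls are the ones already listed in the steps.

```shell
# triage_pod POD [NAMESPACE]: collect the first-ten-minutes facts in one pass.
# triage_pod is a hypothetical helper name, not a kubectl feature.
triage_pod() {
  pod=$1
  ns=${2:-default}

  echo "== events and last state =="
  kubectl describe pod "$pod" -n "$ns"

  echo "== logs from the previous (crashed) container =="
  # --previous fails if no container has terminated yet; keep going anyway.
  kubectl logs "$pod" -n "$ns" --previous || true

  echo "== probe definitions =="
  # grep exits non-zero when the pod defines no probes; that is not an error here.
  kubectl get pod "$pod" -n "$ns" -o yaml |
    grep -A10 'livenessProbe\|readinessProbe' || true
}
```

Redirecting the output to a file (for example triage_pod <pod> <namespace> > triage.txt) keeps a snapshot of the evidence before the next restart overwrites it.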
( 02 )Where to look

The specific files, logs, configs, and dashboards that usually own this bug.

  • /var/log/pods/<namespace>_<podname>_* on the node (for persistent log output)
  • kubectl logs <pod> --previous and current (especially for short-lifetime pods)
  • Events section in kubectl describe pod <pod> (look for specifics like 'Liveness probe failed' or 'Back-off restarting failed container')
  • Pod YAML spec (kubectl get pod <pod> -o yaml) for env vars, image, command/args, probe settings
  • Cluster monitoring dashboards (Grafana/Prometheus) for CPU/mem/OOM trends at crash times
  • ConfigMap and Secret references in deployment YAMLs (check kubectl get configmap|secret and compare hashes)
  • Admission controller/webhook logs if mutations or policies could reject or alter pod spec
( 03 )Common root causes

Practical causes, not theory. These are the things you will actually find.

  • Application segfault or panic due to missing config/env/secret
  • Misconfigured liveness probe (wrong path, wrong port, or initialDelaySeconds too short for the app's startup time)
  • Container process exits immediately (wrong entrypoint/CMD, missing binary)
  • Secret/config volume not mounted or recently rotated with missing/invalid data
  • OOMKilled because the memory limit is too low for startup
  • Startup dependencies (e.g., DB, Redis) unavailable at pod startup
  • Filesystem permission errors: mount points owned by root, container running as non-root
( 04 )Fix patterns

Concrete fix directions. Pick the one that matches your root cause.

  • Increase initialDelaySeconds and periodSeconds for probes to avoid early probe-triggered kills
  • Patch deployment to temporarily disable liveness probe: kubectl patch deployment <dep> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container>","livenessProbe":null}]}}}}'
  • Ensure required ConfigMaps and Secrets exist and are referenced with the correct keys/paths; check for recent rotations
  • Bump resource requests/limits in the deployment manifest to give the container enough memory/CPU
  • Add sleep or readiness gates to entrypoint to wait for critical upstream services
  • Change securityContext to match the file ownership (e.g., set runAsUser/fsGroup to the owning UID/GID, or chown volumes in an initContainer; runAsUser: 0 also works but runs the container as root)
  • Change entrypoint or CMD to match the actual available binary in the container
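
For the "wait for upstream services" pattern, a bounded retry loop in the entrypoint is usually safer than a fixed sleep. The sketch below is one possible shape: wait_for is a hypothetical helper, and the host, port, and binary path in the usage comment are placeholders. It also assumes the image ships a check command such as nc.

```shell
# wait_for ATTEMPTS DELAY_SECONDS COMMAND...: retry COMMAND until it succeeds.
# Returns 0 on the first success, 1 if every attempt fails.
wait_for() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while [ "$n" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "dependency not ready (attempt $n/$attempts)" >&2
    n=$((n + 1))
    sleep "$delay"
  done
  return 1
}

# Example entrypoint usage (host, port, and binary path are placeholders):
#   wait_for 30 2 nc -z db.internal 5432 || exit 1
#   exec /app/server
```

Bounding the attempts matters: if the dependency is really down, the container still exits and the failure stays visible in restart counts instead of hanging silently.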
( 05 )How to verify

A fix you cannot prove is a guess. Close the loop.

  • kubectl get pods shows pod moves from CrashLoopBackOff to Running or Ready within a few minutes
  • Container restart count stabilizes (RESTARTS column stops incrementing)
  • kubectl logs <pod> no longer shows abrupt termination, segfault, or probe-failure messages
  • kubectl describe pod <pod> shows no recent Events of 'Back-off restarting failed container'
  • Application endpoints become available and pass liveness/readiness probes (200 OK or expected output)
  • Monitoring dashboards show healthy memory/CPU usage and no spikes at former crash intervals
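
The "restart count stabilizes" check can be scripted instead of eyeballed. A rough sketch, assuming a single-container pod (it reads only the first container's status); the helper names restart_count and restarts_stable are made up for illustration.

```shell
# restart_count POD [NAMESPACE]: print restartCount of the first container.
restart_count() {
  kubectl get pod "$1" -n "${2:-default}" \
    -o jsonpath='{.status.containerStatuses[0].restartCount}'
}

# restarts_stable POD NAMESPACE WAIT_SECONDS: succeed only if the restart
# count is unchanged after WAIT_SECONDS.
restarts_stable() {
  before=$(restart_count "$1" "$2")
  sleep "$3"
  after=$(restart_count "$1" "$2")
  echo "restarts before=$before after=$after"
  [ "$before" = "$after" ]
}
```

Because back-off delays can reach several minutes, a short wait can report a false "stable"; a wait of ten minutes or more (for example restarts_stable <pod> <namespace> 600) is a more convincing signal.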
( 06 )Mistakes to avoid

Things that make this bug worse or harder to find.

  • Blaming the application without checking probe configuration: liveness probes kill containers even if the process works
  • Forgetting to check --previous logs for short-lived containers (crashes may not appear in current logs)
  • Ignoring differences between liveness and readiness probe failures: readiness failures only remove the pod from service endpoints, while liveness failures restart the container
  • Looking for a restart limit to raise: with restartPolicy: Always the kubelet retries indefinitely, with exponential back-off capped at five minutes
  • Missing recent config/secret changes rolled out by another team or automation
  • Skipping checks on image SHA/tag drift (wrong image pushed under same tag)
( 07 )War story

CrashLoopBackOff After ConfigMap Rotation in Production

SRE on-call · Kubernetes 1.22, GKE, Node.js (alpine), Nginx sidecar, Prometheus

Timeline

  1. 11:04 PagerDuty alert: auth-api pod stuck in CrashLoopBackOff in production
  2. 11:06 kubectl get pods shows 9/10 pods in CrashLoopBackOff; restarts incrementing every 90 seconds
  3. 11:08 kubectl logs --previous shows: "Error: Cannot find module '/app/config/default.json'"
  4. 11:09 kubectl describe pod shows repeated liveness probe failures
  5. 11:11 Checked git history: new ConfigMap committed, file moved from config/default.json to config/app.json
  6. 11:16 Patched ConfigMap mount path and rolled deployment
  7. 11:18 Pods restart, move to Running, endpoints recover

I was paged for a spike in 5xx errors and noticed 9 of 10 auth-api pods in CrashLoopBackOff. The restarts were roughly every 90 seconds, so logs cycled fast.

Initial logs and describe pod output pointed to a missing config file. Scanning the config commit history, I spotted a recent ConfigMap update that changed the config file's name, but the pod spec hadn't been updated.

Once I patched the mount path and redeployed, pods launched cleanly, restart counters stabilized, and the API passed health checks within 90 seconds.

Root cause

ConfigMap renamed a critical config file, but deployment manifest still pointed at the old file path; containers exited immediately.

The fix

Patched deployment to mount ConfigMap at correct path, then rolled deployment to pick up fix.

The lesson

Whenever ConfigMaps are updated—especially file renames—double-check deployment volumeMount paths match new config structure before rollout.
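
A pre-rollout check along these lines could have caught the rename. It is only a sketch: it assumes the ConfigMap is mounted as a volume without an items remapping, so each key becomes a file name under the mount path. The helper names and the ConfigMap name in the usage line are hypothetical.

```shell
# configmap_keys NAME [NAMESPACE]: print one key (file name) per line.
configmap_keys() {
  kubectl get configmap "$1" -n "${2:-default}" \
    -o go-template='{{range $k, $v := .data}}{{$k}}{{"\n"}}{{end}}'
}

# configmap_has_key NAME NAMESPACE KEY: succeed only if KEY exists exactly.
configmap_has_key() {
  configmap_keys "$1" "$2" | grep -Fxq -- "$3"
}

# Usage (ConfigMap name is a placeholder):
#   configmap_has_key auth-api-config prod default.json ||
#     echo "default.json is missing from the ConfigMap" >&2
```

Run it for every file name the application loads at startup, before the deployment is rolled.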

( 08 )Reading Between the Lines of Pod Events

The Events section from kubectl describe pod isn’t just noise: look for Back-off restarting failed container and probe failure messages. Repeated liveness probe failures are a hint the app is up, but unhealthy. OOMKilled means the container hit its memory limit; the limit may simply be too low, or the app may be leaking memory, so check usage trends before ruling out a code bug.

For failures that only appear on some nodes, match pod nodeName to node logs; sometimes only certain nodes have the missing secret or config.

( 09 )Probes: Friend and Foe

Liveness probes kill containers if the endpoint fails, regardless of process status. It’s common to see apps with slow startups killed because initialDelaySeconds is too short. Don’t just disable probes; tune them (try initialDelaySeconds: 20, periodSeconds: 10 as a starting point).

Remember: readiness probes only gate service traffic. Liveness probes actually restart the container, so focus your debugging there if you see CrashLoopBackOff.
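
Probe timings can be changed without editing the manifest by generating a small patch. A minimal sketch, assuming the container already defines a liveness probe (the patch only changes the two timing fields); liveness_patch is a hypothetical helper, and the deployment and container names are placeholders.

```shell
# liveness_patch CONTAINER INITIAL_DELAY PERIOD: print a strategic merge patch
# that changes only the two timing fields of an existing liveness probe.
liveness_patch() {
  printf '{"spec":{"template":{"spec":{"containers":[{"name":"%s","livenessProbe":{"initialDelaySeconds":%d,"periodSeconds":%d}}]}}}}\n' \
    "$1" "$2" "$3"
}

# Usage (deployment and container names are placeholders):
#   kubectl patch deployment <dep> -p "$(liveness_patch <container> 20 10)"
```

Patching the deployment triggers a new rollout, so carry the same values back into the manifest afterwards or the next deploy will revert them.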

( 10 )Short-Lived Containers: Recovery Tactics

When the container exits in under a second, standard kubectl logs often misses the only useful output. Always try kubectl logs <pod> --previous (or on the ReplicaSet directly).

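If logs are still empty, start a debug pod (kubectl run -i --rm --tty debug --image=busybox -- sh) and try to cat any expected config or secret files. Note that kubectl run on its own does not attach the crashing pod's volumes; add the same volumes and mounts through --overrides or a pod manifest.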
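
On clusters and kubectl versions that support kubectl debug, copying the crashing pod is often less work than rebuilding its mounts by hand. The sketch below wraps that command; debug_copy is a hypothetical helper name, and it assumes the image contains sh.

```shell
# debug_copy POD CONTAINER [NAMESPACE]: create a copy of the crashing pod with
# the named container's command replaced by a shell. The copy keeps the
# original pod's volumes and environment.
debug_copy() {
  kubectl debug "$1" -n "${3:-default}" -it \
    --copy-to="$1-debug" --container="$2" -- sh
}

# Clean up afterwards (the copy is a normal pod and is not removed on exit):
#   kubectl delete pod <pod>-debug -n <namespace>
```

Inside the shell, check that the expected config and secret files exist at the paths the application reads.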

( 11 )Cluster-Wide Patterns and Rollout Hazards

If many pods across namespaces crash simultaneously, suspect a shared ConfigMap/Secret or a global policy change. Check for spikes in updates with kubectl get events -A --sort-by='.lastTimestamp' (-A covers all namespaces).
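
To see how widespread the problem is, back-off events can be counted per namespace. This is a sketch with one caveat: the BackOff event reason is also used for image-pull back-off, so treat the counts as a pointer rather than proof. The function name is illustrative.

```shell
# backoff_by_namespace: count BackOff events per namespace, busiest first.
backoff_by_namespace() {
  kubectl get events --all-namespaces \
    --field-selector reason=BackOff \
    --no-headers -o custom-columns=NS:.metadata.namespace |
    awk '{ count[$1]++ } END { for (ns in count) print count[ns], ns }' |
    sort -rn
}
```

If one namespace dominates, look at its own recent changes; an even spread across namespaces points toward something shared.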

Admission controllers and mutating webhooks can mangle pod specs; check for annotations or mutations applied at admission time that could invalidate volume mounts or resource requests.

Frequently asked questions

Why does the pod go into CrashLoopBackOff instead of just restarting normally?

Kubernetes uses exponential back-off for crashing containers to reduce resource thrash. When a container keeps exiting shortly after it starts, the kubelet waits progressively longer before each restart, and the pod reports CrashLoopBackOff while it waits.
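
The timing can be pictured with a small calculation. The sketch assumes the kubelet defaults described in the Kubernetes documentation (a 10-second initial delay that doubles on each crash, capped at 300 seconds); clusters with tuned settings may differ.

```shell
# backoff_schedule N: print the delay before each of the first N restarts,
# assuming a 10s initial delay that doubles up to a 300s cap.
backoff_schedule() {
  delay=10
  cap=300
  i=1
  while [ "$i" -le "$1" ]; do
    echo "restart $i: wait ${delay}s"
    delay=$((delay * 2))
    if [ "$delay" -gt "$cap" ]; then
      delay=$cap
    fi
    i=$((i + 1))
  done
}
```

Under these assumptions the sixth restart already waits the full five minutes, which is why a pod that has been crashing for a while appears to sit idle between attempts.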

How can I tell if it's a probe issue or an app crash?

kubectl describe pod shows whether restarts are due to liveness probe failures (look for events like 'Liveness probe failed: HTTP probe failed with statuscode: 503') or a direct process exit (Last State: Terminated, with a reason such as OOMKilled or Error and an exit code).
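
The same facts can be pulled in a compact form and mapped to a likely cause. The interpretation table below is a rough guide that follows the usual 128-plus-signal convention, not an exhaustive list, and both helper names are hypothetical.

```shell
# last_termination POD [NAMESPACE]: print name, restarts, reason, and exit code
# of each container's previous run, tab-separated.
last_termination() {
  kubectl get pod "$1" -n "${2:-default}" -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}

# explain_exit_code CODE: rough interpretation of common exit codes.
explain_exit_code() {
  case $1 in
    0)   echo "clean exit: process finished; check command/args" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found: wrong entrypoint or missing binary" ;;
    137) echo "SIGKILL: OOM kill or forced kill after the grace period" ;;
    143) echo "SIGTERM: asked to stop, e.g. after a failed liveness probe" ;;
    *)   echo "application error: read the --previous logs" ;;
  esac
}
```

An exit code of 137 with reason OOMKilled points at memory; 137 or 143 alongside 'Liveness probe failed' events points at the probe.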

Logs are empty for my CrashLoopBackOff pod. What now?

Try kubectl logs <pod> --previous, as rapid restarts cycle logs. If that's empty, reconstruct the container startup in a debug pod with the same mounts, or attach directly to the container process if possible.

Can I force the pod to keep running for debugging?

Yes, temporarily replace the entrypoint/CMD with 'sleep 3600' in the deployment, then exec in and inspect filesystem, env, and config. Don't leave this in production.