What this usually means
CrashLoopBackOff means a container in the pod keeps exiting and the kubelet is delaying each restart with exponential back-off. The root cause can be anything that makes the container process exit (usually non-zero) or get killed: misconfigurations, missing secrets, application-level panics, failing liveness probes, or system-level issues like resource limits. It's not just about the app failing: Kubernetes also restarts a container that is technically running but not passing its liveness checks.
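To make the back-off concrete, here is a minimal sketch of the restart delays. It assumes the kubelet's documented defaults (10 seconds initially, doubling each time, capped at 300 seconds); the function name `backoff_delays` is made up for illustration.

```python
def backoff_delays(restarts, base=10, cap=300):
    """Approximate wait (in seconds) before each of the first `restarts` restarts.

    Assumes the documented kubelet defaults: start at 10s, double, cap at 300s.
    """
    return [min(base * 2 ** i, cap) for i in range(restarts)]


print(backoff_delays(7))  # [10, 20, 40, 80, 160, 300, 300]
```

In practice this means a pod that has been crashing for a while restarts only about once every five minutes, so a fix can take that long to show up unless you delete the pod.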
The first ten minutes — establish facts before touching code.
1. Run: kubectl describe pod <pod> and check Events for reasons: look for probe failures, failed mounts, OOMKilled, or permission issues.
2. Run: kubectl logs <pod> --previous to get logs from the last terminated container; for a multi-container pod, repeat with -c <container> for each container.
3. Run: kubectl get rs,deploy,sts -o wide --selector=app=<label> to see whether this follows a wider config/deploy pattern.
4. Check for liveness and readiness probe definitions in the pod spec: kubectl get pod <pod> -o yaml | grep -A10 'livenessProbe\|readinessProbe'
5. Exec into a pod in init or running state (if possible): kubectl exec -it <pod> -- /bin/sh and check config file existence, secret mounts, and service endpoint reachability.
6. Check for config or secret rotations in git or your config management system in the past 24h.
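The first two steps can also be read straight from the pod's status. The sketch below assumes you have saved the output of `kubectl get pod <pod> -o json` and loaded it as a dict; `summarize_restarts` and `SAMPLE_POD` are hypothetical names, and the sample data is invented.

```python
def summarize_restarts(pod):
    """Return one summary line per container from a pod object
    (the structure printed by `kubectl get pod <pod> -o json`)."""
    lines = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        terminated = cs.get("lastState", {}).get("terminated") or {}
        waiting = cs.get("state", {}).get("waiting") or {}
        lines.append(
            "{}: restarts={} waiting={} last_reason={} exit_code={}".format(
                cs.get("name"),
                cs.get("restartCount", 0),
                waiting.get("reason", "-"),
                terminated.get("reason", "-"),
                terminated.get("exitCode", "-"),
            )
        )
    return lines


# Invented sample; real input would come from json.load() on the kubectl output.
SAMPLE_POD = {
    "status": {
        "containerStatuses": [
            {
                "name": "auth-api",
                "restartCount": 7,
                "state": {"waiting": {"reason": "CrashLoopBackOff"}},
                "lastState": {"terminated": {"reason": "Error", "exitCode": 1}},
            }
        ]
    }
}

for line in summarize_restarts(SAMPLE_POD):
    print(line)
```

The `last_reason` and `exit_code` values are the same ones that `kubectl describe pod` prints under Last State.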
The specific files, logs, configs, and dashboards that usually own this bug.
- /var/log/pods/<namespace>_<podname>_* on the node (for persistent log output)
- kubectl logs <pod> --previous and current (especially for short-lifetime pods)
- Events section in kubectl describe pod <pod> (look for specifics like 'Liveness probe failed' or 'Back-off restarting failed container')
- Pod YAML spec (kubectl get pod <pod> -o yaml) for env vars, image, command/args, probe settings
- Cluster monitoring dashboards (Grafana/Prometheus) for CPU/mem/OOM trends at crash times
- ConfigMap and Secret references in deployment YAMLs (check kubectl get configmap|secret and compare hashes)
- Admission controller/webhook logs if mutations or policies could reject or alter pod spec
Practical causes, not theory. These are the things you will actually find.
- Application segfault or panic due to missing config/env/secret
- Misconfigured liveness probe (wrong path or port, or an initialDelaySeconds too short for the app's startup)
- Container process exits immediately (wrong entrypoint/CMD, missing binary)
- Secret/config volume not mounted or recently rotated with missing/invalid data
- OOMKilled because the memory limit is too low for startup
- Startup dependencies (e.g., DB, Redis) unavailable at pod startup
- Filesystem permission errors: mount points owned by root, container running as non-root
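The exit code of the last terminated container often narrows the list above quickly. This is a rough sketch based on common shell and signal conventions (128 plus the signal number for signal deaths). The function name `explain_exit_code` and the hint texts are my own, and an exit code never proves a cause by itself.

```python
# Common signals seen in container exits; extend as needed.
SIGNAL_NAMES = {6: "SIGABRT", 9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}


def explain_exit_code(code):
    """Map a container exit code to a likely cause (a hint, not a verdict)."""
    if code == 0:
        return "exited cleanly; with restartPolicy Always it is still restarted"
    if code == 126:
        return "command found but not executable (check file permissions)"
    if code == 127:
        return "command not found (check entrypoint/CMD and image contents)"
    if 128 < code <= 128 + 64:
        signum = code - 128
        name = SIGNAL_NAMES.get(signum, "signal %d" % signum)
        return "killed by %s" % name
    return "application error (read the --previous logs)"


for code in (1, 127, 137, 139):
    print(code, "->", explain_exit_code(code))
```

Exit code 137 (SIGKILL) is what an OOM kill produces, but a container killed after a failed liveness probe can show it too, so confirm with the Reason field and the Events.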
Concrete fix directions. Pick the one that matches your root cause.
- Increase initialDelaySeconds and periodSeconds for probes to avoid early probe-triggered kills
- Patch deployment to temporarily disable liveness probe: kubectl patch deployment <dep> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container>","livenessProbe":null}]}}}}'
- Ensure required ConfigMaps and Secrets exist and are referenced with the correct keys/paths; check for recent rotations
- Bump resource requests/limits in the deployment manifest to give the container enough memory/CPU
- Add a sleep-and-retry wait to the entrypoint (or an init container) so the app waits for critical upstream services
- Change securityContext to match the file ownership (e.g., set fsGroup, chown volumes in an initContainer, or use runAsUser: 0 only as a last resort)
- Change entrypoint or CMD to match the actual available binary in the container
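If you go with probe tuning instead of disabling the probe, the patch body can be generated, which avoids hand-written JSON quoting mistakes. This is a sketch under assumptions: `probe_patch` is a hypothetical helper, the container name and values are placeholders, and it relies on kubectl's default strategic merge patch matching containers by name.

```python
import json


def probe_patch(container, initial_delay, period, failure_threshold=3):
    """Build a patch body that tunes a container's liveness probe timings.

    Only the timing fields are set; the existing probe handler (httpGet, exec,
    and so on) is expected to be kept by the strategic merge.
    """
    patch = {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {
                            "name": container,
                            "livenessProbe": {
                                "initialDelaySeconds": initial_delay,
                                "periodSeconds": period,
                                "failureThreshold": failure_threshold,
                            },
                        }
                    ]
                }
            }
        }
    }
    return json.dumps(patch)


# "auth-api" is a placeholder container name.
print(probe_patch("auth-api", initial_delay=20, period=10))
```

The printed string can be passed to kubectl patch deployment <dep> -p '<json>'. Review the resulting rollout with kubectl rollout status before assuming the probe change took effect.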
A fix you cannot prove is a guess. Close the loop.
- kubectl get pods shows the pod moving from CrashLoopBackOff to Running and Ready within a few minutes
- Container restart count stabilizes (RESTARTS column stops incrementing)
- kubectl logs <pod> no longer shows abrupt termination, segfault, or probe-failure messages
- kubectl describe pod <pod> shows no recent Events of 'Back-off restarting failed container'
- Application endpoints become available and pass liveness/readiness probes (200 OK or expected output)
- Monitoring dashboards show healthy memory/CPU usage and no spikes at former crash intervals
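The first checks in this list can be scripted against `kubectl get pods -l app=<label> -o json`. A minimal sketch, assuming that output has been loaded as a dict; `unhealthy_pods` and the sample pod names are invented for illustration.

```python
def unhealthy_pods(pod_list):
    """Return names of pods that are not Ready or have a container in CrashLoopBackOff."""
    bad = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in status.get("conditions", [])
        )
        looping = any(
            (cs.get("state", {}).get("waiting") or {}).get("reason") == "CrashLoopBackOff"
            for cs in status.get("containerStatuses", [])
        )
        if looping or not ready:
            bad.append(pod["metadata"]["name"])
    return bad


# Invented sample data with one healthy and one crashing pod.
SAMPLE_LIST = {
    "items": [
        {
            "metadata": {"name": "auth-api-aaa"},
            "status": {
                "conditions": [{"type": "Ready", "status": "True"}],
                "containerStatuses": [{"name": "auth-api", "state": {"running": {}}}],
            },
        },
        {
            "metadata": {"name": "auth-api-bbb"},
            "status": {
                "conditions": [{"type": "Ready", "status": "False"}],
                "containerStatuses": [
                    {"name": "auth-api", "state": {"waiting": {"reason": "CrashLoopBackOff"}}}
                ],
            },
        },
    ]
}

print(unhealthy_pods(SAMPLE_LIST))  # ['auth-api-bbb']
```

An empty result a few minutes after the rollout, sampled twice, is a reasonable signal that the loop is closed.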
Things that make this bug worse or harder to find.
- Blaming the application without checking probe configuration: a failing liveness probe kills the container even if the process works
- Forgetting to check --previous logs for short-lived containers (crashes may not appear in current logs)
- Ignoring differences between liveness and readiness probe failures: only liveness (and startup) probe failures restart the container
- Looking for a restart limit to raise: the kubelet keeps retrying with built-in exponential back-off (capped at five minutes), so only fixing the crash itself ends the loop
- Missing recent config/secret changes rolled out by another team or automation
- Skipping checks on image SHA/tag drift (wrong image pushed under same tag)
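Image drift under a reused tag shows up as different imageID values for the same container across pods. The sketch below assumes the output of `kubectl get pods -l app=<label> -o json` loaded as a dict; `image_ids_by_container` is a hypothetical helper and the digests are invented.

```python
from collections import defaultdict


def image_ids_by_container(pod_list):
    """Collect the distinct imageID values reported for each container name."""
    seen = defaultdict(set)
    for pod in pod_list.get("items", []):
        for cs in pod.get("status", {}).get("containerStatuses", []):
            if cs.get("imageID"):
                seen[cs["name"]].add(cs["imageID"])
    return dict(seen)


# Invented sample: same tag, two different digests.
SAMPLE_PODS = {
    "items": [
        {"status": {"containerStatuses": [
            {"name": "auth-api", "imageID": "registry.example/auth-api@sha256:aaa"}]}},
        {"status": {"containerStatuses": [
            {"name": "auth-api", "imageID": "registry.example/auth-api@sha256:bbb"}]}},
    ]
}

drifted = {
    name: ids for name, ids in image_ids_by_container(SAMPLE_PODS).items() if len(ids) > 1
}
print(sorted(drifted))  # ['auth-api']
```

More than one digest for a container is expected during a rollout; outside of one, it suggests the tag was overwritten.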
CrashLoopBackOff After ConfigMap Rotation in Production
Timeline
- 11:04 PagerDuty alert: auth-api pod stuck in CrashLoopBackOff in production
- 11:06 kubectl get pods shows 9/10 pods in CrashLoopBackOff; restarts incrementing every 90 seconds
- 11:08 kubectl logs --previous shows: "Error: Cannot find module '/app/config/default.json'"
- 11:09 kubectl describe pod shows repeated liveness probe failures
- 11:11 Checked git history: new ConfigMap committed, file moved from config/default.json to config/app.json
- 11:16 Patched ConfigMap mount path and rolled deployment
- 11:18 Pods restart, move to Running, endpoints recover
I was paged for a spike in 5xx errors and found 9 of 10 auth-api pods in CrashLoopBackOff. Restarts were happening every minute or two, so logs cycled fast.
Initial logs and describe pod output pointed to a missing config file. Scanning the config commit history, I spotted a recent ConfigMap update that changed the config file's name, but the pod spec hadn't been updated.
Once I patched the mount path and redeployed, pods launched cleanly, restart counters stabilized, and the API passed health checks within 90 seconds.
Root cause
A ConfigMap update renamed a critical config file, but the deployment manifest still pointed at the old file path, so containers exited immediately.
The fix
Patched the deployment to mount the ConfigMap at the correct path, then rolled the deployment to pick up the fix.
The lesson
Whenever ConfigMaps are updated—especially file renames—double-check deployment volumeMount paths match new config structure before rollout.
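A pre-rollout check along these lines could have caught the rename. When a ConfigMap is mounted as a volume, each key becomes a file name under the mount path, so comparing the keys to the file names the app loads is enough for a first pass. This is a sketch: `missing_config_keys` and the `expected_files` list are hypothetical, and the ConfigMap dict stands in for `kubectl get configmap <name> -o json`.

```python
def missing_config_keys(configmap, expected_files):
    """Return the expected file names that are not keys of the ConfigMap."""
    keys = set(configmap.get("data") or {}) | set(configmap.get("binaryData") or {})
    return [name for name in expected_files if name not in keys]


# Mirrors the incident: the ConfigMap now ships app.json, the app still loads default.json.
rotated_configmap = {"data": {"app.json": "{}"}}
print(missing_config_keys(rotated_configmap, ["default.json"]))  # ['default.json']
```

A non-empty result in CI is a cheap signal to block the rollout until the deployment and the ConfigMap agree.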
The Events section from kubectl describe pod isn't just noise: look for Back-off restarting failed container and probe failure messages. Repeated liveness probe failures are a hint the app is up but unhealthy. OOMKilled means the container exceeded its memory limit, so check whether the limit is undersized before assuming a code bug.
For crashes that only appear on some nodes, match pod nodeName to node logs; sometimes only certain nodes have the missing secret or config.
Liveness probes kill containers if the endpoint fails, regardless of process status. It's common to see apps with slow startups killed because initialDelaySeconds is too short. Don't just disable probes; tune them (try initialDelaySeconds: 20, periodSeconds: 10).
Remember: readiness probes only gate service traffic. Liveness probes actually restart the container, so focus your debugging there if you see CrashLoopBackOff.
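To judge whether probe timings leave enough room for startup, a rough estimate helps. The sketch assumes the first probe runs at initialDelaySeconds and the container is killed after failureThreshold consecutive failures spaced periodSeconds apart. Real kubelet timing adds jitter and probe timeouts, so treat the number as approximate. Both function names are made up.

```python
def earliest_liveness_kill(initial_delay=0, period=10, failure_threshold=3):
    """Approximate seconds after container start at which a never-passing
    liveness probe would first trigger a restart."""
    return initial_delay + (failure_threshold - 1) * period


def probe_allows_startup(startup_seconds, **probe):
    """True if the app is expected to be up before the earliest possible kill."""
    return startup_seconds < earliest_liveness_kill(**probe)


print(earliest_liveness_kill(initial_delay=20, period=10))            # 40
print(probe_allows_startup(45, initial_delay=20, period=10))          # False
print(probe_allows_startup(45, initial_delay=30, period=10))          # True
```

With the suggested initialDelaySeconds: 20 and periodSeconds: 10, an app that needs 45 seconds to start would still be killed under these assumptions, which is why the values should come from measured startup time.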
When the container exits in under a second, standard kubectl logs often misses the only useful output. Always try kubectl logs <pod> --previous (or on the ReplicaSet directly).
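If logs are still empty, start a debug pod (kubectl run -i --rm --tty debug --image=busybox -- sh) and try to cat any expected config or secret files. Note that kubectl run does not mount the crashing pod's volumes on its own; add them with --overrides, or launch the debug pod from a manifest that declares the same volumes.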
If many pods across namespaces crash simultaneously, suspect a shared ConfigMap/Secret or a global policy change. Check for spikes in updates with kubectl get events --sort-by='.lastTimestamp'.
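To see how widespread the problem is, the events can be tallied per namespace. The sketch assumes the output of `kubectl get events -A -o json` loaded as a dict and relies on the `reason` and `involvedObject.namespace` fields of core/v1 events; `backoff_events_by_namespace` and the sample namespaces are invented.

```python
from collections import Counter


def backoff_events_by_namespace(event_list):
    """Count events with reason BackOff per namespace of the involved object."""
    counts = Counter()
    for ev in event_list.get("items", []):
        if ev.get("reason") == "BackOff":
            ns = ev.get("involvedObject", {}).get("namespace", "unknown")
            counts[ns] += 1
    return counts


# Invented sample data.
SAMPLE_EVENTS = {
    "items": [
        {"reason": "BackOff", "involvedObject": {"namespace": "auth"}},
        {"reason": "BackOff", "involvedObject": {"namespace": "auth"}},
        {"reason": "BackOff", "involvedObject": {"namespace": "billing"}},
        {"reason": "Pulled", "involvedObject": {"namespace": "auth"}},
    ]
}

print(backoff_events_by_namespace(SAMPLE_EVENTS).most_common())
# [('auth', 2), ('billing', 1)]
```

BackOff events spread across several unrelated namespaces point toward a shared dependency or policy change, not a single application bug.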
Admission controllers and mutating webhooks can mangle pod specs; check for annotations or mutations applied at admission time that could invalidate volume mounts or resource requests.
Frequently asked questions
Why does the pod go into CrashLoopBackOff instead of just restarting normally?
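Kubernetes uses exponential back-off for crashing containers to reduce resource thrash. If your container keeps exiting shortly after it starts, the kubelet waits longer before each restart (10s, 20s, 40s, and so on, capped at five minutes), and the pod shows CrashLoopBackOff while it waits so you notice and fix the underlying problem.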
How can I tell if it's a probe issue or an app crash?
kubectl describe pod shows whether restarts are due to liveness probe failures (look for events like 'Liveness probe failed: HTTP probe failed with statuscode: 503') or a direct process exit (check Last State: Terminated for the Reason, such as OOMKilled or Error, and the Exit Code).
Logs are empty for my CrashLoopBackOff pod. What now?
Try kubectl logs <pod> --previous, as rapid restarts cycle logs. If that's empty, reconstruct the container startup in a debug pod with the same mounts, or attach directly to the container process if possible.
Can I force the pod to keep running for debugging?
Yes, temporarily replace the entrypoint/CMD with 'sleep 3600' in the deployment (and remove or relax the liveness probe so the idle container is not killed), then exec in and inspect filesystem, env, and config. Don't leave this in production.