What this usually means
Init containers run sequentially before any app container starts. When they don't complete, the pod never becomes Ready. The root cause is almost never a Kubernetes bug—it's something inside the init container itself: a command that hangs (like a network call waiting forever), a binary that crashes because of missing dependencies or environment variables, a resource limit that's too low (especially memory), or a volume mount that's missing or has wrong permissions. I've also seen init containers that rely on a service that isn't up yet, creating a startup deadlock because the init container retries forever. The non-obvious part is that kubectl describe shows the container state, but the real signal is in the init container's logs and the exit code.
The first ten minutes — establish facts before touching code.
1. kubectl get pods -n <namespace> | grep Init to see which pods are stuck
2. kubectl describe pod <pod> -n <namespace> | grep -A 10 Init to get the init container state and exit code
3. kubectl logs <pod> -c <init-container-name> -n <namespace> --tail=50 to get the last logs
4. kubectl logs <pod> -c <init-container-name> -n <namespace> --previous to see logs from the last crash if it restarted
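The four steps above can be wrapped in two small helpers. This is a sketch, not a standard tool: the function names are made up for illustration, and it assumes kubectl is already pointed at the right cluster. Filtering on the STATUS column instead of a bare grep avoids false hits on pods whose names happen to contain "Init".

```shell
# List pods whose STATUS column starts with "Init:" (Init:0/1, Init:Error,
# Init:CrashLoopBackOff, ...). Columns are NAME READY STATUS RESTARTS AGE.
stuck_init_pods() (
  namespace="$1"
  kubectl get pods -n "$namespace" --no-headers | awk '$3 ~ /^Init:/ { print $1 }'
)

# Collect the facts for one pod: init container state, then current and
# previous logs. The third argument is the init container name from the spec.
init_triage() (
  namespace="$1"; pod="$2"; container="$3"
  kubectl describe pod "$pod" -n "$namespace" | grep -A 10 'Init Containers:'
  kubectl logs "$pod" -c "$container" -n "$namespace" --tail=50
  # --previous fails when there was no earlier attempt, so do not abort on it.
  kubectl logs "$pod" -c "$container" -n "$namespace" --tail=50 --previous || true
)
```

Typical use would be stuck_init_pods <namespace> first, then init_triage <namespace> <pod> <init-container-name> for each pod it prints.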
The specific files, logs, configs, and dashboards that usually own this bug.
- kubectl describe pod output: Init Containers state, exit code, reason, and last state
- Init container logs via kubectl logs -c <init-container-name>
- Pod events with kubectl get events --field-selector involvedObject.name=<pod>
- Deployment or StatefulSet spec to check init container definition (command, args, env)
- ConfigMaps and Secrets referenced by env vars or volume mounts in the init container
- Resource requests/limits on the init container (a memory limit that is too low causes an OOM kill; a CPU limit that is too low causes throttling and slow or timed-out init steps)
Practical causes, not theory. These are the things you will actually find.
- Init container entrypoint command fails silently (e.g., missing binary, wrong path)
- Init container depends on a network service that hasn't started (e.g., database, API)
- Memory limit set too low, causing OOM kill (exit code 137)
- Volume mount exists but is read-only or has wrong permissions
- Environment variable references a missing ConfigMap or Secret key
- Init container script has an infinite loop or blocking call (e.g., tail -f without timeout)
- Image pull failure due to wrong tag or private registry credentials
Concrete fix directions. Pick the one that matches your root cause.
- Add a timeout to the init container's command (e.g., timeout 30 ./script.sh)
- Set explicit resource requests and limits for memory (start with 256Mi, adjust up)
- Validate all environment variables are present and correct in the ConfigMap/Secret
- Change the init container image to a debug image (busybox) and run a simple sleep to test the setup
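- Ensure the dependency service is up before the init container runs; probes are only an option if the container is a native sidecar (an init container declared with restartPolicy: Always), not an ordinary init container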
- Use a dedicated init container that retries with exponential backoff instead of infinite retry
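For the first fix direction, here is a minimal sketch of a deadline wrapper around the timeout command mentioned above. It assumes GNU coreutils timeout, which exits with status 124 when the deadline is hit; other implementations may use a different status, so check the one in your image. The function name run_with_deadline is hypothetical.

```shell
# Run a command with a hard deadline so the init container fails fast with a
# clear message instead of hanging in Init:0/1.
run_with_deadline() (
  seconds="$1"; shift
  status=0
  timeout "$seconds" "$@" || status=$?
  # GNU timeout reports an expired deadline as exit status 124.
  if [ "$status" -eq 124 ]; then
    echo "init step timed out after ${seconds}s: $*" >&2
  fi
  exit "$status"
)
```

In an init container script this would look like run_with_deadline 60 ./migrate.sh, where ./migrate.sh stands in for whatever the container actually runs. The non-zero exit makes the failure visible in kubectl describe instead of leaving the pod silently stuck.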
A fix you cannot prove is a guess. Close the loop.
- kubectl get pods shows Init:0/1 → PodInitializing → Running after the fix (Init:N/M counts completed init containers, so a single init container never displays Init:1/1)
- kubectl logs <pod> -c <init-container-name> shows clean exit (last line: 'done' or similar)
- kubectl describe pod shows Init container state: Terminated with reason: Completed
- Pod becomes Ready and receives traffic (check endpoint slices)
- No new events with type Warning for the init container
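The "Terminated with reason: Completed" check can be scripted so it runs in CI or a post-deploy hook. A sketch, assuming ordinary run-to-completion init containers (a native sidecar stays in the running state and would be reported as not completed); the function name is made up for illustration.

```shell
# Succeeds only if every init container in the pod terminated with reason
# "Completed". Prints the offending "name=reason" pairs otherwise.
init_containers_completed() (
  namespace="$1"; pod="$2"
  reasons=$(kubectl get pod "$pod" -n "$namespace" \
    -o jsonpath='{range .status.initContainerStatuses[*]}{.name}={.state.terminated.reason}{"\n"}{end}')
  if [ -z "$reasons" ]; then
    echo "no init container statuses found for $pod" >&2
    exit 1
  fi
  # A container that is still running or waiting has no terminated reason,
  # so it shows up as "name=" and is caught by the same filter.
  failed=$(printf '%s\n' "$reasons" | grep -v '=Completed$' || true)
  if [ -n "$failed" ]; then
    echo "init containers not completed: $failed" >&2
    exit 1
  fi
)
```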
Things that make this bug worse or harder to find.
- Looking at app container logs instead of init container logs (they are separate)
- Assuming the init container works because it works in local Docker run (K8s networking differs)
- Setting the memory limit too low without checking actual peak usage (use kubectl top pod --containers)
- Not using the --previous flag when the init container restarted and the current logs are from the retry
- Forgetting that init containers run sequentially: one failing blocks all later ones
Init container stuck for 45 minutes in production because of a missing timeout
Timeline
- 14:03 PagerDuty alert: 50% of pods in 'pending' state for service 'billing-api'
- 14:05 kubectl get pods shows all billing-api pods stuck in Init:0/1
- 14:07 kubectl describe pod shows init container 'db-migrate' in CrashLoopBackOff
- 14:10 kubectl logs -c db-migrate --previous shows 'connection refused' to PostgreSQL
- 14:15 Confirmed PostgreSQL is up but connection string pointed to 'localhost' instead of RDS endpoint
- 14:18 Updated ConfigMap with correct DB_HOST and redeployed
- 14:22 Pods transition to PodInitializing then Running. Alert resolved.
We had just rolled out a new Helm chart for billing-api. Within minutes, PagerDuty lit up. All new pods were stuck in Init after rolling update. The old pods were still serving traffic, but the deployment was blocked. I SSH'd into a node and ran kubectl describe. The init container 'db-migrate' was in CrashLoopBackOff with exit code 1. Logs from the previous run showed 'dial tcp 127.0.0.1:5432: connect: connection refused'. The init container was trying to reach PostgreSQL on localhost, but our database is an RDS instance.
I checked the ConfigMap for the service—it had DB_HOST set to 'billing-db.default.svc.cluster.local' from an old version, but the new chart referenced 'localhost' because the deployment had a sidecar proxy that wasn't deployed yet. The fix was simple: update the ConfigMap to point to the RDS endpoint and redeploy. But the real issue was that the migration script had no timeout—it retried forever with a 2-second sleep, so it would have stayed stuck until someone killed the pods manually.
After the ConfigMap change, the init container connected to PostgreSQL, ran the migration in 3 seconds, and exited. Pods went to Running. We added a timeout of 60 seconds to the migration script and a backoff cap. Also added an environment variable validation step in the init container to fail fast if required vars are missing. That incident taught me to always include timeout logic in init containers and to never assume cluster-internal service names are correct.
Root cause
Init container's migration script had no network timeout and was configured with the wrong database hostname (localhost instead of RDS endpoint), causing it to retry indefinitely.
The fix
Updated ConfigMap with correct DB_HOST and added a 60-second timeout to the migration command.
The lesson
Always add timeouts to init container commands and validate critical environment variables at startup.
Init containers behave much like regular containers, except that each one must run to completion before the next one starts, and all of them must succeed before any app container starts. They share the same failure modes: they can be OOMKilled (exit 137), they can fail with non-zero exit codes, and the kubelet restarts them when the pod's restart policy is Always or OnFailure. With restart policy Never, a failed init container marks the whole pod as Failed. The key difference is that the app containers never start, and the pod never becomes Ready, until every init container has exited with code 0.
When debugging, always check the exit code. Exit code 137 (128 + 9, SIGKILL) usually means the container was OOM-killed; confirm it with 'Reason: OOMKilled'. Exit code 143 (128 + 15, SIGTERM) means the container was asked to stop, for example because the pod was deleted or evicted. Exit code 1 usually means a script or application error. Use kubectl describe and look for 'State: Terminated' with 'Reason: Error' or 'Reason: OOMKilled'. The 'Last State' field shows the previous run's details, which is critical for CrashLoopBackOff.
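As a quick reference, the exit codes above can be turned into a lookup. This is a sketch with a hypothetical function name; codes 126 and 127 are the usual shell conventions for "not executable" and "command not found", added here because they match the missing-binary and wrong-path causes listed earlier. Codes above 128 are decoded as 128 plus a signal number.

```shell
# Translate a container exit code into a likely cause.
describe_exit_code() (
  code="$1"
  case "$code" in
    0)   echo "completed successfully" ;;
    126) echo "command found but not executable: check file permissions" ;;
    127) echo "command not found: check the binary path and the image contents" ;;
    137) echo "killed by SIGKILL (128+9): often the OOM killer, confirm with Reason: OOMKilled" ;;
    143) echo "terminated by SIGTERM (128+15): pod deleted, evicted, or otherwise stopped" ;;
    *)
      if [ "$code" -gt 128 ]; then
        echo "killed by signal $((code - 128))"
      else
        echo "application error: read the init container logs"
      fi
      ;;
  esac
)
```

The hint is only a starting point; the Reason field in kubectl describe and the container logs remain the real evidence.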
Init containers often perform heavy tasks like database migrations, data downloads, or asset compilation. If you don't set resource limits, Kubernetes can overcommit and the node may evict the pod. But if you set limits too low, the init container will get OOMKilled repeatedly. I've seen teams set 128Mi memory limit on a migration that needed 512Mi. The init container would start, hit the limit, get killed, restart, and repeat forever.
To diagnose, use kubectl top pod <pod> --containers to see actual usage while the init container is running. Once it has exited, top has nothing to report. Instead, look at the 'Last State' in kubectl describe: exit code 137 with 'Reason: OOMKilled' confirms an OOM kill. Then raise the memory limit to at least 2x the peak usage observed in a local run. Be careful with CPU limits as well: a tight CPU limit throttles bursty init tasks and slows them down, so set it generously or rely on a CPU request instead.
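When top has nothing to show, the previous run's termination details can be read straight from the pod status. A sketch using kubectl's jsonpath output; the function names are hypothetical, and the field path assumes the standard pod status layout (status.initContainerStatuses[].lastState.terminated).

```shell
# Print "<reason> <exitCode>" for the previous run of one init container,
# for example "OOMKilled 137". Both fields are empty if there was no
# previous run.
last_init_termination() (
  namespace="$1"; pod="$2"; container="$3"
  status=".status.initContainerStatuses[?(@.name==\"$container\")].lastState.terminated"
  kubectl get pod "$pod" -n "$namespace" \
    -o jsonpath="{$status.reason} {$status.exitCode}"
)

# Succeeds if the previous run of the init container was OOM-killed.
init_was_oom_killed() (
  case "$(last_init_termination "$@")" in
    OOMKilled*) exit 0 ;;
    *)          exit 1 ;;
  esac
)
```

A check like init_was_oom_killed <namespace> <pod> <init-container-name> is a cheap way to decide whether to look at memory limits before reading any logs.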
A common pattern is an init container that waits for a service (e.g., database, cache) to be available. If that service is also deployed via Kubernetes and hasn't started yet (e.g., during a fresh deploy), the init container can block indefinitely. This creates a circular dependency if the service depends on the pod that the init container is part of.
The fix is to give the init container's retry logic a finite timeout and a reasonable backoff. Use tools like 'curl --retry 5 --retry-delay 5' (add --retry-connrefused if the service may not be listening yet, because curl does not retry refused connections by default) or wrap the script in 'timeout 60'. Also make sure the service the init container depends on is deployed first. Adding init containers to that service does not help; break the cycle by running the dependency as a separate Deployment or StatefulSet that must be healthy before the dependent deployment is created.
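A bounded retry loop with exponential backoff might look like the sketch below. The function name and argument order are assumptions for illustration; the point is that the loop gives up after a fixed number of attempts, so the init container exits non-zero and shows up as a failure instead of waiting forever.

```shell
# Usage: retry_with_backoff <max_attempts> <initial_delay_s> <max_delay_s> <command...>
retry_with_backoff() (
  max_attempts="$1"; delay="$2"; max_delay="$3"; shift 3
  attempt=1
  while true; do
    if "$@"; then
      exit 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      exit 1
    fi
    echo "attempt $attempt failed, retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    # Double the delay each round, but never exceed the cap.
    delay=$((delay * 2))
    if [ "$delay" -gt "$max_delay" ]; then
      delay="$max_delay"
    fi
  done
)
```

For a PostgreSQL dependency, something like retry_with_backoff 6 2 30 pg_isready -h "$DB_HOST" -p 5432 -t 5 would wait roughly a minute in total before failing, assuming pg_isready is present in the image.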
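Init containers often rely on environment variables from ConfigMaps or Secrets. If a referenced ConfigMap, Secret, or key does not exist and the reference is not marked optional, the container is never created and the pod reports CreateContainerConfigError. If the reference is optional, or the value exists but is empty or wrong, the init container starts and fails only when it tries to use the variable. For example, a shell script that does 'set -u' will exit as soon as it expands an undefined variable.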
To check, run kubectl describe configmap <name> and kubectl describe secret <name>. Also look at the pod's spec under spec.initContainers[].env. Use kubectl exec into a debug pod and echo the variable to confirm its value. Another trick: add a step in the init container that checks all required env vars are non-empty and fails with a clear message.
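The fail-fast check from the last sentence can be a few lines at the top of the init script. A minimal sketch in POSIX shell; require_env is a hypothetical helper name, and the variable names in the usage line are examples only. It uses eval for indirect lookup, so pass only trusted, literal variable names.

```shell
# Fail with one clear message listing every required variable that is unset
# or empty, instead of failing later with a confusing connection error.
require_env() (
  missing=""
  for name in "$@"; do
    # Indirect expansion: read the value of the variable whose name is in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="$missing $name"
    fi
  done
  if [ -n "$missing" ]; then
    echo "missing required environment variables:$missing" >&2
    exit 1
  fi
)
```

Calling require_env DB_HOST DB_PORT || exit 1 as the first step makes a missing or empty value visible in the init container logs immediately.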
Frequently asked questions
How do I see the logs of an init container that has already terminated?
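Use kubectl logs <pod> -c <init-container-name> --previous. This shows the logs from the previous terminated instance of the container (the one that crashed). Without --previous you get the logs of the current instance, which may be empty or incomplete if the container has just restarted. If the container has terminated and has not restarted yet, plain kubectl logs without --previous returns its output.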
Can I set a liveness probe on an init container?
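Not on an ordinary init container: it is expected to run to completion, so Kubernetes does not allow liveness or readiness probes on it. If you need a health check, run the check as part of the init container's script and exit non-zero on failure. The exception is a native sidecar (an init container declared with restartPolicy: Always), which keeps running alongside the app containers and does accept probes. Use that sidecar pattern instead of a plain init container for long-running setup tasks.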
What does 'Init:CrashLoopBackOff' mean?
It means the init container is repeatedly crashing (exiting with non-zero code), and Kubernetes is backing off before restarting it. The backoff doubles each time (10s, 20s, 40s... up to 5 minutes). This is a clear sign of a bug in the init container's command or configuration.
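To get a feel for how long a pod can sit in this state, the doubling schedule described above (10s start, 5 minute cap) can be printed with a few lines of shell. This is plain arithmetic on the numbers in the answer, not a query against the cluster, and the function name is made up.

```shell
# Print the backoff delay in seconds before each of the first N restarts,
# doubling from 10s and capping at 300s (5 minutes).
crashloop_backoff_schedule() (
  restarts="$1"
  delay=10
  cap=300
  i=1
  while [ "$i" -le "$restarts" ]; do
    echo "$delay"
    delay=$((delay * 2))
    if [ "$delay" -gt "$cap" ]; then
      delay="$cap"
    fi
    i=$((i + 1))
  done
)
```

For seven restarts this prints 10, 20, 40, 80, 160, 300, 300, which adds up to 910 seconds of waiting alone. That is why a crash-looping init container can look "stuck" for a quarter of an hour even though each attempt fails within seconds.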
How do I debug an init container that hangs without crashing?
If the init container is stuck in Init:0/1 but not restarting, it's likely hanging. While it is still running you can exec into it with kubectl exec -it <pod> -c <init-container-name> -- sh, provided the image ships a shell. If it does not, use kubectl logs -f to stream logs, or check the container's resource usage from the node (e.g., crictl stats). If you have access to the node, use 'crictl ps' to find the container and 'crictl logs' to see its output. Or redeploy with a debug init container that sleeps and exec into that.
Why does my init container work locally but not in Kubernetes?
Common differences: environment variables (missing or different), networking (localhost vs service DNS), file permissions on mounted volumes, or resource limits (local Docker has no limits by default). Always run the init container with the same image and command in a pod with similar resource constraints to replicate.