Kubernetes Init Container Not Completing Fix

What this usually means

Init containers run sequentially, and no app container starts until all of them have exited successfully. If one never completes, the pod never becomes Ready. The root cause is almost never a Kubernetes bug. It is something inside the init container itself:

- a command that hangs, such as a network call with no timeout;
- a binary that crashes because of missing dependencies or environment variables;
- a resource limit that is too low, especially memory;
- a volume mount that is missing or has the wrong permissions.

I've also seen init containers that wait on a service that isn't up yet and retry forever, creating a startup deadlock. The non-obvious part: kubectl describe shows the container state, but the real signal is in the init container's logs and its exit code.

( 01 ) Fast diagnosis

The first ten minutes: establish facts before touching code.

1. See which pods are stuck: kubectl get pods -n <namespace> | grep Init
2. Get each init container's state and exit code: kubectl describe pod <pod> -n <namespace> | grep -A 10 Init
3. Read the last log lines: kubectl logs <pod> -c <init-container-name> -n <namespace> --tail=50
4. If it restarted, read the logs from the previous (crashed) run: kubectl logs <pod> -c <init-container-name> -n <namespace> --previous
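The same facts can be pulled from the pod's JSON, which helps when several init containers are involved. The sketch below is a minimal Python version and assumes you saved the output of kubectl get pod <pod> -n <namespace> -o json to a file. The function name summarize_init_statuses is hypothetical. It reads status.initContainerStatuses and prints, for each init container, its current state and what kubectl describe calls "Last State".

```python
def summarize_init_statuses(pod: dict) -> list[str]:
    """Return one line per init container: current state, restarts, and last termination."""
    lines = []
    for status in pod.get("status", {}).get("initContainerStatuses", []):
        state = status.get("state", {})
        if "terminated" in state:
            t = state["terminated"]
            current = f"terminated reason={t.get('reason')} exit={t.get('exitCode')}"
        elif "waiting" in state:
            current = f"waiting reason={state['waiting'].get('reason')}"
        elif "running" in state:
            current = f"running since {state['running'].get('startedAt')}"
        else:
            current = "state unknown"
        line = f"{status['name']}: {current} restarts={status.get('restartCount', 0)}"
        # lastState.terminated is what kubectl describe prints as "Last State".
        last = status.get("lastState", {}).get("terminated")
        if last:
            line += f" | last: reason={last.get('reason')} exit={last.get('exitCode')}"
        lines.append(line)
    return lines
```

For example, after kubectl get pod <pod> -n <namespace> -o json > pod.json, call summarize_init_statuses(json.load(open("pod.json"))). A CrashLoopBackOff then shows up as a line such as "db-migrate: waiting reason=CrashLoopBackOff restarts=3 | last: reason=Error exit=1".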

( 02 ) Where to look

The specific files, logs, configs, and dashboards that usually own this bug.

- kubectl describe pod output: the Init Containers section (State, Reason, Exit Code, Last State)
- Init container logs: kubectl logs <pod> -c <init-container-name>
- Pod events: kubectl get events -n <namespace> --field-selector involvedObject.name=<pod>
- The Deployment or StatefulSet spec: the init container's command, args, and env
- ConfigMaps and Secrets referenced by the init container's env vars or volume mounts
- Resource requests and limits on the init container. A memory limit that is too low causes an OOM kill. A tight CPU limit only slows the container down through throttling.
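To collect these places in one pass, the Python sketch below walks a pod spec and reports three things:

- every ConfigMap and Secret the init containers reference through env, envFrom, or mounted volumes;
- the key name, for single-key env references;
- each init container that has no memory limit.

For a Pod, pass its .spec. For a Deployment or StatefulSet, pass .spec.template.spec. The function name audit_init_containers and the object names in the example are hypothetical.

```python
def audit_init_containers(spec: dict) -> dict:
    """List ConfigMaps/Secrets the init containers depend on, plus init containers with no memory limit."""
    report = {"configmaps": set(), "secrets": set(), "no_memory_limit": []}
    mounted = set()
    for c in spec.get("initContainers", []):
        # Single keys pulled in with env[].valueFrom, recorded as name/key.
        for env in c.get("env", []):
            source = env.get("valueFrom", {})
            if "configMapKeyRef" in source:
                ref = source["configMapKeyRef"]
                report["configmaps"].add(f"{ref['name']}/{ref['key']}")
            if "secretKeyRef" in source:
                ref = source["secretKeyRef"]
                report["secrets"].add(f"{ref['name']}/{ref['key']}")
        # Whole objects pulled in with envFrom.
        for source in c.get("envFrom", []):
            if "configMapRef" in source:
                report["configmaps"].add(source["configMapRef"]["name"])
            if "secretRef" in source:
                report["secrets"].add(source["secretRef"]["name"])
        mounted.update(m["name"] for m in c.get("volumeMounts", []))
        if "memory" not in c.get("resources", {}).get("limits", {}):
            report["no_memory_limit"].append(c["name"])
    # Volumes are declared at pod level; keep only the ones an init container mounts.
    for vol in spec.get("volumes", []):
        if vol["name"] not in mounted:
            continue
        if "configMap" in vol:
            report["configmaps"].add(vol["configMap"]["name"])
        if "secret" in vol:
            report["secrets"].add(vol["secret"]["secretName"])
    return report
```

Each name in the report can then be checked with kubectl describe configmap <name> or kubectl describe secret <name> in the pod's namespace.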

( 03 ) Common root causes

Practical causes, not theory. These are the things you will actually find.

- Init container command fails at startup (e.g., missing binary, wrong path). The error often shows up as exit code 127 from a shell, or as a runtime message in kubectl describe rather than in the logs.
- Init container depends on a network service that hasn't started (e.g., database, API).
- Memory limit set too low, causing an OOM kill (exit code 137, reason OOMKilled).
- Volume mount exists but is read-only or has the wrong permissions.
- Environment variable references a missing ConfigMap or Secret, or a missing key. The pod status shows Init:CreateContainerConfigError.
- Init container script has an infinite loop or blocking call (e.g., tail -f without a timeout).
- Image pull fails because of a wrong tag or missing private registry credentials (Init:ErrImagePull or Init:ImagePullBackOff).
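The STATUS column from kubectl get pods usually narrows this list down before you read any logs. The Python sketch below is a rough triage table. The mapping is this guide's heuristic, not something kubectl provides. A status of Init:N/M with no reason means N of M init containers have finished and the next one is still running or starting.

```python
import re

# Heuristic mapping from the reason in "Init:<reason>" to a likely cause (not part of kubectl).
LIKELY_CAUSES = {
    "CrashLoopBackOff": "init container keeps exiting non-zero; read kubectl logs --previous",
    "Error": "init container exited non-zero; check the exit code and logs",
    "OOMKilled": "memory limit too low",
    "CreateContainerConfigError": "missing ConfigMap/Secret or key referenced in env",
    "ErrImagePull": "wrong image tag or missing registry credentials",
    "ImagePullBackOff": "wrong image tag or missing registry credentials",
}


def triage(status: str) -> str:
    """Suggest a likely root cause for a pod STATUS such as 'Init:CrashLoopBackOff'."""
    if not status.startswith("Init:"):
        return "not blocked on an init container"
    detail = status[len("Init:"):]
    if re.fullmatch(r"\d+/\d+", detail):
        # Init:N/M: N init containers done, the next one running. Staying here suggests a hang.
        return "still running; if it stays here, suspect a hang (blocking call or unreachable dependency)"
    return LIKELY_CAUSES.get(detail, "unrecognized; read kubectl describe pod")
```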

( 04 ) Fix patterns

Concrete fix directions. Pick the one that matches your root cause.

- Add a timeout to the init container's command (e.g., timeout 30 ./script.sh).
- Set explicit memory requests and limits. Start with 256Mi and adjust up based on observed usage.
- Validate that every environment variable is present and correct in the ConfigMap or Secret.
- Temporarily replace the init container with a debug image (e.g., busybox) running a simple sleep. Then exec in to test env, mounts, and network.
- Make sure the dependency service is up before the pod starts. Regular init containers don't accept probes. If the setup task should keep running next to the app, use a native sidecar instead: an init container with restartPolicy: Always, on by default since Kubernetes 1.29. Give that sidecar startup or readiness probes.
- Retry with exponential backoff and a cap on attempts or total time, instead of retrying forever.
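The backoff-with-a-cap pattern from the last item, sketched in Python. The function retry_with_backoff and its defaults are illustrative assumptions: 6 attempts, with delays doubling from 1s up to a 30s cap. The important part is that the final failure is re-raised. The init container then exits non-zero and Kubernetes surfaces the error, instead of the pod sitting in Init:0/1 forever. The sleep parameter is injectable only to make the function easy to test.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 6,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn until it succeeds, doubling the wait after each failure up to max_delay.

    After max_attempts failures the last exception is re-raised, so the init
    container exits non-zero instead of retrying forever.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_attempts:
                raise
            print(f"attempt {attempt}/{max_attempts} failed: {exc}; retrying in {delay:g}s", flush=True)
            sleep(delay)
            delay = min(delay * 2, max_delay)
    raise RuntimeError("unreachable")
```

With the defaults, the worst case sleeps 1 + 2 + 4 + 8 + 16 = 31 seconds before giving up, plus however long the calls themselves take.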

( 05 ) How to verify

A fix you cannot prove is a guess. Close the loop.

- kubectl get pods shows Init:0/1 → PodInitializing (often only briefly) → Running after the fix.
- kubectl logs <pod> -c <init-container-name> shows a clean exit (e.g., a final 'done' line or similar).
- kubectl describe pod shows the init container in State: Terminated with Reason: Completed and Exit Code: 0.
- The pod becomes Ready and receives traffic. Check that its IP appears in kubectl get endpointslices -l kubernetes.io/service-name=<service>.
- No new Warning events appear for the pod: kubectl get events --field-selector involvedObject.name=<pod>,type=Warning.
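The checks above can be scripted against kubectl get pod <pod> -o json. The Python sketch below passes only when two things hold. First, every init container terminated with reason Completed and exit code 0. Second, the pod's Initialized and Ready conditions are both True. It assumes the pod has no native sidecars (init containers with restartPolicy: Always), because those keep running by design and would fail this check. The function name is hypothetical.

```python
def init_completed_and_ready(pod: dict) -> bool:
    """True when all init containers completed cleanly and the pod is Initialized and Ready."""
    status = pod.get("status", {})
    for s in status.get("initContainerStatuses", []):
        term = s.get("state", {}).get("terminated")
        if not term or term.get("reason") != "Completed" or term.get("exitCode") != 0:
            return False
    # Pod conditions are a list of {"type": ..., "status": "True"/"False"}.
    conditions = {c["type"]: c["status"] for c in status.get("conditions", [])}
    return conditions.get("Initialized") == "True" and conditions.get("Ready") == "True"
```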

( 06 ) Mistakes to avoid

Things that make this bug worse or harder to find.

- Looking at app container logs instead of init container logs. They are separate, so pass -c <init-container-name>.
- Assuming the init container works in Kubernetes because it works with a local docker run. Cluster networking, env, and mounts differ.
- Setting a memory limit without checking actual usage. Sample it with kubectl top pod <pod> --containers while the container runs; this requires metrics-server.
- Not using the --previous flag when the init container has restarted and the current logs are from the retry.
- Forgetting that init containers run sequentially: one failing blocks all the later ones.

( 07 ) War story

Init container stuck for 45 minutes in production because of a missing timeout

Platform Engineer, mid-size SaaS company
EKS 1.24, Helm charts, Go microservice, PostgreSQL RDS

Timeline

14:03 PagerDuty alert: 50% of pods in 'pending' state for service 'billing-api'
14:05 kubectl get pods shows all new billing-api pods stuck in Init:0/1
14:07 kubectl describe pod shows init container 'db-migrate' in CrashLoopBackOff
14:10 kubectl logs -c db-migrate --previous shows 'connection refused' to PostgreSQL
14:15 Confirmed PostgreSQL is up, but the connection string pointed to 'localhost' instead of the RDS endpoint
14:18 Updated the ConfigMap with the correct DB_HOST and redeployed
14:22 Pods transition to Init:1/1, then Running. Alert resolved.

We had just rolled out a new Helm chart for billing-api. Within minutes, PagerDuty lit up. All new pods were stuck in Init after the rolling update. The old pods were still serving traffic, but the deployment was blocked. I SSH'd into a node and ran kubectl describe. The init container 'db-migrate' was in CrashLoopBackOff with exit code 1. Logs from the previous run showed 'dial tcp 127.0.0.1:5432: connect: connection refused'. The init container was trying to reach PostgreSQL on localhost, but our database is an RDS instance.

I checked the ConfigMap for the service. It had DB_HOST set to 'billing-db.default.svc.cluster.local' from an old version, but the new chart referenced 'localhost' because the deployment expected a sidecar proxy that wasn't deployed yet. The fix was simple: update the ConfigMap to point to the RDS endpoint and redeploy. But the real issue was that the migration script had no timeout. It retried forever with a 2-second sleep, so it would have stayed stuck until someone killed the pods manually.

After the ConfigMap change, the init container connected to PostgreSQL, ran the migration in 3 seconds, and exited. Pods went to Running. We added a 60-second timeout and a backoff cap to the migration script. We also added an environment variable validation step to the init container so it fails fast if required vars are missing. That incident taught me to always include timeout logic in init containers and to never assume cluster-internal service names are correct.

Root cause

Init container's migration script had no network timeout and was configured with the wrong database hostname (localhost instead of RDS endpoint), causing it to retry indefinitely.

The fix

Updated ConfigMap with correct DB_HOST and added a 60-second timeout to the migration command.

The lesson

Always add timeouts to init container commands and validate critical environment variables at startup.

( 08 ) Understanding Init Container Lifecycle and Exit Codes

Init containers are like regular containers, with one difference: each must run to completion before the next one starts, and all of them must succeed before any app container starts. Otherwise they share the same lifecycle. They can be OOMKilled (exit code 137), they can fail with non-zero exit codes, and the kubelet restarts them when the pod's restartPolicy is Always or OnFailure. If the restartPolicy is Never, a failed init container marks the whole pod as Failed. Either way, while an init container is failing the pod never becomes Ready, because the app containers never start.

When debugging, always check the exit code:

- Exit code 137 is 128 + 9, so the process was killed with SIGKILL. If the reason is OOMKilled, it hit its memory limit.
- Exit code 143 is 128 + 15 (SIGTERM). The container was asked to stop, typically because the pod was deleted or evicted.
- Exit code 1 usually means a script or application error.

In kubectl describe, look for State: Terminated with Reason: Error or Reason: OOMKilled. The Last State field shows the previous run's details, which is critical when the container is in CrashLoopBackOff.
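The Python helper below turns the Exit Code and Reason from kubectl describe into the interpretations above. It relies on the convention that codes above 128 mean 128 + signal number. It also maps 126 and 127 to their usual shell meanings (found but not executable, and not found). The function name explain_exit is hypothetical, and the signal names assume a Linux host.

```python
import signal


def explain_exit(exit_code: int, reason: str = "") -> str:
    """Interpret an init container's Exit Code and Reason from kubectl describe."""
    if exit_code == 0:
        return "completed successfully"
    if reason == "OOMKilled":
        return "killed for exceeding its memory limit (OOMKilled)"
    if exit_code > 128:
        # Convention: 128 + N means the process was terminated by signal N.
        signum = exit_code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"terminated by {name}"
    if exit_code in (126, 127):
        # Shell convention: 126 = found but not executable, 127 = command not found.
        return "command not executable or not found; check command, args, and image"
    return f"exited with error code {exit_code}; read the logs"
```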

( 09 ) Resource Limits and OOM: The Silent Killer

Init containers often perform heavy tasks like database migrations, data downloads, or asset compilation. If you set no requests or limits at all, the pod falls into the BestEffort QoS class. It is then among the first to be evicted when the node runs short of memory. But if you set the memory limit too low, the init container gets OOMKilled repeatedly. I've seen teams set a 128Mi memory limit on a migration that needed 512Mi. The init container would start, hit the limit, get killed, restart, and repeat forever in CrashLoopBackOff.

To diagnose, use kubectl top pod <pod> --containers to see actual usage. This requires metrics-server and only shows current usage, so if the init container is not running, top shows nothing for it. In that case, look at the Last State in kubectl describe: exit code 137 with reason OOMKilled means the limit was hit. Then raise the memory limit to at least 2x the peak usage you observe in a local run. For CPU, set a sensible request. A CPU limit causes throttling rather than preventing it, so a tight CPU limit makes bursty init tasks slow but won't kill them.
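The 2x rule as a worked calculation, in Python. This sketch assumes the observed peak is a binary-suffixed quantity (Ki, Mi, Gi), the kind kubectl top prints for memory. Decimal suffixes such as M or G are deliberately not handled. The factor of 2 is the heuristic from the paragraph above, not a Kubernetes default, and both function names are hypothetical.

```python
import math
import re

# Binary suffixes only (what kubectl top prints for memory).
_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def parse_memory(quantity: str) -> int:
    """Convert a quantity such as '300Mi' or '1Gi' to bytes; a bare number is bytes."""
    m = re.fullmatch(r"(\d+(?:\.\d+)?)(Ki|Mi|Gi)?", quantity.strip())
    if m is None:
        raise ValueError(f"unsupported memory quantity: {quantity!r}")
    number, unit = m.groups()
    return int(float(number) * _UNITS.get(unit, 1))


def suggest_memory_limit(observed_peak: str, factor: float = 2.0) -> str:
    """Return factor x observed peak, rounded up to whole MiB, as a limit string."""
    mib = math.ceil(parse_memory(observed_peak) * factor / 2**20)
    return f"{mib}Mi"
```

For example, an observed peak of 300Mi gives a suggested limit of 600Mi, and 1Gi gives 2048Mi.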

( 10 ) Network Deadlocks and Dependency Ordering

A common pattern is an init container that waits for a service (e.g., database, cache) to be available. If that service is also deployed via Kubernetes and hasn't started yet (e.g., during a fresh deploy), the init container can block indefinitely. This creates a circular dependency if the service depends on the pod that the init container is part of.

The fix is to give the init container's retry logic a finite timeout and a reasonable backoff. Use tools like curl --retry 5 --retry-delay 5, adding --retry-connrefused so that refused connections are retried too, or wrap the script in timeout 60. Kubernetes has no built-in ordering between Deployments, so also make sure the dependency is deployed and healthy first, for example as a separate Deployment or StatefulSet rolled out before the dependent one. Finally, make sure the dependency does not depend back on the pod whose init container is waiting for it. That breaks the cycle.
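The Python sketch below is a wait-for-dependency check with a hard deadline, as an alternative to a shell loop. wait_for_port attempts a TCP connection and backs off exponentially between attempts. The 0.5s starting delay and 8s cap are assumptions. Once total_timeout has elapsed it gives up, so the init container can exit non-zero instead of blocking the pod forever. Note that a successful TCP connect only proves the port is open, not that the database accepts queries.

```python
import socket
import time


def wait_for_port(host: str, port: int, total_timeout: float = 60.0,
                  attempt_timeout: float = 2.0, max_delay: float = 8.0) -> bool:
    """Return True once host:port accepts a TCP connection, False after total_timeout seconds."""
    deadline = time.monotonic() + total_timeout
    delay = 0.5
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        try:
            # Never let a single attempt run past the overall deadline.
            with socket.create_connection((host, port), timeout=min(attempt_timeout, remaining)):
                return True
        except OSError:
            pass  # refused, unreachable, DNS failure, or per-attempt timeout
        time.sleep(max(0.0, min(delay, deadline - time.monotonic())))
        delay = min(delay * 2, max_delay)
```

In an init container script you might end with sys.exit(0 if wait_for_port(os.environ["DB_HOST"], 5432) else 1). Port 5432 assumes PostgreSQL.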

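( 11 ) ConfigMap and Secret Resolution Failures

Init containers often rely on environment variables from ConfigMaps or Secrets. A referenced ConfigMap or Secret may not exist in the namespace, or a referenced key may be missing. In that case the kubelet won't start the container at all, and the pod shows Init:CreateContainerConfigError, unless the reference is marked optional. If the key exists but its value is empty or wrong, the container starts and fails only when it uses the variable. For example, a shell script that runs set -u exits as soon as it references an unset variable.

To check, run kubectl describe configmap <name> and kubectl describe secret <name>. The latter shows key names and sizes, not values. Also look at the pod's spec under spec.initContainers[].env and envFrom. To confirm the actual value, run a copy of the init container (same image and env) with a sleep command, then kubectl exec into it and echo the variable. Another trick: add a first step to the init container that checks all required env vars are non-empty and fails with a clear message.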
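The fail-fast check from the last suggestion, as a Python sketch. The variable names in the example are hypothetical; DB_HOST comes from the war story above. require_env prints every missing name in one message and exits with code 1. The cause then shows up in kubectl logs --previous, rather than as a confusing failure later in the script.

```python
import os
import sys
from collections.abc import Mapping


def missing_env(required: list[str], environ: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variable names that are unset or blank."""
    return [name for name in required if not environ.get(name, "").strip()]


def require_env(required: list[str]) -> None:
    """Exit with code 1 and one clear message if any required variable is missing."""
    missing = missing_env(required)
    if missing:
        print("missing required environment variables: " + ", ".join(missing), file=sys.stderr)
        sys.exit(1)
```

Call require_env(["DB_HOST", "DB_USER", "DB_PASSWORD"]) as the first line of the init script, before any network calls.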

Frequently asked questions

How do I see the logs of an init container that has already terminated?

Use kubectl logs <pod> -c <init-container-name> --previous. This shows the logs from the last terminated container instance (the one that crashed). Without --previous, you get the current (possibly empty) logs if it restarted.

Can I set a liveness probe on an init container?

No. Regular init containers don't support liveness, readiness, or startup probes, because they are expected to run to completion. If you need a health check, run it as part of the init container's script and exit non-zero on failure. For long-running setup tasks, use a sidecar instead. Since Kubernetes 1.29 (on by default), an init container with restartPolicy: Always runs as a native sidecar alongside the app, and native sidecars do support probes.

What does 'Init:CrashLoopBackOff' mean?

It means the init container is repeatedly crashing (exiting with a non-zero exit code), and Kubernetes is backing off before restarting it. The backoff doubles each time (10s, 20s, 40s... up to 5 minutes). This is a clear sign of a bug in the init container's command or configuration.
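To see why a CrashLoopBackOff pod can look frozen, the Python sketch below approximates the kubelet's restart delays using the schedule described above: doubling from 10s, capped at 5 minutes. It is a model, not kubelet code, and the real delays may differ slightly. The kubelet also resets the backoff once a container has run for 10 minutes without problems. The function name is hypothetical.

```python
def crashloop_delays(restarts: int, initial: int = 10, cap: int = 300) -> list[int]:
    """Approximate wait (seconds) before each restart: doubling from `initial`, capped at `cap`."""
    delays = []
    delay = initial
    for _ in range(restarts):
        delays.append(delay)
        delay = min(delay * 2, cap)
    return delays
```

For seven restarts it returns [10, 20, 40, 80, 160, 300, 300], which is 910 seconds (about 15 minutes) spent just waiting. That is why a crashing init container can look stuck.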

How do I debug an init container that hangs without crashing?

If the pod stays in Init:0/1 without restarting, the init container is most likely hanging. While it is running, you can usually open a shell in it with kubectl exec -it <pod> -c <init-container-name> -- sh, provided the image includes a shell. You can also stream its output with kubectl logs -f <pod> -c <init-container-name>. If you have access to the node, use crictl ps to find the container, crictl logs to see its output, and crictl stats to check its resource usage. If the image has no shell, redeploy with a debug init container that sleeps and exec into that.
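To spot a hang rather than a crash, check how long the init container has been in the running state. The Python sketch below reads state.running.startedAt from the pod JSON and flags init containers that have been running longer than a threshold you choose. The function name and threshold are assumptions. What counts as "too long" depends on the task; a migration may legitimately take minutes.

```python
from datetime import datetime, timezone
from typing import Optional


def long_running_init_containers(pod: dict, threshold_s: float,
                                 now: Optional[datetime] = None) -> list[str]:
    """Names of init containers that have been running longer than threshold_s seconds."""
    now = now or datetime.now(timezone.utc)
    flagged = []
    for status in pod.get("status", {}).get("initContainerStatuses", []):
        running = status.get("state", {}).get("running")
        if not running or "startedAt" not in running:
            continue
        # startedAt is an RFC 3339 UTC timestamp such as "2024-05-01T14:00:00Z".
        started = datetime.fromisoformat(running["startedAt"].replace("Z", "+00:00"))
        if (now - started).total_seconds() > threshold_s:
            flagged.append(status["name"])
    return flagged
```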

Why does my init container work locally but not in Kubernetes?

Common differences: environment variables (missing or different), networking (localhost vs. service DNS), file permissions on mounted volumes, or resource limits (local Docker applies none by default). To reproduce, run the same image and command in a test pod with the same env, volumes, and resource limits.
