LEARN · DEBUGGING GUIDE

Kubernetes Job Hangs: Debugging Pods That Never Complete

When a Kubernetes Job's pod never completes, it's almost always a process that won't exit, a probe that hides or interrupts the work, or a pod that is waiting on resources it never gets. Here's how to find and fix each case.

Intermediate · Kubernetes · 6 min read

What this usually means

The usual root cause is that the main process inside the container never exits. The application may keep a non-daemon background thread alive, block forever on a server it started, or fork a child that the entrypoint keeps waiting on. Alternatively, the pod may be stuck in Pending because of resource constraints or a PersistentVolumeClaim that never binds; in that case the container has not started at all. Probes are a less direct cause: they never keep a pod Running after its process exits, but a health endpoint that keeps returning 200 makes a stalled process look healthy, and a failing liveness probe restarts the container instead of letting it finish. Finally, terminationGracePeriodSeconds (default 30s) matters only once the pod is being terminated: a process that ignores SIGTERM is force-killed when the grace period ends.

( 01 ) Fast diagnosis

The first ten minutes — establish facts before touching code.

  • 1 kubectl describe job <job-name> | grep -E 'Conditions|Start Time|Completions'
  • 2 kubectl logs <pod-name> --tail=50 # add --previous only if the container has restarted
  • 3 kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[*].state}' # waiting, running, or terminated
  • 4 kubectl exec <pod-name> -- ps aux # see if the process is sleeping (needs ps in the image)
  • 5 kubectl get events --field-selector involvedObject.name=<pod-name> --sort-by='.lastTimestamp'
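
If you capture the pod as JSON (kubectl get pod <pod-name> -o json), the same facts can be pulled out programmatically. The following is a minimal Go sketch, not part of any Kubernetes tooling: the summarize helper and the embedded sample document are invented for illustration, while the field names (status.phase, containerStatuses[].state, restartCount) follow the Pod API.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// pod keeps only the status fields needed for a first diagnosis.
type pod struct {
	Status struct {
		Phase             string `json:"phase"`
		ContainerStatuses []struct {
			Name         string `json:"name"`
			RestartCount int    `json:"restartCount"`
			State        struct {
				Waiting *struct {
					Reason string `json:"reason"`
				} `json:"waiting"`
				Running *struct {
					StartedAt string `json:"startedAt"`
				} `json:"running"`
				Terminated *struct {
					ExitCode int `json:"exitCode"`
				} `json:"terminated"`
			} `json:"state"`
		} `json:"containerStatuses"`
	} `json:"status"`
}

// summarize is a hypothetical helper that condenses pod JSON into one line.
func summarize(raw []byte) (string, error) {
	var p pod
	if err := json.Unmarshal(raw, &p); err != nil {
		return "", err
	}
	out := "phase=" + p.Status.Phase
	for _, c := range p.Status.ContainerStatuses {
		switch {
		case c.State.Waiting != nil: // e.g. ImagePullBackOff, CrashLoopBackOff
			out += fmt.Sprintf(" %s=waiting(%s)", c.Name, c.State.Waiting.Reason)
		case c.State.Running != nil: // alive; a hang shows up as an old start time
			out += fmt.Sprintf(" %s=running(since=%s,restarts=%d)",
				c.Name, c.State.Running.StartedAt, c.RestartCount)
		case c.State.Terminated != nil:
			out += fmt.Sprintf(" %s=terminated(exit=%d)", c.Name, c.State.Terminated.ExitCode)
		}
	}
	return out, nil
}

// samplePod is an invented example, trimmed to the fields used above.
const samplePod = `{"status":{"phase":"Running","containerStatuses":[
  {"name":"worker","restartCount":0,
   "state":{"running":{"startedAt":"2024-01-01T03:00:00Z"}}}]}}`

func main() {
	line, err := summarize([]byte(samplePod))
	if err != nil {
		fmt.Println("cannot parse pod JSON:", err)
		return
	}
	fmt.Println(line)
}
```

A container that is running with zero restarts and a start time far in the past points at a process that never exits, whereas a waiting state points at scheduling, image, or crash-loop problems.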
( 02 ) Where to look

The specific files, logs, configs, and dashboards that usually own this bug.

  • kubectl describe job <job-name> # check completion status and backoff
  • kubectl describe pod <pod-name> # look for Events and Conditions
  • kubectl logs <pod-name> --previous # output of the last terminated container, if it has restarted
  • kubectl get pvc # if the Job uses persistent volumes, check binding
  • kubectl top pod <pod-name> # resource usage pattern
  • Kubelet logs on the node (journalctl -u kubelet), plus the container runtime's own logs

( 03 ) Common root causes

Practical causes, not theory. These are the things you will actually find.

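  • Application process does not exit after completing work (e.g., HTTP server never stops)
  • Startup/liveness probe misconfigured: a long initialDelaySeconds, periodSeconds, or failureThreshold delays detection of a stalled process, and a health endpoint that always returns 200 hides it entirely
  • Pod stuck in Pending due to insufficient CPU/memory or a PVC that does not bind
  • SIGTERM not handled, or never delivered because a shell wrapper is PID 1; the container also keeps running when non-daemon threads outlive the main thread
  • Job relies on backoffLimit alone: a pod that stays Running never fails, so no retry or failure is triggered (no activeDeadlineSeconds set)
  • ImagePullBackOff or CrashLoopBackOff mistaken for a hang: the Job shows an active pod, but the container is waiting or restarting rather than running

( 04 ) Fix patterns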

Concrete fix directions. Pick the one that matches your root cause.

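  • Add a proper shutdown mechanism: stop background servers and threads so the process exits once the work is done (e.g., return from main or call os.Exit(0) after the main logic)
  • Remove probes the Job does not need; if you keep one, make it check progress rather than the mere presence of an HTTP listener
  • Set activeDeadlineSeconds on the Job to cap its total runtime; lower terminationGracePeriodSeconds (e.g., 5) only to shorten the wait before a force kill once the pod is being terminated
  • For Pending pods: lower the resource requests, add node capacity, or fix the PVC binding
  • Use kubectl delete pod <pod> so the Job controller creates a replacement pod (a stopgap: the new pod hangs too unless the cause is fixed)
  • Set restartPolicy: Never and rely on backoffLimit for retries, so each failed attempt leaves its own pod and logs behind

( 05 ) How to verify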

A fix you cannot prove is a guess. Close the loop.

  • kubectl get jobs -w shows completions incrementing
  • kubectl get pods --field-selector status.phase=Succeeded shows pod in Succeeded phase
  • kubectl logs <pod> shows application exit message
  • kubectl describe job shows Complete condition
  • kubectl get pod <pod> -o jsonpath='{.status.phase}' returns Succeeded
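
For a scripted check, the Complete condition can be read from kubectl get job <job-name> -o json. This is a small Go sketch under stated assumptions: jobComplete and the sample document are hypothetical, and only the status.succeeded and status.conditions fields of the Job API are used.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// job keeps only the status fields needed to confirm completion.
type job struct {
	Status struct {
		Succeeded  int `json:"succeeded"`
		Conditions []struct {
			Type   string `json:"type"`
			Status string `json:"status"`
		} `json:"conditions"`
	} `json:"status"`
}

// jobComplete (hypothetical helper) reports whether the Job carries a
// Complete=True condition, together with its succeeded pod count.
func jobComplete(raw []byte) (bool, int, error) {
	var j job
	if err := json.Unmarshal(raw, &j); err != nil {
		return false, 0, err
	}
	for _, c := range j.Status.Conditions {
		if c.Type == "Complete" && c.Status == "True" {
			return true, j.Status.Succeeded, nil
		}
	}
	return false, j.Status.Succeeded, nil
}

// sampleJob is an invented example, trimmed to the fields used above.
const sampleJob = `{"status":{"succeeded":1,
  "conditions":[{"type":"Complete","status":"True"}]}}`

func main() {
	done, succeeded, err := jobComplete([]byte(sampleJob))
	if err != nil {
		fmt.Println("cannot parse job JSON:", err)
		return
	}
	fmt.Printf("complete=%t succeeded=%d\n", done, succeeded)
}
```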
( 06 ) Mistakes to avoid

Things that make this bug worse or harder to find.

  • Assuming the pod should be Running after completion; it should be Succeeded
  • Setting liveness probes on batch Jobs: a failing probe restarts a container that is merely busy, and a passing one makes a stalled process look healthy
  • Setting restartPolicy: Always on a Job: the API rejects it, because Job pod templates accept only Never or OnFailure
  • Not checking the previous container's logs (--previous) and its exit code (Last State in kubectl describe pod)
  • Deleting the Job instead of debugging the pod; you lose the completion history

( 07 ) War story

Data Processing Job Stuck at 3:00 AM

Platform Engineer · Kubernetes 1.24, Go service, AWS EKS, gp2 PVC

Timeline

  1. 03:00 SRE alerts that nightly batch job 'data-transform' has been running for 45 minutes (expected 10 minutes)
  2. 03:05 kubectl describe job shows 0 completions, 1 active pod
  3. 03:07 Pod status is Running, no crash, no restart
  4. 03:10 kubectl logs shows 'Processing file 5000 of 10000' and then nothing
  5. 03:12 kubectl exec into pod, ps aux shows the Go binary sleeping in epoll_wait
  6. 03:15 Check application code: it starts an HTTP health server that never shuts down
  7. 03:20 Shut down the HTTP server on completion, push fix
  8. 03:25 Recreate the Job with the fixed image, new pod completes in 8 minutes

I was paged at 3 AM because our nightly data transformation job hadn't finished in 45 minutes. The job normally takes 8-10 minutes to process 10,000 files. I ran kubectl get jobs and saw 0 completions, 1 active pod. The pod was Running, not CrashLoopBackOff. That was my first clue: the process was alive but not progressing.

I exec'd into the container and ran ps aux. The main binary was sleeping in epoll_wait. I checked the logs: the last line was 'Processing file 5000 of 10000', followed by nothing, which at first looked as if processing had stopped halfway. Then I looked at the application code: it spawned a goroutine to serve HTTP health checks for the liveness probe, and nothing ever shut that server down. The main processing loop had finished, but the health server kept running, so the binary never exited. The liveness probe kept returning 200, so Kubernetes thought everything was fine.

The fix was simple: add a channel to signal the HTTP server to shut down when processing completes, then wait for it to stop before exiting main(). We also removed the liveness probe from the Job spec since batch jobs don't need them. After redeploying, the job completed in 8 minutes. The lesson: always ensure your batch process actually exits, and don't use liveness probes on Jobs unless you have a good reason.

Root cause

Application started an HTTP server for health checks that never shut down, preventing the process from exiting.

The fix

Gracefully shut down the HTTP server after processing completes; removed unnecessary liveness probe from Job spec.

The lesson

Batch processes must terminate after work is done. Avoid liveness probes on Jobs – they can mask a stuck process.
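
The fix from this war story can be sketched in Go as follows. This is an illustration under stated assumptions, not the original service's code: runJob, processFiles, the /healthz path, and the 5-second shutdown timeout are all hypothetical, and the listener binds to a random local port only so the sketch runs anywhere.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"net/http"
	"os"
	"time"
)

// processFiles stands in for the real batch work (hypothetical).
func processFiles(n int) int {
	done := 0
	for i := 0; i < n; i++ {
		done++
	}
	return done
}

// runJob starts a health endpoint, does the work, then stops the server
// so the caller can return and the process can exit.
func runJob(files int) (int, error) {
	// Port 0 lets the OS pick a free port; a real Job would use a fixed one.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	srv := &http.Server{Handler: mux}

	serveErr := make(chan error, 1)
	go func() { serveErr <- srv.Serve(ln) }()

	processed := processFiles(files)

	// Work is finished: shut the server down with a bounded wait.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		return processed, err
	}
	// Serve returns http.ErrServerClosed after a clean Shutdown.
	if err := <-serveErr; err != nil && !errors.Is(err, http.ErrServerClosed) {
		return processed, err
	}
	return processed, nil
}

func main() {
	n, err := runJob(10000)
	if err != nil {
		fmt.Fprintln(os.Stderr, "job failed:", err)
		os.Exit(1)
	}
	fmt.Println("processed", n, "files, exiting")
}
```

The point of the design is that main returns as soon as the work and the shutdown are done, so the container exits with code 0 and the pod can reach Succeeded.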

( 08 ) The Process Exit Chain in Kubernetes

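When a container's main process (PID 1) exits, the kubelet records its exit code: 0 moves the pod to Succeeded, while a non-zero code marks it Failed (restartPolicy: Never) or restarts the container (restartPolicy: OnFailure). If the main process never exits, the pod stays Running indefinitely. Common culprits: non-daemon background threads, a main function that blocks on a server it started, and wrapper scripts that wait on daemonized children. A signal handler that traps and ignores SIGTERM is a separate problem: it does not stop the work from completing, but it keeps a deleted pod in Terminating until the grace period expires.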
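
On the application side, the exit code is the only result the kubelet sees. A minimal Go sketch of that mapping, with doWork and exitCode as hypothetical names and the error message invented for the example:

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// doWork is a placeholder for the batch logic (hypothetical).
func doWork(fail bool) error {
	if fail {
		return errors.New("input file missing")
	}
	return nil
}

// exitCode maps the result of the work to the code the kubelet will see:
// 0 lets the pod reach Succeeded, anything else counts as a failure.
func exitCode(err error) int {
	if err != nil {
		return 1
	}
	return 0
}

func main() {
	err := doWork(false)
	if err != nil {
		fmt.Fprintln(os.Stderr, "job failed:", err)
	}
	if code := exitCode(err); code != 0 {
		os.Exit(code)
	}
	// Returning from main ends the process with exit code 0.
}
```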

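Check the container's entrypoint. If it's a shell script that launches the application as a child process, the shell remains PID 1: it typically does not forward SIGTERM to the child, and if it waits on a background task the container keeps running after the real work is done. Use exec to replace the shell with the application process, so that the application is PID 1, receives signals directly, and ends the container when it exits.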

( 09 ) Probe Misconfigurations That Keep Pods Alive

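Probes never keep a pod Running after its process has exited: if the application finishes and exits, the pod becomes Succeeded however the probes are configured. What probes affect is how quickly a stuck process is noticed. A liveness probe with a long initialDelaySeconds (e.g., 300s) will not restart the container during the first 5 minutes, and a high failureThreshold combined with a long periodSeconds delays detection further, because the kubelet restarts the container only after failureThreshold consecutive failures spaced periodSeconds apart. If the probe hits an endpoint that stays healthy while the work is stalled, the stuck process is never detected at all.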
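
To put numbers on that delay, here is a rough Go estimate for a process that is stuck from the start. It is a simplification: the probe struct and worstCaseDetection are hypothetical names, and the formula ignores timeoutSeconds and kubelet scheduling jitter.

```go
package main

import (
	"fmt"
	"time"
)

// probe mirrors the three timing fields of a Kubernetes probe.
type probe struct {
	InitialDelaySeconds int
	PeriodSeconds       int
	FailureThreshold    int
}

// worstCaseDetection gives a conservative estimate of how long a container
// that is unhealthy from the start can run before the kubelet restarts it:
// the initial delay plus failureThreshold probe periods.
func worstCaseDetection(p probe) time.Duration {
	secs := p.InitialDelaySeconds + p.PeriodSeconds*p.FailureThreshold
	return time.Duration(secs) * time.Second
}

func main() {
	// Example values chosen for illustration only.
	slow := probe{InitialDelaySeconds: 300, PeriodSeconds: 60, FailureThreshold: 10}
	// Kubernetes defaults: no initial delay, 10s period, 3 failures.
	defaults := probe{InitialDelaySeconds: 0, PeriodSeconds: 10, FailureThreshold: 3}

	fmt.Println("slow probe:   ", worstCaseDetection(slow))     // 15m0s
	fmt.Println("default probe:", worstCaseDetection(defaults)) // 30s
}
```

With the illustrative slow settings, a Job expected to take 10 minutes could hang for 15 minutes before the probe reacts at all.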

Startup probes add their own delay. The kubelet holds back liveness and readiness checks until the startup probe succeeds. If it never succeeds (e.g., the port is not listening), the container stays not ready until failureThreshold × periodSeconds has elapsed; the kubelet then kills it and restartPolicy decides what happens next. Solution: remove probes from Jobs entirely, or have them check a progress or completion marker.

( 10 ) Resource Deadlocks and Pending Pods

If the Job pod stays in Pending, it isn't hung; it's waiting for resources. Use kubectl describe pod to see events such as '0/1 nodes are available: 1 Insufficient cpu'. The scheduler keeps retrying until resources free up. A Pending pod has not failed, so it does not count toward backoffLimit; only activeDeadlineSeconds, if set, will end the Job.

Persistent Volume Claims that never bind also cause Pending. Check kubectl get pvc – if it's Pending, the storage class may be missing or the volume provisioner is slow. Fix by pre-provisioning the volume or using a different storage class.

( 11 ) Image Pull and CrashLoopBackOff Hidden by Jobs

A Job with backoffLimit and restartPolicy: OnFailure can hide CrashLoopBackOff: the container is restarted inside the same pod, so the pod may appear Running momentarily between crashes. Use kubectl get pods -w to watch the restart count. Also check kubectl describe pod for 'Back-off restarting failed container' events.

ImagePullBackOff is another hidden cause: the Job tries to pull the image, fails, and retries. The pod status is 'ErrImagePull' or 'ImagePullBackOff', not Running. Check the image name and registry credentials.

( 12 ) Using terminationGracePeriodSeconds Effectively

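The default terminationGracePeriodSeconds is 30 seconds. The grace period starts when the pod is asked to terminate (for example, when you delete it or the Job exceeds activeDeadlineSeconds): the kubelet sends SIGTERM, and if the process is still running when the period ends, it sends SIGKILL. The setting does nothing for a pod that simply keeps running. For batch jobs that shut down quickly, a low value (e.g., 5) speeds up termination. If the process ignores SIGTERM entirely, a short grace period only shortens the wait before the force kill; the real fix is the signal handling.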

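Test by sending SIGTERM manually: kubectl exec <pod> -- kill -TERM 1. If the container doesn't exit, the process is either ignoring the signal or, running as PID 1, has no SIGTERM handler installed (the kernel does not apply the default terminate action to PID 1). Nothing force-kills it in this test, because the grace period applies only to pod termination. Use strace or check the signal handler in the code.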
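
On the application side, handling SIGTERM can be as small as the Go sketch below. It is one possible shape, not a prescribed pattern: processAll and the item count are hypothetical, and the per-item work is left as a comment.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

// processAll works through items until it is done or stop is closed.
// It returns the number of items completed and whether it finished them all.
func processAll(stop <-chan struct{}, items int) (int, bool) {
	for i := 0; i < items; i++ {
		select {
		case <-stop:
			return i, false // asked to stop: report progress and bail out
		default:
		}
		// Real per-item work would go here.
	}
	return items, true
}

func main() {
	// ctx is cancelled when SIGTERM (pod termination) or Ctrl-C arrives.
	ctx, cancel := signal.NotifyContext(context.Background(), syscall.SIGTERM, os.Interrupt)
	done, finished := processAll(ctx.Done(), 1000)
	cancel()

	if !finished {
		fmt.Fprintf(os.Stderr, "terminated after %d items\n", done)
		os.Exit(1) // non-zero: the attempt counts as failed
	}
	fmt.Println("completed", done, "items")
}
```

Because the loop checks the stop channel between items, the process exits well inside even a short grace period, provided each item is quick to process.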

Frequently asked questions

Why does my Job pod show 'Completed' but the Job itself is not marked complete?

This can happen if the pod has run to completion but the Job controller hasn't updated the Job status yet. Wait a few seconds, then check again. Also check spec.completions: if it is greater than 1, the Job is complete only after that many pods have succeeded. If the status still doesn't change, the Job controller might be stuck; you can delete the Job and recreate it.

My Job pod is stuck in 'ContainerCreating' – what's happening?

The container runtime is pulling the image or mounting volumes. Check kubectl describe pod for events like 'Failed to pull image' or 'Unable to mount volume'. Also check if the node has enough disk space or if the image registry is accessible.

Should I use liveness probes on batch Jobs?

Generally no. Liveness probes are for long-running services. If your batch process crashes, it is retried anyway: the kubelet restarts the container with restartPolicy: OnFailure, and the Job controller creates a new pod with restartPolicy: Never. A liveness probe can restart a healthy container that's just processing slowly, and a probe that never fails makes a hung pod look healthy.

What does 'backoffLimit' do for Jobs?

backoffLimit sets the number of retries before the Job is marked Failed (default 6). The delay between retries doubles each time (10s, 20s, 40s, ...) up to a cap of 6 minutes. If your pod never exits (stuck in Running), backoffLimit doesn't apply because the pod is still active. Only failed attempts (non-zero exit) count toward backoffLimit.
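
The doubling schedule is easy to tabulate. This Go sketch reproduces the documented sequence (10s base, 6-minute cap); it is not the Job controller's actual code, and backoffDelay is a hypothetical name.

```go
package main

import (
	"fmt"
	"time"
)

const (
	baseDelay = 10 * time.Second // documented first retry delay
	maxDelay  = 6 * time.Minute  // documented cap
)

// backoffDelay returns the wait before retry number n (0-based),
// doubling from baseDelay and capping at maxDelay.
func backoffDelay(n int) time.Duration {
	d := baseDelay
	for i := 0; i < n; i++ {
		d *= 2
		if d >= maxDelay {
			return maxDelay
		}
	}
	return d
}

func main() {
	var total time.Duration
	for n := 0; n < 7; n++ {
		d := backoffDelay(n)
		total += d
		fmt.Printf("retry %d: wait %v (cumulative %v)\n", n, d, total)
	}
}
```

The cumulative column shows why a failing Job can take many minutes to reach its Failed state even when each attempt crashes immediately.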

How do I force a stuck Job to complete?

You can delete the pod: kubectl delete pod <pod-name>. The Job controller will create a replacement pod, and if the underlying issue is fixed, the new pod will complete. Alternatively, you can delete the Job entirely and recreate it. Patching the Job's status by hand to fake completion is not recommended.