
Kubernetes Pod Stuck in Pending: Debugging Resource and Scheduling Failures

When a pod stays in Pending, the scheduler usually can't find a suitable node. This guide walks through the exact commands to uncover the blocker, whether it's CPU/memory requests, taints, node selectors, or persistent volume claims.

Intermediate · Kubernetes · 9 min read

What this usually means

The Kubernetes scheduler has evaluated every node in the cluster and found none that satisfies the pod's scheduling constraints. Typical blockers are: insufficient CPU or memory (the pod's requests exceed the unreserved allocatable capacity on every node), node selector or affinity rules that match no node's labels, taints that the pod doesn't tolerate, or a PersistentVolumeClaim (PVC) that is not bound to a PersistentVolume (PV). In cloud environments with cluster autoscaler, Pending can also mean the autoscaler is provisioning a new node (which takes 1–5 minutes). The key is to read the Events section of `kubectl describe pod`. It contains a specific reason string such as 'Insufficient cpu', '0/2 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, 1 node(s) didn't match pod selector', or 'waiting for a volume to be created'. The exact wording varies between Kubernetes versions (recent releases say 'untolerated taint' and 'didn't match Pod's node affinity/selector'), but each reason points to a different root cause.

( 01 ) Fast diagnosis

The first ten minutes — establish facts before touching code.

  • 1. `kubectl describe pod <pod-name>`: check the Events section at the bottom for scheduling failures
  • 2. `kubectl get events --sort-by='.lastTimestamp'`: view cluster-wide events for recent failures
  • 3. `kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, allocatable: .status.allocatable}'`: see allocatable resources per node
  • 4. If using cluster autoscaler: `kubectl logs -n kube-system deployment/cluster-autoscaler` for scaling decisions
  • 5. Check PVC status: `kubectl get pvc` and `kubectl describe pvc <name>` for volume binding issues
  • 6. Verify node selectors and affinities: `kubectl get pod <pod> -o yaml | grep -A 5 nodeSelector` and `... nodeAffinity`
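
Where `jq` is not installed, the projection in step 3 can be approximated in Python. This is a minimal sketch, assuming the input has the shape of `kubectl get nodes -o json`; the function name `allocatable_by_node` and the sample node data are illustrative, not taken from a real cluster.

```python
import json

def allocatable_by_node(node_list: dict) -> dict:
    """Map node name -> status.allocatable from a `kubectl get nodes -o json` document."""
    return {
        item["metadata"]["name"]: item.get("status", {}).get("allocatable", {})
        for item in node_list.get("items", [])
    }

# Hypothetical sample shaped like the kubectl output (values are made up).
sample = json.loads("""
{"items": [
  {"metadata": {"name": "node-a"},
   "status": {"allocatable": {"cpu": "1930m", "memory": "3388Mi"}}},
  {"metadata": {"name": "node-b"},
   "status": {"allocatable": {"cpu": "1930m", "memory": "7291Mi"}}}
]}
""")

for name, alloc in allocatable_by_node(sample).items():
    print(name, alloc.get("cpu"), alloc.get("memory"))
```

To run it against a live cluster, the sample could be replaced with `json.load(sys.stdin)` and the script fed from `kubectl get nodes -o json`.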
( 02 ) Where to look

The specific files, logs, configs, and dashboards that usually own this bug.

  • `kubectl describe pod <pod>`: the Events section is your first stop
  • `kubectl get nodes -o wide`: check node status (Ready, NotReady, SchedulingDisabled) and versions
  • `kubectl describe node <node>`: inspect allocatable resources, taints, and conditions (MemoryPressure, DiskPressure, PIDPressure)
  • Cluster autoscaler logs: `kubectl logs -n kube-system -l app.kubernetes.io/name=cluster-autoscaler`
  • PVC status: `kubectl get pvc` and `kubectl describe pvc <name>`
  • StorageClass configuration: `kubectl get sc` and `kubectl describe sc <name>`
  • `kubectl get pod <pod> -o yaml`: check resource requests, limits, nodeSelector, tolerations, and affinity
( 03 ) Common root causes

Practical causes, not theory. These are the things you will actually find.

  • Insufficient CPU or memory: the pod's requests exceed the unreserved allocatable capacity on every node
  • Node taints with no matching tolerations in the pod spec
  • Node selector or node affinity labels don't exist on any node
  • PVC that references a non-existent or non-provisioning StorageClass
  • Pod anti-affinity rules that prevent co-location on any node
  • Cluster autoscaler is scaling up but node creation is slow or failing (e.g., cloud provider quota)
  • Resource quotas in the namespace rejecting pod creation (the pod is never created, so it is not Pending; look for FailedCreate events on the ReplicaSet or Job)
( 04 ) Fix patterns

Concrete fix directions. Pick the one that matches your root cause.

  • Reduce resource requests in the pod spec to fit available node capacity (the scheduler places pods by requests, not limits)
  • Add tolerations to the pod spec matching node taints (e.g., `tolerations: - key: "key" operator: "Equal" value: "value" effect: "NoSchedule"`)
  • Adjust nodeSelector labels or remove them if not needed
  • Fix PVC/PV binding: create the PV or adjust StorageClass to support dynamic provisioning
  • Relax pod anti-affinity rules or increase node count
  • If using cluster autoscaler, check cloud provider limits (e.g., AWS EC2 instance limits, GCP quota) and increase as needed
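
Before adding a toleration, it can help to check on paper whether it would actually cover the taint. The sketch below mirrors the documented matching rules (an empty `effect` matches every effect, `Exists` ignores the value, `Equal` is the default operator). The function names are illustrative, and the taint and toleration are the placeholder `key`/`value` pair from the bullet above.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if a single toleration covers a single taint."""
    effect = toleration.get("effect", "")
    if effect and effect != taint.get("effect", ""):
        return False  # effect is set and differs
    key = toleration.get("key", "")
    if key and key != taint.get("key", ""):
        return False  # key is set and differs
    operator = toleration.get("operator", "Equal")
    if operator == "Exists":
        return True   # value is ignored; an empty key matches every taint
    if not key:
        return False  # an empty key is only meaningful with Exists
    return toleration.get("value", "") == taint.get("value", "")

def untolerated_taints(tolerations: list, taints: list) -> list:
    """Taints that would keep the pod off the node (PreferNoSchedule never blocks)."""
    return [
        t for t in taints
        if t.get("effect") in ("NoSchedule", "NoExecute")
        and not any(tolerates(tol, t) for tol in tolerations)
    ]

taint = {"key": "key", "value": "value", "effect": "NoSchedule"}
toleration = {"key": "key", "operator": "Equal", "value": "value", "effect": "NoSchedule"}
print(untolerated_taints([toleration], [taint]))  # empty list: the taint is tolerated
print(untolerated_taints([], [taint]))            # the taint blocks scheduling
```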
( 05 ) How to verify

A fix you cannot prove is a guess. Close the loop.

  • Run `kubectl get pods` and confirm STATUS changes from Pending to ContainerCreating then Running
  • Check `kubectl describe pod <pod>` Events for a successful 'Scheduled' event
  • Monitor node conditions: `kubectl describe node <node>` reports MemoryPressure, DiskPressure, and PIDPressure as False
  • If the fix was resource-related, check the Allocated resources section of `kubectl describe node <node>` to confirm total requests stay below allocatable (`kubectl top nodes` shows usage, which the scheduler ignores)
  • For PVC issues: `kubectl get pvc` shows Bound status, and the pod's volume mounts are available
( 06 ) Mistakes to avoid

Things that make this bug worse or harder to find.

  • Assuming Pending always means resource shortage; check events before adding nodes
  • Trying to edit the scheduling fields of an existing pod (most of the pod spec is immutable; delete and recreate the pod, or change the Deployment and let it roll out)
  • Adding unnecessary tolerations for master taints that compromise security
  • Setting CPU limits too low, causing throttling after scheduling (even if the pod becomes Running)
  • Fixing a single pod but not the Deployment's pod template, so the next rollout recreates the broken spec
  • Ignoring cluster autoscaler logs when pods stay Pending and no scale-up is triggered
( 07 ) War story

The Node Selector That Blocked a Production Deployment

Senior Platform Engineer · Kubernetes 1.28 on AWS EKS, cluster autoscaler, 4 nodes (t3.medium and t3.large)

Timeline

  1. 14:22 Deployment rolling update triggered; new pods not starting
  2. 14:24 `kubectl get pods` shows new pods stuck in Pending
  3. 14:25 `kubectl describe pod <new-pod>` shows '0/4 nodes are available: 4 node(s) didn't match pod selector'.
  4. 14:26 Check pod YAML; find nodeSelector: disktype: ssd
  5. 14:27 `kubectl get nodes --show-labels` confirms no node has disktype=ssd label
  6. 14:28 Developer confirms the label was supposed to be added via a previous change but was never applied
  7. 14:30 Remove nodeSelector from Deployment spec and re-apply
  8. 14:32 New pods scheduled and running within 30 seconds

I was on call when a deployment rollout stalled. The team had added a nodeSelector to pin pods to SSD nodes for a new feature, but they'd never labeled the nodes. The pods were pending because no node matched the selector. The events message was clear: '4 node(s) didn't match pod selector'. I checked the pod YAML and saw the selector. Then I checked node labels—none had disktype=ssd.

I asked the developer who made the change. They said they'd used an older branch that had the label logic removed. The fix was simply removing the nodeSelector from the deployment since the feature didn't actually require SSD. I rolled back the selector and the pods scheduled immediately.

The lesson: always verify that labels referenced in nodeSelector or affinity exist on nodes before deploying. A pre-commit hook or a dry-run validation could have caught this. Also, never assume the cluster autoscaler is the issue—check the events first.

Root cause

Node selector `disktype: ssd` in pod spec with no node having that label.

The fix

Removed the nodeSelector from the Deployment; pods scheduled on any node.

The lesson

Always verify node labels exist before using nodeSelector or nodeAffinity. Use `kubectl get nodes --show-labels` to confirm.
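
The label check from this incident can be sketched as a small pre-deploy validation. This is an illustration only: `nodes_matching_selector` is a hypothetical helper, and the node names and labels are made up. In practice the label map would come from `kubectl get nodes -o json`.

```python
def nodes_matching_selector(node_labels: dict, selector: dict) -> list:
    """Names of nodes whose labels contain every key/value pair in the nodeSelector."""
    return sorted(
        name for name, labels in node_labels.items()
        if all(labels.get(key) == value for key, value in selector.items())
    )

# Hypothetical cluster state: no node carries disktype=ssd.
node_labels = {
    "node-a": {"kubernetes.io/os": "linux"},
    "node-b": {"kubernetes.io/os": "linux"},
}
selector = {"disktype": "ssd"}

matches = nodes_matching_selector(node_labels, selector)
if not matches:
    print("nodeSelector", selector, "matches no node; pods would stay Pending")
```

An empty result before rollout is the signal that either the nodes still need labeling or the selector should not ship.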

( 08 ) Reading the Events Message: Why the Scheduler Says No

The scheduler emits events with specific reasons. Common ones include: `Insufficient memory` (the pod's request exceeds what is left unreserved on the node), `Insufficient cpu`, `0/X nodes are available: X node(s) had taint {TaintKey:Value}, X node(s) didn't match pod selector`, and `waiting for a volume to be created`. Each reason narrows the search. For taints, the event lists the taint key and value; run `kubectl describe node <node>` to see the effect. For node selector, it won't tell you which selector failed, so you have to check the pod spec. For PVC, the event will say 'persistentvolumeclaim not found' or 'volume is already used by pod(s)'. The exact text is the most actionable piece of information.

If the event says '0/4 nodes are available: 4 Insufficient cpu', then on every node the CPU requests of the pods already scheduled there, plus the pending pod's request, exceed the node's allocatable CPU. Check the Allocated resources section of `kubectl describe node <node>` for the requests already committed; `kubectl top nodes` shows live usage, which the scheduler does not consider. Sometimes a single large request fits on only one node and that node is ruled out by pod anti-affinity, or capacity is still held by terminating pods that have not finished their graceful shutdown.
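
When many pods are Pending, splitting the FailedScheduling message into per-reason node counts makes the dominant blocker obvious. The parser below is a rough sketch that assumes the `N/M nodes are available: <count> <reason>, ...` layout quoted in this guide; the function name is illustrative, and message wording differs between Kubernetes versions.

```python
import re

def parse_failed_scheduling(message: str) -> dict:
    """Split a FailedScheduling message into {reason: node_count}."""
    head = re.match(r"\s*(\d+)/(\d+) nodes are available: (.*)", message, re.S)
    if not head:
        return {}
    # Newer releases may append a "preemption: ..." sentence; keep only the first part.
    body = head.group(3).split(" preemption:")[0].strip().rstrip(".")
    reasons = {}
    # Reasons are comma-separated, and each one starts with a node count.
    for part in re.split(r",\s+(?=\d+\s)", body):
        m = re.match(r"(\d+)\s+(.*)", part.strip())
        if m:
            reasons[m.group(2)] = int(m.group(1))
    return reasons

msg = ("0/2 nodes are available: 1 node(s) had taint "
       "{node-role.kubernetes.io/master: }, 1 node(s) didn't match pod selector")
for reason, count in parse_failed_scheduling(msg).items():
    print(count, reason)
```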

( 09 ) Cluster Autoscaler: When Pending Is Actually Good

In cloud environments, if no node can fit the pod, the cluster autoscaler (if enabled) will attempt to add a node. During this process, the pod remains Pending. The expected behavior: within a few minutes (typically 1-5 depending on cloud provider), a new node appears and the pod schedules. However, if the autoscaler is misconfigured or hits a limit (e.g., AWS EC2 instance limit, GCP quota), the pod stays pending indefinitely. Check autoscaler logs: `kubectl logs -n kube-system deployment/cluster-autoscaler`. Look for lines like 'no node groups can accommodate the pod' or 'scale-up failed: ...'. If you see 'scale-up is in progress', wait longer.

A common pitfall: the autoscaler may not scale up if the pod has restrictive node selectors or affinities that no existing node group matches. In that case, the autoscaler will log that no node group can accommodate the pod. You may need to update the node group configuration or add a new node group with the required labels. Also, ensure the autoscaler has permissions to modify the node group (IAM roles on AWS, service accounts on GCP).

( 10 ) Understanding Resource Requests vs Limits vs Allocatable

The scheduler only considers requests, not limits, when placing pods. A node's allocatable resources are its capacity minus reserved resources (system daemons, kubelet overhead). To see allocatable: `kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, allocatable: .status.allocatable}'`. The pod's requests must fit within the node's allocatable minus the requests of the pods already scheduled there. If you set limits only (no requests), Kubernetes defaults the requests to the limits, which can cause unexpected scheduling failures. Always set requests explicitly.
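
The limits-only defaulting can be made concrete with a small sketch. It assumes a container dict shaped like the `resources` stanza of a pod spec and ignores namespace LimitRange defaults; `effective_requests` is an illustrative name.

```python
def effective_requests(container: dict) -> dict:
    """Requests the scheduler will see: explicit requests, else the limit for that resource."""
    resources = container.get("resources", {})
    requests = dict(resources.get("requests", {}))
    for name, limit in resources.get("limits", {}).items():
        # A resource with a limit but no request is scheduled as if request == limit.
        requests.setdefault(name, limit)
    return requests

# Hypothetical container: generous limits, no explicit memory request.
container = {
    "name": "app",
    "resources": {
        "requests": {"cpu": "250m"},
        "limits": {"cpu": "2", "memory": "4Gi"},
    },
}
print(effective_requests(container))  # memory is scheduled as a 4Gi request
```

Here the pod needs a node with 4Gi of unrequested memory even though nobody wrote a memory request.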

Sometimes a node appears to have free resources but pods still fail to schedule. Check `kubectl describe node <node>` for conditions like MemoryPressure or DiskPressure; these taint the node and prevent scheduling even if allocatable appears sufficient. Also, consider resource fragmentation across nodes: the cluster may have 4Gi of unrequested memory in total, but split as 2Gi on each of two nodes, so a pod requesting 3Gi won't fit anywhere. Use `kubectl top node` to see actual usage, but remember that requests matter for scheduling, not usage.
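
A per-node fit check under the request-based rule can be sketched as follows. The quantity parsers handle only the common suffixes, and the cluster layout is hypothetical: two nodes, each with 4Gi allocatable and 2Gi already requested.

```python
# Binary and decimal suffixes used by Kubernetes memory quantities (subset).
MEM_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
             "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}

def cpu_millis(quantity: str) -> int:
    """Convert a CPU quantity such as '500m' or '2' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)

def mem_bytes(quantity: str) -> int:
    """Convert a memory quantity such as '3Gi' or '512Mi' to bytes."""
    for suffix, factor in MEM_UNITS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # plain bytes

def fits(allocatable: dict, scheduled: list, pod: dict) -> bool:
    """True if the pod's requests fit into allocatable minus requests already placed."""
    free_cpu = cpu_millis(allocatable["cpu"]) - sum(cpu_millis(r["cpu"]) for r in scheduled)
    free_mem = mem_bytes(allocatable["memory"]) - sum(mem_bytes(r["memory"]) for r in scheduled)
    return cpu_millis(pod["cpu"]) <= free_cpu and mem_bytes(pod["memory"]) <= free_mem

nodes = {
    "node-a": ({"cpu": "2", "memory": "4Gi"}, [{"cpu": "500m", "memory": "2Gi"}]),
    "node-b": ({"cpu": "2", "memory": "4Gi"}, [{"cpu": "500m", "memory": "2Gi"}]),
}
pod = {"cpu": "250m", "memory": "3Gi"}

# 4Gi is unrequested cluster-wide, but no single node has 3Gi unrequested.
print({name: fits(alloc, used, pod) for name, (alloc, used) in nodes.items()})
```

Both nodes report False, which is the situation behind an 'Insufficient memory' event on a cluster that looks half empty.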

( 11 ) PVC and StorageClass Failures: The Hidden Pending Culprit

A pod with a volume claim can stay Pending if the PVC is not bound. With the default `Immediate` volume binding mode, the scheduler will not place the pod until the PVC is bound. Check `kubectl get pvc`; if STATUS is Pending, the claim has not been satisfied yet. Use `kubectl describe pvc <name>` to see events such as 'no persistent volumes available for this claim' or 'Failed to provision volume with StorageClass'. The first means no PV exists matching the claim; the second means the dynamic provisioner is failing. Check the StorageClass with `kubectl get sc` and ensure the provisioner is correct and the cloud provider is not having issues. For example, in AWS EKS, if the StorageClass uses `kubernetes.io/aws-ebs` but the EBS CSI driver is not installed, provisioning will fail.

Another subtle issue: the PVC may be bound while the pod still cannot start because the volume is attached to another node (multi-attach error). The event will say 'volume is already used by pod(s)'. A ReadWriteOnce volume can be mounted by only one node at a time, so either run the pods on the same node, give each pod its own volume, or switch to ReadWriteMany (RWX) on a storage backend that supports it. Also, if the StorageClass sets `volumeBindingMode: WaitForFirstConsumer`, the PVC stays Pending until a pod that uses it is scheduled. This is expected and is not a circular dependency: the scheduler picks a node first, taking the volume's topology constraints into account, and provisioning starts afterwards. If the pod is Pending in this mode, the blocker is one of the other scheduling constraints, or no node satisfies the StorageClass's allowed topologies. Switching to `Immediate` is rarely the fix and can provision the volume in a zone where the pod cannot run.
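
A first-pass triage of a Pending PVC can be sketched as below. It assumes dicts shaped like `kubectl get pvc -o json` and `kubectl get sc -o json` items, with StorageClasses keyed by name; `diagnose_pvc`, the returned strings, and the sample objects are all illustrative.

```python
def diagnose_pvc(pvc: dict, storage_classes: dict) -> str:
    """Classify why a PVC is not Bound, using the claim and the known StorageClasses."""
    if pvc.get("status", {}).get("phase") == "Bound":
        return "bound"
    sc_name = pvc.get("spec", {}).get("storageClassName")
    if not sc_name:
        return "no storageClassName: needs a default StorageClass or a matching PV"
    sc = storage_classes.get(sc_name)
    if sc is None:
        return f"StorageClass {sc_name!r} does not exist"
    # volumeBindingMode is a top-level StorageClass field; Immediate is the default.
    if sc.get("volumeBindingMode", "Immediate") == "WaitForFirstConsumer":
        return "waiting for first consumer: expected until the pod is scheduled"
    return "provisioning pending: check the PVC events and the provisioner"

# Hypothetical objects for illustration.
classes = {
    "fast": {"provisioner": "ebs.csi.aws.com", "volumeBindingMode": "WaitForFirstConsumer"},
    "standard": {"provisioner": "ebs.csi.aws.com"},
}
claim = {"spec": {"storageClassName": "fast"}, "status": {"phase": "Pending"}}
print(diagnose_pvc(claim, classes))
```

A 'waiting for first consumer' result means the PVC is not the root cause, and attention should move back to the pod's own scheduling events.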

( 12 ) Advanced Scheduling: Pod Affinity, Anti-Affinity, and Topology Spread Constraints

Complex scheduling rules can cause Pending even when resources are abundant. Pod anti-affinity with `requiredDuringSchedulingIgnoredDuringExecution` (and a hostname topology key) prevents two matching pods from landing on the same node. If the replicas of a workload repel each other this way, the scheduler can place at most one replica per node. That works as long as there are at least as many nodes as replicas; with more replicas than nodes, the extra pods stay Pending. The event will say 'X node(s) didn't match pod anti-affinity rules'.

Similarly, topology spread constraints can require pods to be spread evenly across zones or hosts. With `whenUnsatisfiable: DoNotSchedule`, a pod may only land in a domain where the resulting skew stays within `maxSkew`. If the only such zone has no node with room (for example, its nodes are full or tainted), the pod stays Pending even though other zones have capacity. Check the pod YAML for `topologySpreadConstraints` and `podAntiAffinity`. The event message often includes the specific constraint that failed. To fix, relax the constraints (change `requiredDuringSchedulingIgnoredDuringExecution` to `preferredDuringSchedulingIgnoredDuringExecution` for affinity rules, or `whenUnsatisfiable: DoNotSchedule` to `ScheduleAnyway` for spread constraints) or add more nodes/zones.
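
The skew rule can be worked through with a few lines of Python. The sketch assumes the usual definition (matching pods in the candidate domain after placement, minus the smallest count across domains, must not exceed `maxSkew`) and ignores refinements such as `minDomains` and node affinity filtering. Zone names and counts are hypothetical.

```python
def allowed_domains(pod_counts: dict, max_skew: int) -> list:
    """Domains where one more matching pod keeps the skew within max_skew."""
    if not pod_counts:
        return []
    global_min = min(pod_counts.values())
    return sorted(
        domain for domain, count in pod_counts.items()
        if (count + 1) - global_min <= max_skew
    )

# Matching pods currently running per zone.
zones = {"zone-a": 2, "zone-b": 2, "zone-c": 1}
print(allowed_domains(zones, max_skew=1))  # only zone-c keeps the skew at 1
```

With these counts the next replica must go to zone-c. If every node in zone-c is full, the pod stays Pending regardless of free capacity in zone-a and zone-b.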

Frequently asked questions

What does '0/4 nodes are available: 4 Insufficient cpu' mean exactly?

It means that on every node, the total CPU requests of all pods (including the pending one) exceed the node's allocatable CPU. The scheduler cannot fit the pod on any node. To fix, either reduce the pod's CPU request, increase node size, or add more nodes.

How do I check if a PVC is causing the pod to be pending?

Run `kubectl describe pod <pod-name>` and look for events like 'persistentvolumeclaim "<claim-name>" not found', 'pod has unbound immediate PersistentVolumeClaims', or 'waiting for a volume to be created'. Then check the PVC with `kubectl get pvc` and `kubectl describe pvc <claim-name>`. If the PVC is Pending and its StorageClass does not use `WaitForFirstConsumer`, it's the root cause.

Why does my pod stay pending even though nodes have free resources?

If nodes have free resources but the pod is still pending, check for node selectors, affinities, taints, or topology constraints. Use `kubectl describe pod` to see the exact event message. It will tell you if the pod didn't match node labels, lacked tolerations, or violated anti-affinity rules.

Can cluster autoscaler cause a pod to be pending indefinitely?

Yes, if the autoscaler cannot add a node (e.g., instance limit reached, insufficient quota, or no node group matching the pod's requirements). Check autoscaler logs: `kubectl logs -n kube-system deployment/cluster-autoscaler`. Look for lines like 'no node groups can accommodate the pod' or 'scale-up failed'.

How do I fix a pod stuck in pending due to taints?

Add tolerations to the pod spec matching the node's taint. For example, if a node has taint `key=value:NoSchedule`, add `tolerations: - key: "key" operator: "Equal" value: "value" effect: "NoSchedule"`. You can also remove the taint from the node if it's not needed: `kubectl taint nodes <node> key=value:NoSchedule-` (note the trailing hyphen).