What this usually means
The kubelet on the node has stopped reporting heartbeats to the control plane, or the control plane has marked the node as unhealthy. Common underlying causes fall into three categories: (1) the kubelet process is dead, hung, or crashing; (2) resource exhaustion on the node (disk, memory, PIDs) is starving the kubelet or the runtime; (3) the container runtime (containerd, CRI-O) or CNI plugin is broken, preventing pod lifecycle operations. The textbook fix is to restart the kubelet, but that only papers over the real problem. In production, you need to identify whether the issue is transient (e.g., a spike in disk I/O) or permanent (e.g., a kernel bug in the network driver).
The first ten minutes — establish facts before touching code.
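- 1. kubectl get nodes -o wide → confirm which nodes are NotReady. Then check the heartbeat with kubectl get lease <node> -n kube-node-lease -o yaml. If renewTime is more than 40 seconds old (the usual default grace period), kubelet isn't reporting.
- 2. kubectl describe node <node> | grep -A8 Conditions → check the status of each condition. Ready=False means kubelet itself reports a problem; Ready=Unknown means the control plane has stopped hearing from kubelet.
- 3. ssh <node> 'systemctl status kubelet' → check if kubelet is active and its last restart time. If it's restarting frequently, look at journald.
- 4. ssh <node> 'journalctl -u kubelet -n 100 --no-pager | grep -i error' → scan for common errors: 'PLEG is not healthy', 'failed to connect to containerd', 'out of disk space'.
- 5. ssh <node> 'df -h /var/lib/kubelet' → check disk usage on the kubelet data directory. If usage is above 80%, you are approaching the default eviction thresholds and may have disk pressure.
- 6. ssh <node> 'curl -s http://localhost:10248/healthz' → kubelet healthz endpoint (port 10248 by default). Expect 200 with body "ok". Anything else means kubelet internal failure. The API on port 10250 requires authentication, so an anonymous curl against it typically returns 401 even when kubelet is healthy.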
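The heartbeat check can be scripted. The sketch below assumes kubectl is on PATH and that the cluster publishes node heartbeats as Lease objects in the kube-node-lease namespace (the mechanism on current Kubernetes versions). The function names and the 40-second default are illustrative assumptions, not part of any Kubernetes API.

```python
import json
import subprocess
from datetime import datetime, timezone


def parse_k8s_time(value: str) -> datetime:
    """Parse an RFC 3339 UTC timestamp such as 2024-05-01T09:15:00.000000Z."""
    # fromisoformat() only accepts a trailing "Z" from Python 3.11 onward,
    # so rewrite it as an explicit UTC offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def lease_age_seconds(lease: dict, now: datetime) -> float:
    """Seconds since the kubelet last renewed its node Lease."""
    return (now - parse_k8s_time(lease["spec"]["renewTime"])).total_seconds()


def fetch_lease(node_name: str) -> dict:
    """Read the node's Lease via kubectl (assumes kubectl is configured)."""
    result = subprocess.run(
        ["kubectl", "get", "lease", node_name, "-n", "kube-node-lease", "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


def kubelet_is_reporting(node_name: str, grace_seconds: float = 40.0) -> bool:
    """True if the lease was renewed within the grace period (hypothetical helper)."""
    age = lease_age_seconds(fetch_lease(node_name), datetime.now(timezone.utc))
    return age <= grace_seconds
```

Calling kubelet_is_reporting("<node>") from a workstation gives a quick yes/no before you SSH anywhere. Keeping the time arithmetic separate from the kubectl call makes it easy to check against a saved Lease dump.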
The specific files, logs, configs, and dashboards that usually own this bug.
- /var/log/kubelet/kubelet.log (if file logging is configured) or journalctl -u kubelet -f
- /var/log/pods/ and /var/log/containers/ for container logs; journalctl -u containerd for container runtime errors
- kubectl describe node <node> → Conditions, Capacity, Allocatable, System Info
- kubectl get events --field-selector involvedObject.kind=Node | grep <node>
- ss -tlnp | grep -E '10248|10250' → check that kubelet is listening on its healthz (10248) and API (10250) ports
- crictl pods or crictl ps -a → verify the container runtime can list pods/containers
- /etc/kubernetes/kubelet.conf and /var/lib/kubelet/config.yaml for misconfigurations
Practical causes, not theory. These are the things you will actually find.
- CNI plugin binary missing or incompatible after node reboot or upgrade
- Disk pressure from log rotation failure filling /var/log or /var/lib/kubelet
- Container runtime (containerd/CRI-O) deadlocked due to a bug or resource exhaustion
- kubelet's PLEG (Pod Lifecycle Event Generator) stuck because of a broken container runtime socket
- Node packed with pods at or near the limit (maxPods defaults to 110), causing memory pressure on kubelet
- Kernel module missing for network adapter (e.g., flannel using vxlan but vxlan module not loaded)
- Certificate expiration for kubelet client certs causing authentication failure to API server
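A first pass over the kubelet journal can be automated by matching the error strings quoted in this guide against the causes above. This is a minimal sketch: the signature table is an assumption assembled from those strings, not an exhaustive or official list, and the cause labels are illustrative.

```python
from collections import Counter

# Lower-case substrings quoted in this guide, mapped to the cause they usually suggest.
SIGNATURES = [
    ("pleg is not healthy", "container runtime slow or unresponsive"),
    ("failed to connect to containerd", "container runtime down or socket broken"),
    ("out of disk space", "disk pressure"),
    ("certificate has expired", "expired kubelet client certificate"),
    ("unauthorized", "kubelet credentials rejected by the API server"),
]


def classify(lines):
    """Count how many log lines match each suspected cause."""
    counts = Counter()
    for line in lines:
        lowered = line.lower()
        for needle, cause in SIGNATURES:
            if needle in lowered:
                counts[cause] += 1
    return counts


def top_suspect(lines):
    """Return the most frequently matched cause, or None if nothing matched."""
    ranked = classify(lines).most_common(1)
    return ranked[0][0] if ranked else None
```

Feed it the lines from journalctl -u kubelet -n 500 --no-pager (for example by passing sys.stdin). Treat the result as a hint about where to look next, not a diagnosis.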
Concrete fix directions. Pick the one that matches your root cause.
- If CNI plugin failed: reapply CNI manifests (kubectl apply -f <cni.yaml>) and restart kubelet
- If disk pressure: delete unused container images (crictl rmi --prune), clean up journal logs (journalctl --vacuum-size=500M), or expand disk
- If containerd deadlocked: restart containerd (systemctl restart containerd) then kubelet
- If PLEG unhealthy: check containerd socket permissions, restart containerd, then kubelet
- If kubelet certificate expired: kubeadm certs renew does not cover the kubelet client certificate (kubelet normally rotates it itself), so rejoin the node with a fresh bootstrap token (kubeadm token create --print-join-command on a control plane node)
- If too many pods: increase node maxPods in kubelet config or spread pods across more nodes
A fix you cannot prove is a guess. Close the loop.
- kubectl get nodes -w → watch node transition to Ready
- kubectl describe node <node> | grep Ready → confirm condition is True
- ssh <node> 'curl -s http://localhost:10248/healthz' → should return 200 with body "ok"
- kubectl get pods -A -o wide --field-selector spec.nodeName=<node> → verify pods are running, not Pending
- Check kubelet journal for the absence of error messages after fix
- Run a canary pod on the node: kubectl run test --image=nginx --restart=Never --overrides='{"spec":{"nodeName":"<node>"}}'; once it reaches Running, kubectl delete pod test
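If you would rather keep the canary as a manifest than as a one-line command, the sketch below generates one. The pod name node-canary and the helper names are hypothetical. Setting spec.nodeName binds the pod directly to the node and bypasses the scheduler, which is what you want for this test.

```python
import json


def canary_pod(node_name: str, name: str = "node-canary", image: str = "nginx") -> dict:
    """Build a Pod manifest pinned to one node via spec.nodeName."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "nodeName": node_name,      # bind directly, skipping the scheduler
            "restartPolicy": "Never",   # one-shot pod; do not restart on exit
            "containers": [{"name": "canary", "image": image}],
        },
    }


def render(node_name: str) -> str:
    """Serialize the manifest as JSON, which kubectl apply -f - accepts."""
    return json.dumps(canary_pod(node_name), indent=2)
```

Print render("<node>") and pipe it into kubectl apply -f -, wait for the pod to reach Running, then delete it.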
Things that make this bug worse or harder to find.
- Don't restart kubelet without checking logs first; a restart can clear the hung state you need to inspect
- Don't assume it's a kubelet problem when it's actually a runtime problem (check containerd first)
- Don't drain the node without understanding why it's NotReady; drain also cordons the node, and it stays unschedulable until you run kubectl uncordon
- Don't ignore disk pressure because 'df shows 50%'; check inode usage with df -i
- Don't blindly reinstall kubelet; the same config will cause the same failure
- Don't forget to check if the node itself is healthy (ssh, ping, dmesg for hardware errors)
NotReady After CNI Upgrade: A VXLAN Horror Story
Timeline
- 09:15 PagerDuty alert: 12 nodes in production cluster go NotReady simultaneously
- 09:18 kubectl get nodes confirms 12 nodes NotReady, all in us-east-1b
- 09:20 SSH to one node: systemctl status kubelet shows active but journalctl shows 'PLEG is not healthy'
- 09:22 Check containerd: crictl pods hangs indefinitely
- 09:25 Restart containerd: hangs during shutdown, force kill with kill -9 required
- 09:30 After containerd restart, crictl works, but node still NotReady
- 09:35 kubectl describe node shows NetworkUnavailable=True. Calico logs show 'failed to create vxlan tunnel'
- 09:40 Check kernel: lsmod | grep vxlan → empty. vxlan kernel module not loaded after reboot.
- 09:45 Load module: modprobe vxlan. Calico recovers. Node goes Ready after 30 seconds.
- 09:50 Add vxlan to /etc/modules-load.d/ to survive reboot. Apply to all affected nodes.
- 10:00 All nodes Ready. Root cause: AMI update removed vxlan module from initramfs.
The morning started with a bang. Twelve nodes in our production cluster went NotReady within a minute of each other. My first instinct was to check for a network partition, but the API server was reachable. I SSH'd into one node and ran systemctl status kubelet — it was active, but the logs were screaming 'PLEG is not healthy'. That's usually a sign that the container runtime is broken, not kubelet itself.
I tried crictl pods and it hung. containerd was deadlocked. I had to force kill it with kill -9, then restart it. After that, crictl worked, but kubelet still wouldn't report Ready. The node conditions showed NetworkUnavailable=True. I checked Calico's logs: 'failed to create vxlan tunnel'. A quick lsmod | grep vxlan confirmed the kernel module wasn't loaded.
It turned out our latest AMI build had removed the vxlan module from initramfs. After a reboot (which happened during a routine security patching), the module wasn't available, and Calico couldn't create its overlay network. Loading the module with modprobe vxlan fixed one node instantly. We then added it to /etc/modules-load.d and pushed the fix to all 12 nodes. Lesson learned: always verify kernel module persistence after AMI changes.
Root cause
Kernel module vxlan not loaded after AMI update, breaking Calico CNI overlay network.
The fix
Loaded vxlan module and added it to /etc/modules-load.d. Also validated that the AMI had the module installed.
The lesson
Always test CNI functionality after any node image update. Automate kernel module verification in your node bootstrap script.
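One way to automate the module check in a bootstrap script is to read /proc/modules, the same data lsmod prints. This is a sketch under assumptions: the required list (just vxlan here) depends on your CNI, and modules compiled into the kernel do not appear in /proc/modules, so a miss is a prompt to investigate rather than proof of a problem.

```python
def loaded_modules(proc_modules_text: str) -> set:
    """Extract module names (first column) from /proc/modules content."""
    return {line.split()[0] for line in proc_modules_text.splitlines() if line.strip()}


def missing_modules(required, proc_modules_text: str) -> list:
    """Return the required modules that are not currently loaded, sorted."""
    loaded = loaded_modules(proc_modules_text)
    return sorted(m for m in required if m not in loaded)


def check_host(required=("vxlan",)) -> list:
    """Check the running host; the default list is an assumption for a VXLAN overlay."""
    with open("/proc/modules") as fh:
        return missing_modules(required, fh.read())
```

A bootstrap script could call check_host() and refuse to start kubelet, or run modprobe, when the returned list is non-empty.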
PLEG is the internal kubelet component responsible for detecting pod state changes. When kubelet logs show 'PLEG is not healthy', it means the relisting loop that queries the container runtime has failed repeatedly. This often happens when the container runtime socket is slow or unresponsive.
To diagnose, check the containerd/CRI-O logs and the kubelet's PLEG-related metrics: kubectl get --raw /api/v1/nodes/<node>/proxy/metrics | grep pleg. High values in kubelet_pleg_relist_duration_seconds indicate the runtime is taking too long to respond. Common fixes: restart the container runtime, reduce pod churn or density on the node, or check for I/O pressure on the runtime storage.
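The grep can be replaced with a small parser when you want the PLEG series as numbers. This sketch assumes the kubelet's Prometheus text output carries no trailing timestamps, so the last space-separated field on each line is the sample value; the function name is hypothetical.

```python
def pleg_samples(metrics_text: str) -> dict:
    """Map each PLEG-related series (name plus labels) to its float value."""
    samples = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blanks and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # Split on the last space: everything before it is the series,
        # everything after it is the value (assumes no timestamp field).
        series, _, value = line.rpartition(" ")
        if "pleg" in series:
            samples[series] = float(value)
    return samples
```

Pass it the output of the kubectl get --raw command and compare the relist duration series between a healthy node and the failing one.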
The kubelet's eviction manager monitors the filesystem that holds the kubelet data directory (nodefs) and the one that holds container images (imagefs). If available space falls below the hard eviction threshold (by default nodefs.available<10% and imagefs.available<15%, i.e. 90% and 85% used), the node sets DiskPressure=True and starts evicting pods. However, I've seen cases where a filesystem with many small files (e.g., /var/log) hits the inode limit before the disk space limit.
Always check both df -h and df -i. If inodes are exhausted, delete old logs or adjust the eviction thresholds via kubelet config, for example: --eviction-hard=imagefs.available<5%,nodefs.available<5% --eviction-soft=imagefs.available<10%,nodefs.available<10%. Soft thresholds only take effect with a matching --eviction-soft-grace-period. Also consider using a dedicated partition for /var/lib/kubelet to avoid interference from other system logs.
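Both numbers can be read in one call with os.statvfs, which reports block and inode counts for the filesystem behind a path. A minimal sketch follows; the percentages are computed from unprivileged-available counts, so they can differ by a point or two from what df prints, and the helper names are illustrative.

```python
import os


def used_percent(total: int, available: int) -> float:
    """Percentage of a resource in use; 0.0 when the filesystem reports no total."""
    if total == 0:
        return 0.0  # some filesystems do not report inode totals
    return 100.0 * (total - available) / total


def disk_report(path: str) -> dict:
    """Space and inode usage for the filesystem that holds path."""
    st = os.statvfs(path)
    return {
        "space_used_pct": used_percent(st.f_blocks, st.f_bavail),
        "inodes_used_pct": used_percent(st.f_files, st.f_favail),
    }
```

Run disk_report("/var/lib/kubelet") and disk_report("/var/log") on the node; a low space figure next to a high inode figure is the pattern described above.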
A common cause of NetworkUnavailable is a mismatch between the CNI plugin version and the kubelet version. For example, Calico v3.25 requires certain iptables rules that are not compatible with older kernels. Another case: the CNI binary (e.g., /opt/cni/bin/calico) is missing or corrupted after a node upgrade.
Verify with: ls -la /opt/cni/bin/ and check for the expected binaries. Reapply the CNI manifest with kubectl apply -f <cni.yaml>. In some cases, you need to restart kubelet after the reapply. If using a network policy controller, also check its logs for errors about missing dependencies.
If kubelet's client certificate expires, the API server stops accepting heartbeats, and the node goes NotReady. Symptoms include 'Unauthorized' errors in kubelet logs and 'x509: certificate has expired' in the API server audit logs. The node status will show LastHeartbeatTime older than 40 seconds.
Check certificate expiry with: openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates. On kubeadm clusters the kubelet normally rotates this certificate itself, and kubeadm certs renew does not cover it; if it has expired, rejoin the node with a fresh bootstrap token (kubeadm token create --print-join-command on a control plane node). For managed clusters (EKS, GKE), you may need to re-create the node.
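To turn the openssl output into an alertable number, the notAfter line can be parsed directly. This sketch assumes the default openssl date format (for example "May  1 09:15:00 2025 GMT") and an English locale for month names; the function names are hypothetical.

```python
from datetime import datetime, timezone


def parse_not_after(openssl_dates_output: str) -> datetime:
    """Extract the notAfter timestamp from `openssl x509 -noout -dates` output."""
    for line in openssl_dates_output.splitlines():
        if line.startswith("notAfter="):
            parts = line[len("notAfter="):].split()
            # openssl pads single-digit days with an extra space and appends "GMT";
            # splitting collapses the padding, then the zone name is dropped.
            if parts and parts[-1] == "GMT":
                parts = parts[:-1]
            parsed = datetime.strptime(" ".join(parts), "%b %d %H:%M:%S %Y")
            return parsed.replace(tzinfo=timezone.utc)
    raise ValueError("no notAfter line found")


def days_remaining(not_after: datetime, now: datetime) -> float:
    """Days until expiry; negative once the certificate has expired."""
    return (not_after - now).total_seconds() / 86400
```

Running this from a cron job or node exporter script and alerting when days_remaining drops below a week or so catches failed rotations before the node goes NotReady.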
containerd or CRI-O can enter a deadlock state where it holds a lock on the socket and never responds. This is often triggered by a bug in the runtime itself (e.g., containerd issue #5600) or by resource starvation (e.g., too many concurrent pod operations). In this state, kubelet's PLEG will fail, and the node goes NotReady.
Detection: try crictl pods; if it hangs, the runtime is deadlocked. Force restart with systemctl restart containerd (you may need kill -9 on the containerd process if the stop hangs). To reduce the chance of recurrence, lower pod density on the node via kubelet config (e.g., maxPods: 30) and set containerd's max_concurrent_downloads appropriately.
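Because a deadlocked runtime makes crictl hang forever, any scripted check needs a timeout. The sketch below wraps an arbitrary command with one; the 10-second default and the three result labels are assumptions chosen for illustration.

```python
import subprocess


def probe(cmd, timeout_seconds: float = 10.0) -> str:
    """Run cmd and report 'ok', 'hung' (timed out), or 'error' (failed to run or non-zero exit)."""
    try:
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,  # the child is killed when this expires
            check=True,
        )
    except subprocess.TimeoutExpired:
        return "hung"
    except (subprocess.CalledProcessError, OSError):
        return "error"
    return "ok"
```

On the node, probe(["crictl", "pods"]) returning "hung" points at a deadlocked runtime, while "error" suggests the runtime is down or the socket is misconfigured.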
Frequently asked questions
Why does my node go NotReady after a reboot?
Common causes: a kernel module the CNI needs is not loaded (vxlan, bridge), containerd not started before kubelet, or an expired kubelet certificate. Check kernel modules with lsmod, ensure containerd is enabled (systemctl enable containerd), and verify kubelet certs are valid after reboot.
What does 'PLEG is not healthy' mean?
PLEG (Pod Lifecycle Event Generator) is the kubelet component that watches for pod changes via the container runtime. When it's unhealthy, it means the runtime socket is not responding fast enough or is broken. Check containerd/CRI-O status, restart the runtime, and verify the socket file exists (/run/containerd/containerd.sock).
Can high pod density cause NotReady?
Yes, if a node exceeds its allocatable resources (CPU, memory, PID, disk). Kubernetes evicts pods based on eviction thresholds. If the node has too many pods, the kubelet may become unresponsive. Use kubectl top nodes and check node status conditions (MemoryPressure, PIDPressure). Increase node size or reduce pod count.
How do I force a node to become Ready?
You can't force it; you must fix the underlying issue. However, if you know it's a transient condition, you can restart kubelet (systemctl restart kubelet) which may trigger a re-registration. If that doesn't work, check the kubelet logs for the real reason. Never use kubectl edit node to remove the taint — that only hides the problem.
What if the node is NotReady but pods are running?
This can happen if the kubelet loses connectivity to the API server but keeps running its pods. The pods will continue but cannot be managed. Heartbeats flow from the kubelet to the API server, so check connectivity from the node to the API server endpoint first, then from the control plane to the node (port 10250). Also check if the kubelet certificate is valid. If nothing else restores the connection, you may need to reboot the node.
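A plain TCP connect is enough to tell a network problem from a Kubernetes one. This sketch only tests reachability, not TLS or authentication, and the ports you test are assumptions about your cluster: 10250 for the kubelet as mentioned above, and whatever port your API server listens on (commonly 6443).

```python
import socket


def can_connect(host: str, port: int, timeout_seconds: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:  # covers refused connections, timeouts, and DNS failures
        return False
```

Run can_connect("<api-server-host>", 6443) from the node and can_connect("<node-ip>", 10250) from a control plane host; if either fails, look at security groups, firewalls, and routes before touching kubelet.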