When Kubelet Dies Silently: Diagnosing Container Runtime Failures
One of the most frustrating alerts in Kubernetes operations is seeing a node drop to NotReady, with its conditions reporting Unknown and the reason NodeStatusUnknown. When you run kubectl describe node, the only clue you get is:
Kubelet stopped posting node status.
Where do you even begin to look when the primary node agent dies silently?
The Dependent Agent
Through various incidents in my bare-metal homelab, ranging from read-only filesystem lockdowns to misconfigured network plugins, I've learned a critical architectural truth: the kubelet does not exist in a vacuum.
The kubelet is entirely dependent on its Container Runtime Interface (CRI) socket to function. In modern setups, this is usually containerd, listening at /var/run/containerd/containerd.sock.
If the kubelet cannot connect to that socket when it starts, it doesn't just log an error and wait. It exits on purpose, its service manager restarts it, and the cycle repeats as a crash loop.
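A quick way to test that dependency by hand is to check the socket before touching the kubelet. The sketch below is a hypothetical helper, not part of any Kubernetes tooling: the name cri_socket_status is made up for illustration, and the default path simply mirrors the one mentioned above.

```shell
# Hypothetical helper: report whether the CRI socket path looks usable.
# Prints one of: present, not-a-socket, missing.
cri_socket_status() {
    sock="${1:-/var/run/containerd/containerd.sock}"
    if [ -S "$sock" ]; then
        echo "present"          # a Unix socket exists at this path
    elif [ -e "$sock" ]; then
        echo "not-a-socket"     # something is there, but it is not a socket
    else
        echo "missing"          # nothing at this path at all
    fi
}
```

If this prints missing, the runtime is the place to look. The kubelet's side of the story should be visible with journalctl -u kubelet -n 50 --no-pager.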
Common Culprits
When you SSH into a node suffering from a dead kubelet, your first command should rarely be systemctl restart kubelet. Instead, look down the stack and check the container runtime first.
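That bottom-up order can be sketched as a small shell helper. It assumes a systemd host with units named containerd and kubelet. check_unit and check_stack are hypothetical names, and the only real command they wrap is systemctl is-active.

```shell
# Hypothetical helper: print "<unit>: <state>" for one systemd unit.
# Falls back to "unknown" if systemctl is missing or prints nothing.
check_unit() {
    name="$1"
    state="$(systemctl is-active "$name" 2>/dev/null || true)"
    printf '%s: %s\n' "$name" "${state:-unknown}"
}

# Walk the stack from the bottom up: runtime first, kubelet second.
check_stack() {
    for u in containerd kubelet; do
        check_unit "$u"
    done
}
```

If containerd reports anything other than active, journalctl -u containerd -n 50 --no-pager is the next stop, before any kubelet restart.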
If containerd is dead or unhealthy, here are the two most common reasons why:
1. The Hardware Lock
If the node's SSD starts throwing I/O errors, the kernel can remount the filesystem read-only to protect it. containerd then crashes because it can no longer write container logs or manage overlay filesystems. When containerd dies, the CRI socket stops answering, and the kubelet follows it to the grave.
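One way to confirm this failure mode is to look for block devices that are mounted read-only. The sketch below parses /proc/mounts with awk. list_ro_mounts is a hypothetical name, and the /dev/ filter is an assumption made to skip pseudo-filesystems. Some read-only entries are normal (squashfs loop mounts, for example), so treat the output as a hint rather than a verdict.

```shell
# Hypothetical helper: list /dev-backed filesystems mounted read-only.
# Takes an optional mounts file so it can be tried against a sample.
list_ro_mounts() {
    mounts_file="${1:-/proc/mounts}"
    # Fields: device mountpoint fstype options ...
    # Match "ro" only as a whole option, so "errors=remount-ro" is ignored.
    awk '$1 ~ /^\/dev\// && $4 ~ /(^|,)ro(,|$)/ { print $2 " (" $1 ")" }' "$mounts_file"
}
```

Running list_ro_mounts with no arguments reads the live mount table. If the root filesystem shows up, dmesg usually records why the kernel remounted it.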
2. The CNI Path Mismatch
I once deployed Flannel as the Container Network Interface (CNI) plugin, and pods refused to leave the ContainerCreating state. Shortly after, the node dropped offline entirely.
Why? Because Debian's containerd package expects CNI plugins to live in /usr/lib/cni/, but the Flannel installer put them in the standard /opt/cni/bin/. containerd couldn't initialize the network, locked up, and dragged the kubelet down with it. A single symbolic link from the expected location to the real one fixed the entire chain.
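The original command is not reproduced here, so the following is only a sketch of one way to do it, assuming the plugins live in /opt/cni/bin and the runtime looks in /usr/lib/cni as described above. link_cni_plugins is a hypothetical helper. It links each plugin individually and never overwrites anything already present at the destination.

```shell
# Hypothetical helper: symlink every CNI plugin from src into dst.
# Defaults follow the two paths named in the text; both can be overridden.
link_cni_plugins() {
    src="${1:-/opt/cni/bin}"
    dst="${2:-/usr/lib/cni}"
    mkdir -p "$dst" || return 1
    for plugin in "$src"/*; do
        [ -f "$plugin" ] || continue    # skip if the glob matched nothing
        target="$dst/$(basename "$plugin")"
        # Leave existing files and links (even dangling ones) untouched.
        if [ -e "$target" ] || [ -L "$target" ]; then
            continue
        fi
        ln -s "$plugin" "$target" || return 1
    done
    return 0
}
```

On a real node this needs root, and containerd has to be restarted afterwards (systemctl restart containerd) to pick up the plugins. If /usr/lib/cni does not exist at all, a single directory-level link, ln -s /opt/cni/bin /usr/lib/cni, achieves the same thing.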
Key Takeaway
Kubernetes is a layered architecture. When a high-level agent like the kubelet stops reporting, don't just restart it. Look at the foundation it stands on. If the container runtime is dead, the node is dead. Fix the runtime, and the kubelet will typically recover on its own the next time it restarts.