
The Read-Only Lockdown: Recovering a Bare-Metal Node from a Kernel Panic

What do you do when your automated deployment pipeline throws an UNREACHABLE error on a machine that you can successfully SSH into? This is the exact scenario I faced recently when bringing up a third worker node in my bare-metal Kubernetes cluster.

The Incident

I was running my standard suite of Ansible playbooks to standardize the configuration across the cluster. When the playbook hit k8s-worker-01 (an older Dell laptop repurposed as a worker node), Ansible threw a confusing error:

UNREACHABLE! Failed to create temporary directory

An UNREACHABLE status usually indicates a complete network drop or an SSH daemon failure. But when I manually SSH'd into k8s-worker-01, it worked perfectly. I had a prompt. The machine was alive.

So, why couldn't Ansible create a temporary directory?

The Diagnosis

I attempted a simple file creation test:

touch ~/test

The terminal spat back:

touch: cannot touch '/home/leva/test': Read-only file system

The entire root filesystem (/) had been locked.
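The touch test shows the symptom. The mount table shows the state directly. Here is a minimal sketch, assuming a Linux host where /proc/mounts is available. The helper name mount_is_readonly and its optional second argument are hypothetical, added only so the function can be run against a saved copy of the mount table:

```shell
#!/bin/sh
# Report whether a mount point is currently mounted read-only.
# Usage: mount_is_readonly <mountpoint> [mounts_file]
# Exit status: 0 = read-only, 1 = read-write, 2 = mount point not found.
# mounts_file defaults to /proc/mounts (assumption: Linux).
mount_is_readonly() {
    mountpoint=$1
    mounts_file=${2:-/proc/mounts}
    # Fields in /proc/mounts: device, mount point, fstype, options, dump, pass.
    # The last matching line wins, because later mounts shadow earlier ones.
    awk -v mp="$mountpoint" '
        $2 == mp { opts = $4; found = 1 }
        END {
            if (!found) exit 2
            n = split(opts, o, ",")
            for (i = 1; i <= n; i++) if (o[i] == "ro") exit 0
            exit 1
        }
    ' "$mounts_file"
}
```

Running mount_is_readonly / && echo "root is read-only" on the affected node would confirm the lockdown without attempting a write.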

I checked the kernel ring buffer for clues using dmesg | tail -n 30 and immediately saw a string of EXT4-fs error messages. The older SSD had apparently suffered a hardware-level fault, or the kernel had hit a serious error of its own. To protect the integrity of the data and prevent further corruption, the Linux kernel deliberately remounted the root filesystem as read-only. This is the behavior ext4 follows when it is mounted with the errors=remount-ro option, which is commonly set in /etc/fstab.
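On a noisy machine, the last 30 lines of dmesg may not include the relevant entries. The sketch below filters the kernel log down to the ext4 lines instead. The helper name ext4_error_lines is hypothetical, and the two patterns are assumptions based on the usual ext4 message wording, which can differ between kernel versions:

```shell
#!/bin/sh
# Print only ext4 error and read-only remount lines from kernel log text
# supplied on stdin. Always returns 0, even when nothing matches, so it is
# safe to use in scripts that run under "set -e".
ext4_error_lines() {
    grep -E 'EXT4-fs error|Remounting filesystem read-only' || true
}

# Typical use on the affected node (reading dmesg may need root on some distros):
#   sudo dmesg -T | ext4_error_lines
```

The -T flag asks dmesg for human-readable timestamps, which makes it easier to line up the first error with whatever the node was doing at the time.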

The Recovery (And Why Rebooting Failed)

My first instinct, as is tradition in IT, was to turn it off and on again.

sudo reboot

The command hung indefinitely. Because the filesystem was read-only, systemd was most likely unable to write the logs and state it needs to halt the OS gracefully. The machine was in a zombie state.

I had to resort to a physical hard power cycle: holding the power button until the machine died.

When I booted it back up, the boot process attempted an automated fsck to repair the ext4 journal. Unfortunately, it wasn't enough. The filesystem was still mounted as read-only.

I had to force a manual filesystem check on the unmounted partition, using -y to approve every repair automatically:

sudo fsck -y /dev/sda1

After several minutes of fixing orphaned inodes and corrupted blocks, fsck reported the filesystem was clean. I ran sudo mount -o remount,rw / and, finally, the filesystem became writable again. containerd came back online and the node successfully rejoined the Kubernetes cluster.
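Before pointing Ansible at the node again, it is worth confirming that writes actually succeed, since a failed write was the original symptom. This is a minimal sketch rather than part of my playbooks. The helper name can_write and the .rw-probe file prefix are hypothetical:

```shell
#!/bin/sh
# Return 0 if a file can be created (and then removed) in the given directory,
# non-zero otherwise. Defaults to the caller's home directory.
can_write() {
    dir=${1:-$HOME}
    # mktemp fails if the directory is missing, unwritable, or on a
    # read-only filesystem; the probe file is deleted immediately on success.
    probe=$(mktemp "$dir/.rw-probe.XXXXXX" 2>/dev/null) || return 1
    rm -f "$probe"
}

# Typical use after the remount:
#   can_write "$HOME" && echo "filesystem is writable again"
```

Ansible needs to create its temporary directory under the remote user's home by default, so probing $HOME checks the same location that triggered the original error.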

Key Takeaway

In the cloud, hardware failures are abstracted away. A dead VM is just terminated and replaced by an Auto Scaling Group. In a bare-metal environment, you are the final line of defense against failing SSDs and kernel panics.

Always check dmesg when weird I/O or "temporary directory" errors occur, and remember that when the filesystem locks, even standard shutdown commands will fail you.