Filesystems and SRE Recovery

When managing physical servers or bare-metal infrastructure, you must eventually act as a hardware technician. Software doesn't just fail because of bugs or bad network connectivity; it fails because the physical silicon beneath it breaks, loses power, or wears out.

Understanding how Linux interacts with physical disks through filesystems is a critical DevOps and Site Reliability Engineering (SRE) skill.

Ext4 and Journaling

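Many Linux distributions (Debian and Ubuntu among them) default to the ext4 filesystem. Ext4 is a "journaling" filesystem: it records the changes it intends to make in a log (the journal) before it commits them to the main filesystem structures. In its default mode, ext4 journals metadata only (inodes, directory entries, allocation maps) rather than file contents.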

Why Journaling Matters

Imagine your database is in the middle of writing 50 files to disk when a sudden power outage occurs.

* On an older non-journaling filesystem (like ext2), the filesystem tree could be left in an inconsistent state. The next boot needs a full fsck scan of the whole partition, and you can lose files or, in bad cases, end up with a partition that will not mount.
* With ext4, when the machine boots back up, it reads the journal. Transactions that were fully recorded are replayed, and the one that was cut off halfway is discarded. The filesystem structure comes back consistent within seconds, although data from writes that were still in flight when power was lost can be missing.
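To see whether a given filesystem actually has a journal, one option is to look for the has_journal feature in the superblock dump printed by tune2fs -l. The helper below is a minimal sketch: the function name has_journal is made up for this example, and /dev/sda1 in the usage comment is only a placeholder device.

```shell
# Succeeds (exit 0) if `tune2fs -l`-style text on stdin lists the
# has_journal feature, and fails (exit 1) otherwise.
has_journal() {
    awk '
        /^Filesystem features:/ {
            # Fields 1-2 are "Filesystem" and "features:"; the rest are flags.
            for (i = 3; i <= NF; i++)
                if ($i == "has_journal") found = 1
        }
        END { exit(found ? 0 : 1) }'
}

# Typical use on a real system (needs root; device name is an example):
#   tune2fs -l /dev/sda1 | has_journal && echo "journaling enabled"
```

The function only parses text, so it can be tried against saved tune2fs output without touching a disk.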

The Read-Only Emergency Lockdown

While journaling protects against sudden power loss, it cannot fix failing physical hardware (like an SSD whose flash cells are wearing out).

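When ext4 detects on-disk corruption, or the kernel reports I/O errors on writes the filesystem cannot do without (such as journal writes), its primary directive is: Prevent Data Loss. If the filesystem is mounted with the errors=remount-ro option, which Debian's installer sets for the root filesystem in /etc/fstab, the kernel aborts the journal and remounts that filesystem as Read-Only. When the affected filesystem is / (root), the whole machine is effectively locked down.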
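Whether a filesystem goes read-only on errors is governed by its errors= mount option (ext4 accepts continue, remount-ro, and panic). As a rough sketch, the helper below reads that option for one mount point out of an fstab-style file. The function name fstab_errors_policy is invented for this example.

```shell
# Print the errors= policy configured for a mount point in an fstab-style
# file. Prints nothing when no errors= option is set for that mount point.
#   $1 = mount point (for example /)
#   $2 = fstab path (defaults to /etc/fstab)
fstab_errors_policy() {
    awk -v mp="$1" '
        $1 !~ /^#/ && $2 == mp {
            # Field 4 holds the comma-separated mount options.
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] ~ /^errors=/) {
                    sub(/^errors=/, "", opts[i])
                    print opts[i]
                }
        }' "${2:-/etc/fstab}"
}

# Typical use:
#   fstab_errors_policy /
# When fstab sets nothing, the superblock default applies; it appears on the
# "Errors behavior:" line of `tune2fs -l <device>`.
```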

Symptoms of a Read-Only Lockdown

A read-only lockdown brings down the entire application stack:

1. Automation Fails: Ansible or bash scripts will suddenly fail with errors such as Read-only file system, Permission denied, or UNREACHABLE: Failed to create temporary directory.
2. Kubernetes Crashes: The kubelet agent will crash-loop because it can no longer write to its local state directory, /var/lib/kubelet. The node stops reporting to the cluster and shows up as NotReady, with NodeStatusUnknown as the reason.
3. Databases Halt: Databases like PostgreSQL will stop as soon as they can no longer write, to protect their data.

Diagnosis

To confirm a read-only lockdown:

1. Attempt to create a file: touch ~/test. If it fails with Read-only file system, the filesystem holding your home directory is locked down.
2. Check the kernel logs: dmesg | tail -n 50. You will likely see EXT4-fs error messages, I/O errors from the block layer, or a line saying the filesystem was remounted read-only.
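The two checks above can also be scripted without writing anything to disk. The sketch below reads the mount flags from /proc/mounts and counts suspicious lines in kernel log text. Both function names are invented for this example, and the grep patterns are a starting point rather than a complete list of kernel messages.

```shell
# Succeeds (exit 0) if the given mount point is currently mounted read-only.
#   $1 = mount point (for example /)
#   $2 = mounts file (defaults to /proc/mounts)
mount_is_readonly() {
    awk -v mp="$1" '
        $2 == mp {
            # Field 4 holds the mount flags; the last matching entry wins.
            n = split($4, opts, ",")
            ro = 0
            for (i = 1; i <= n; i++)
                if (opts[i] == "ro") ro = 1
        }
        END { exit(ro ? 0 : 1) }' "${2:-/proc/mounts}"
}

# Count kernel log lines on stdin that look like ext4 or block I/O trouble.
count_fs_errors() {
    grep -c -E 'EXT4-fs error|I/O error|Remounting filesystem read-only'
}

# Typical use:
#   mount_is_readonly / && echo "root filesystem is read-only"
#   dmesg | count_fs_errors
```

Reading /proc/mounts works even when every disk-backed filesystem is read-only, because procfs lives in memory.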

The SRE Recovery Playbook

Recovering a node from a read-only lockdown requires a strict order of operations:

Step 1: The Hard Reboot

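Because the filesystem is read-only, the standard shutdown sequence can hang or fail, since services that expect to write state or logs on the way down cannot do so. Try sudo reboot first, then sudo systemctl reboot --force. If neither works, you must hard-power the machine off by holding the power button. Before rebooting, copy the dmesg output somewhere off the machine, because logs that could not be written to disk are lost on restart.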

Step 2: Automated fsck

When the machine powers back on, the early boot process (the initramfs and init system, not the bootloader) detects that the filesystem was not cleanly unmounted and automatically runs fsck (Filesystem Check). If the corruption is minor, fsck replays the journal, fixes the inodes, and the machine boots normally into a Read/Write state.

Step 3: Manual fsck

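If fsck finds major inconsistencies during boot, it refuses to repair them automatically (lest it delete important data). It will drop you into an (initramfs) emergency shell, or the system will boot but keep the filesystem mounted as Read-Only. To run the repair manually, make sure the target filesystem is unmounted (or at most mounted read-only), substitute your own device for /dev/sda1, and omit sudo in the initramfs shell, where you are already root: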

sudo fsck -y /dev/sda1

(The -y flag answers "yes" to all repair prompts, so fsck may discard damaged files or move orphaned data into lost+found without asking.)
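After a manual run, the exit status of fsck tells you what happened. It is a sum of flag values documented in the fsck(8) manual page: 1 (errors corrected), 2 (system should be rebooted), 4 (errors left uncorrected), 8 (operational error), 16 (usage error), 32 (cancelled by user), and 128 (shared-library error). The helper below is a small sketch that decodes such a status; the function name describe_fsck_status and the wording of its output are made up for this example.

```shell
# Translate an fsck exit status (a sum of flag values) into readable text.
#   $1 = exit status returned by fsck
describe_fsck_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no errors"
        return 0
    fi
    out=""
    # Each entry is "<flag value>:<meaning>", following fsck(8).
    for pair in "1:errors corrected" "2:reboot required" \
                "4:errors left uncorrected" "8:operational error" \
                "16:usage error" "32:cancelled by user" \
                "128:shared library error"; do
        bit=${pair%%:*}
        text=${pair#*:}
        if [ $((status & bit)) -ne 0 ]; then
            out="${out:+$out, }$text"
        fi
    done
    echo "$out"
}

# Typical use (device name is an example):
#   fsck -y /dev/sda1; describe_fsck_status $?
```

A status that includes 4 means the repair did not fully succeed, so the filesystem should not be remounted read-write yet.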

Step 4: Hardware Death

If fsck reports that the filesystem is completely clean, but running sudo mount -o remount,rw / fails with an error saying the block device is write-protected, you have reached the end of the line.

When an SSD exhausts its rated write cycles, the firmware on many models locks the drive into a permanent read-only mode at the hardware level. To the kernel, the data is still intact and readable, but the physical drive refuses to accept write commands. At this point, copy off any data you still need and physically replace the drive.
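One way to check whether the kernel itself considers a device write-protected is the ro attribute it exposes under /sys/class/block, which reads 1 for a read-only device and 0 otherwise. The sketch below wraps that check. The function name device_is_write_protected is invented here, the sysfs root is a parameter only so the function can be tried against a fake directory tree, and a worn-out drive may reject writes without setting this flag, so treat a 0 as inconclusive.

```shell
# Succeeds (exit 0) if the kernel marks the block device as read-only.
#   $1 = kernel device name, such as sda or sda1
#   $2 = sysfs root (defaults to /sys)
device_is_write_protected() {
    ro_file="${2:-/sys}/class/block/$1/ro"
    [ -r "$ro_file" ] && [ "$(cat "$ro_file")" = "1" ]
}

# Typical use (device name is an example):
#   device_is_write_protected sda && echo "kernel sees sda as write-protected"
# For wear and health details, `smartctl -a /dev/sda` from smartmontools
# prints the drive's SMART data.
```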