Linux
Linux System Administration
The operating system is the foundation. Every setup task — from creating a user to configuring a container runtime — is a core Linux sysadmin skill.
Theory
The Linux filesystem hierarchy is standardized by the FHS. The directories you'll touch constantly:
| Path | Purpose | Common usage |
|---|---|---|
| `/etc` | System-wide configuration files | hosts, fstab, sudoers, sysctl, modules, containerd |
| `/home` | User home directories | leva's home, SSH keys |
| `/var` | Variable data (logs, runtime state) | containerd state |
| `/proc`, `/sys` | Virtual filesystems exposing kernel state | sysctl reads/writes |
User and group management. Linux is a multi-user OS. Every process runs as a user. Key commands:
- `useradd` / `adduser`: create a user
- `usermod -aG <group> <user>`: add a user to a group
- `groups <user>`: list group memberships
- `id <user>`: show UID, GID, and groups
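After a `usermod -aG`, it can help to confirm the membership from a script rather than by eye. A minimal sketch, assuming a standard `id` command; the function name `in_group` is made up for illustration:

```shell
# Succeeds if <user> belongs to <group> (primary or supplementary).
in_group() {
  # id -nG prints the user's group names on one line, separated by spaces.
  id -nG "$1" | tr ' ' '\n' | grep -qxF -- "$2"
}

# Example: in_group leva sudo && echo "leva can use sudo"
```

New group memberships only apply to new login sessions, so a running shell may still show the old list from a bare `groups`.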
The sudoers system. sudo is not built into the kernel; it is a package. Its configuration lives in `/etc/sudoers` (edited via `visudo`) and in drop-in files under `/etc/sudoers.d/`. The line `leva ALL=(ALL) NOPASSWD: ALL` means: user leva, from any host (ALL), may run commands as any user ((ALL)), without a password (NOPASSWD), for all commands (ALL).
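The rule explained above can be generated and syntax-checked before it is installed. This is a sketch, not the playbook's actual task; the function name `sudoers_rule` and the temporary file path are assumptions:

```shell
# Print a passwordless-sudo rule for the given user.
sudoers_rule() {
  printf '%s ALL=(ALL) NOPASSWD: ALL\n' "$1"
}

# Usage (as root). visudo -cf only checks syntax; it does not install anything:
#   sudoers_rule leva > /tmp/leva.sudoers
#   visudo -cf /tmp/leva.sudoers && install -m 0440 /tmp/leva.sudoers /etc/sudoers.d/leva
```

Validating the file before copying it into `/etc/sudoers.d/` avoids the lockout that a syntax error would cause.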
Package management. Debian uses apt (and the lower-level dpkg):
- `apt-get update`: refresh the package index from mirrors
- `apt-get install -y <pkg>`: install without interactive confirmation
- `apt-mark hold <pkg>`: prevent a package from being upgraded (important for K8s tooling later)
- `dpkg -l | grep <pkg>`: check if a package is installed
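A plain `dpkg -l | grep <pkg>` also matches substrings and packages that were removed but left their config behind (state `rc`). A stricter check might look like the sketch below, which reads `dpkg -l` output and accepts any row whose status character is `i`, so held packages (`hi`) still count. The function name is hypothetical:

```shell
# Reads `dpkg -l` output on stdin; succeeds only if <pkg> is in an installed state.
dpkg_has_installed() {
  awk -v pkg="$1" '
    substr($1, 2, 1) == "i" {
      name = $2
      sub(/:.*$/, "", name)   # strip an architecture suffix such as ":amd64"
      if (name == pkg) found = 1
    }
    END { exit found ? 0 : 1 }
  '
}

# Example: dpkg -l | dpkg_has_installed kubelet && echo "kubelet is installed"
```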
Systemd is the init system and service manager on modern Debian. Key commands:
- `systemctl start/stop/restart <service>`: control a service
- `systemctl enable <service>`: start on boot
- `systemctl status <service>`: check health
- `journalctl -u <service>`: read service logs
Hardware & Storage Checks. You often need to verify physical hardware attributes (e.g., confirming a disk is an SSD for database or Longhorn deployments):
- `lsblk -d -o NAME,ROTA`: list block devices and show whether they are rotational (0 = SSD, 1 = HDD). For more detail, see Checking if a Disk is an SSD or HDD.
- `lsblk -o NAME,SIZE,MOUNTPOINT`: check disk sizes and mount points to identify which disk is which.
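For scripted checks, the ROTA column can be turned into a label. A small sketch, assuming the headerless output of `lsblk -d -n -o NAME,ROTA` (`-n` suppresses the header row); `classify_disks` is an illustrative name:

```shell
# Reads `lsblk -d -n -o NAME,ROTA` output on stdin and labels each device.
classify_disks() {
  awk '{ print $1, ($2 == 0 ? "SSD" : "HDD") }'
}

# Example: lsblk -d -n -o NAME,ROTA | classify_disks
```

ROTA only reports "non-rotational", so virtual disks can also show up as `0`.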
Obstacles
- Debian minimal doesn't include `sudo`. This is the first surprise on a netinst install. You must `su -` to root and install it manually. This was your first troubleshooting entry.
- `visudo` vs. drop-in files. Never edit `/etc/sudoers` directly with a text editor: a syntax error locks you out of sudo. Use `visudo` for validation, or use `/etc/sudoers.d/` drop-ins, which are safer to manage.
- `apt` vs `apt-get`. `apt` is the user-friendly CLI (with progress bars). `apt-get` is the scriptable one (stable output, no prompts with `-y`). Use `apt-get` in scripts, `apt` interactively.
Implementation
- troubleshooting.md: "sudo: command not found"
- `ansible/playbooks/00-bootstrap-debian.yaml`: replaced the manual `prep-node.sh` script to configure hostname and DNS.
Resources
- Debian Administrator's Handbook
- `man sudoers`, `man apt-get`, `man systemctl`
Kernel Preparation for K8s
Kubernetes makes demands on the Linux kernel that go beyond typical server administration. Preparing a node requires configuring three core kernel-level systems: swap, kernel modules, and sysctl parameters.
Theory
The Linux kernel acts as the core interface between the hardware and your container runtime. To run Kubernetes reliably, you must manually adjust how the kernel handles memory and network traffic.
Swap
Swap is disk space used as overflow when RAM is full. The kernel moves inactive memory pages to swap to free up RAM. This is useful for general-purpose servers, but Kubernetes has traditionally forbidden it: by default the kubelet refuses to start on a node with swap enabled.
Why? The scheduler places pods according to their memory requests, and the kubelet enforces them on the node. If a container requests 512 MB of RAM, the scheduler needs to know that 512 MB is physically available. If the OS silently swaps memory to disk, the scheduler's math becomes a lie: pods appear to fit but actually thrash on slow disk I/O.
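The notes don't spell out how swap is turned off. The usual manual equivalent is `swapoff -a` for the running system plus commenting out the swap entry in `/etc/fstab` so it stays off after a reboot. A sketch of the fstab part, with a hypothetical function name:

```shell
# Reads fstab content on stdin and comments out active swap entries.
# A swap entry is a non-comment line whose third field (fs type) is "swap".
comment_out_swap() {
  awk '$1 !~ /^#/ && $3 == "swap" { print "#" $0; next } { print }'
}

# Usage (as root), keeping a backup of the original file:
#   swapoff -a
#   cp /etc/fstab /etc/fstab.bak
#   comment_out_swap < /etc/fstab.bak > /etc/fstab
```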
Kernel Modules
Modules are pieces of kernel code loaded on demand. Two are required for container networking:
| Module | Purpose |
|---|---|
| `overlay` | Enables OverlayFS, the filesystem driver that layers container images. Each container sees a merged view of read-only image layers + a writable top layer. Without this, containerd can't unpack images. |
| `br_netfilter` | Makes bridged network traffic (traffic between containers on the same host via a Linux bridge) visible to iptables. Without this, Kubernetes NetworkPolicies and service routing can't inspect or filter inter-container packets. |
Load them immediately with `modprobe` and persist them in `/etc/modules-load.d/k8s.conf`.
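Both steps might look like the sketch below. The file format is simply one module name per line; the function name `k8s_modules_conf` is an assumption made for illustration:

```shell
# Print the contents for /etc/modules-load.d/k8s.conf: one module name per line.
k8s_modules_conf() {
  printf '%s\n' overlay br_netfilter
}

# Usage (as root): load now, then persist for the next boot.
#   modprobe overlay && modprobe br_netfilter
#   k8s_modules_conf > /etc/modules-load.d/k8s.conf
#   lsmod | grep -E '^(overlay|br_netfilter) '
```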
Sysctl Parameters
sysctl exposes tunable kernel parameters via the /proc/sys/ virtual filesystem. Three parameters matter:
| Parameter | Value | Why |
|---|---|---|
| `net.ipv4.ip_forward` | 1 | Allows the node to forward packets between network interfaces. Without this, pods on one node can't reach pods on another: the kernel drops the packets instead of routing them. |
| `net.bridge.bridge-nf-call-iptables` | 1 | Bridged IPv4 traffic passes through iptables rules. Required for Kubernetes Services (kube-proxy) and NetworkPolicies to work on bridged traffic. |
| `net.bridge.bridge-nf-call-ip6tables` | 1 | Same as above for IPv6. |
These are persisted in /etc/sysctl.d/k8s.conf and applied with sysctl --system.
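A sketch of that drop-in, using only the three parameters listed in the table; the function name `k8s_sysctl_conf` is hypothetical:

```shell
# Print the contents for /etc/sysctl.d/k8s.conf.
k8s_sysctl_conf() {
  cat <<'EOF'
net.ipv4.ip_forward = 1
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
EOF
}

# Usage (as root). The net.bridge.* keys only exist once br_netfilter is loaded.
#   k8s_sysctl_conf > /etc/sysctl.d/k8s.conf
#   sysctl --system
#   sysctl net.ipv4.ip_forward
```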
Obstacles
- "Why does K8s care about the kernel?" Because containers are not VMs: every container on a node shares the host kernel, so the kernel is the container runtime's execution environment.
- Modules not persisting across reboot. `modprobe` loads a module now. `/etc/modules-load.d/` makes it survive reboots. Missing the second step is a classic "it worked until I rebooted" failure.
- sysctl changes lost on reboot. Same pattern: `sysctl -w` is temporary, `/etc/sysctl.d/` is permanent.
Implementation
- `ansible/playbooks/00-bootstrap-debian.yaml`: declarative Ansible tasks managing swap, modules, and sysctl, replacing the imperative `prep-node.sh`.
Resources
- Kubernetes docs — Container Runtimes prerequisites
- OverlayFS kernel docs
- `man sysctl`, `man modprobe`
Proprietary Drivers and Kernels
While Kubernetes generally relies on standard kernel features, running hardware-accelerated workloads (like GPUs for transcoding or machine learning) requires proprietary kernel modules.
DKMS (Dynamic Kernel Module Support)
Unlike Windows, where hardware drivers are isolated pre-compiled binaries, Linux drivers are often compiled directly into the kernel or loaded as highly specific modules (.ko files) that must match the exact version of the running kernel.
When you install a proprietary driver (like nvidia-driver on Debian), the package manager downloads the raw C source code. It then uses DKMS to compile that source code into a kernel module locally on your machine.
- The Gotcha: DKMS cannot compile the module if it doesn't have the kernel headers (the C header files your specific kernel was built with). If you install the `nvidia-driver` package without explicitly installing `linux-headers-amd64`, DKMS will silently fail. `sudo dkms status` will show the module as `added` instead of `installed`, and the hardware will simply fail to initialize on boot.
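Because the failure is silent, it is worth checking the DKMS state explicitly after installing a driver. The sketch below assumes `dkms status` prints one line per module ending in `: <state>`; that layout varies slightly between DKMS versions, and the function name is made up for illustration:

```shell
# Reads `dkms status` output on stdin and prints every module line whose
# state is not "installed". Succeeds only if no such line was found.
dkms_not_installed() {
  awk -F': ' '
    NF >= 2 && $NF !~ /^installed/ { print; bad = 1 }
    END { exit bad ? 1 : 0 }
  '
}

# Example:
#   sudo dkms status | dkms_not_installed || echo "a DKMS module did not build"
# Typical fix: apt-get install -y linux-headers-amd64, then dkms autoinstall
```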
Secure Boot
Secure Boot is a UEFI firmware feature designed to prevent malicious rootkits from loading during the boot process. It does this by cryptographically verifying the signature of the bootloader, the kernel, and every single kernel module.
- The Problem: Debian's boot chain is signed (the shim by Microsoft, the kernel and its modules by Debian), so the system boots fine with Secure Boot enabled. However, when DKMS locally compiles the proprietary `nvidia.ko` module on your machine, it produces an unsigned binary. With Secure Boot enforcing signatures, the kernel refuses to load this unsigned driver, so the GPU never initializes.
- The Fix: For homelab and bare-metal environments running proprietary drivers, the simplest option is to boot into the physical machine's firmware setup (usually F2 or DEL) and explicitly disable Secure Boot. The alternative is to keep Secure Boot on and sign the module with a Machine Owner Key (MOK) enrolled in the firmware.
Bare-Metal Hardware & SRE Recovery
When running your own bare-metal servers, you have to act as the hardware technician. Software doesn't just fail because of bad code; it fails because the physical silicon beneath it breaks.
The Read-Only Emergency Lockdown
If an SSD exhausts its write endurance, or the Linux kernel hits I/O errors or detects filesystem corruption, the kernel's first response is to protect your data by forcefully remounting the entire / filesystem read-only.
Symptoms of a Read-Only Lockdown:
1. Ansible deployments fail with UNREACHABLE (Failed to create temporary directory).
2. The Kubernetes kubelet crash-loops because it can't write to /var/lib/kubelet.
3. Running a simple command like touch ~/test returns Read-only file system.
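These symptoms can be confirmed without writing anything, by reading the mount options of / from `/proc/mounts`. A minimal sketch; `root_is_readonly` is a hypothetical helper name:

```shell
# Reads /proc/mounts-style lines on stdin; succeeds if / is mounted read-only.
root_is_readonly() {
  awk '
    $2 == "/" { opts = $4 }              # keep the last entry for /
    END {
      n = split(opts, o, ",")
      for (i = 1; i <= n; i++) if (o[i] == "ro") exit 0
      exit 1
    }
  '
}

# Example: root_is_readonly < /proc/mounts && echo "root filesystem is read-only"
```

The options are split on commas so that `errors=remount-ro` on a healthy read-write mount is not mistaken for `ro`.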
How to Recover:
1. The Hard Reboot: Because the filesystem is read-only, standard sudo reboot commands often fail because systemd can't write to the shut-down logs. You must physically hold the power button on the machine to hard-kill it.
2. Automated fsck: When the machine boots back up, the boot process runs fsck (filesystem check) before mounting the filesystem. If the corruption was minor, it replays or repairs the journal and boots normally.
3. Manual fsck: If the machine boots but is still read-only, SSH in and repair the filesystem while it is unmounted or mounted read-only (never while it is mounted read-write): sudo fsck -y /dev/sda1, substituting your own device.
4. Hardware Death: If fsck reports the drive is clean, but a remount (sudo mount -o remount,rw /) fails with a write-protected error, the SSD controller has permanently locked the drive into read-only mode and the drive must be replaced.
Cryptographic Deprecations (The Sequoia Bug)
Modern Linux distributions frequently update their internal security policies, which can unexpectedly break legacy software repositories.
A prime example is the Debian 13 Sequoia Bug:
* Debian 13 switched to a strict OpenPGP verifier called sqv.
* sqv enforces modern cryptography, explicitly rejecting older "v3 signature packets".
* The official Kubernetes apt repositories were still signing their release files with v3 signatures.
* Result: apt-get update suddenly fails cluster-wide with Policy rejected packet type.
The Shell Wrapper Bypass:
When the package manager (apt) doesn't give you a flag to relax a strict security tool, you can wrap the tool's binary instead.
Rename /usr/bin/sqv to sqv.real and place a bash script at /usr/bin/sqv that appends --policy-as-of 2025-01-01T00:00:00Z to the arguments before passing them to sqv.real. The verifier then evaluates signatures under its policy as of that date, so the legacy signatures are accepted until the upstream repository updates its signing infrastructure.
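The wrapper described above could be as small as the sketch below. The paths and the `--policy-as-of` value come from the text; the generator function `sqv_wrapper` is just an illustrative way to produce the file:

```shell
# Print the wrapper script. It forwards every argument to the renamed
# binary and appends the --policy-as-of cutoff.
sqv_wrapper() {
  cat <<'EOF'
#!/bin/bash
exec /usr/bin/sqv.real "$@" --policy-as-of 2025-01-01T00:00:00Z
EOF
}

# Usage (as root); a package upgrade of sqv would overwrite the wrapper:
#   mv /usr/bin/sqv /usr/bin/sqv.real
#   sqv_wrapper > /usr/bin/sqv && chmod 0755 /usr/bin/sqv
```

Using `exec` replaces the shell with the real verifier, so its exit code and output reach apt unchanged.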