Kubernetes

Installation Process

Note on Infrastructure as Code: The steps below originally mapped to a bash script (scripts/install-k8s.sh). They have since been migrated to a declarative Ansible playbook (ansible/playbooks/02-install-k8s.yaml). The underlying theory remains exactly the same.

Here is a detailed breakdown of exactly what must happen to install Kubernetes components on Debian (homelab environment) from scratch:

1. Root User Check

if [ "$EUID" -ne 0 ]; then
  echo "Please run as root (or with sudo)"
  exit 1
fi

Why: Installing packages and adding system-level repositories requires administrator privileges. This check ensures the script doesn't fail halfway through because it wasn't run with sudo.

2. Bypassing Debian 13 Signature Policy (Sequoia v3)

The Root Cause: Debian 13 ships with Sequoia, a modern Rust-based OpenPGP implementation, as its default signature verifier (via sqv). In early 2026, Sequoia enforced a long-announced deprecation: OpenPGP v3 signature packets are no longer accepted as of 2026-02-01T00:00:00Z. The Kubernetes apt repository still signs its InRelease files with a v3 signature packet (the older, pre-RFC 4880 format), causing Sequoia to hard-reject it.

if command -v sqv &>/dev/null; then
  if [ ! -f /usr/bin/sqv.real ]; then
    mv /usr/bin/sqv /usr/bin/sqv.real
  fi
  cat > /usr/bin/sqv <<'EOF'
#!/usr/bin/env bash
exec /usr/bin/sqv.real --policy-as-of 2025-01-01T00:00:00Z "$@"
EOF
  chmod +x /usr/bin/sqv
fi

trap '[ -f /usr/bin/sqv.real ] && mv /usr/bin/sqv.real /usr/bin/sqv' EXIT

How this wrapper fixes the issue:

command -v sqv checks whether sqv is present (more portable than checking a config file path).
The real binary is renamed to sqv.real (only on the first run, to avoid double-renaming on reruns).
A shell wrapper is written in its place. It prepends --policy-as-of 2025-01-01T00:00:00Z to every invocation, which tells sqv to evaluate the policy as of a pre-deprecation date, and forwards all original arguments with "$@".
exec replaces the shell process with sqv.real directly (no subshell overhead, clean process table).
The trap ... EXIT runs on any exit (success, failure, or Ctrl+C). This ensures the real sqv binary is always restored to its original state so the system isn't left with a patched binary after the script finishes.

(Note: We use this wrapper because apt doesn't support passing custom flags to sqv directly, and we don't want to disable security entirely by falling back to [trusted=yes].)

3. Installing Prerequisites

apt-get update
apt-get install -y apt-transport-https ca-certificates curl gpg

Why: Out of the box, Debian's package manager (apt) might not be fully equipped to download packages securely over HTTPS or to verify custom digital signatures.

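apt-transport-https and ca-certificates allow apt to securely connect to the Kubernetes servers. On current Debian releases apt-transport-https is only a transitional package (apt has handled HTTPS natively since version 1.5), so listing it is harmless but no longer strictly required.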
curl is used to download the security keys.
gpg is used to process those security keys.

4. Adding the Official Kubernetes Repository

mkdir -p /etc/apt/keyrings
rm -f /etc/apt/keyrings/kubernetes-apt-keyring.gpg
curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.31/deb/Release.key | gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg --yes
chmod 644 /etc/apt/keyrings/kubernetes-apt-keyring.gpg

echo "deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.31/deb/ /" | tee /etc/apt/sources.list.d/kubernetes.list > /dev/null

Why: Debian's default software repositories do not include Kubernetes. We have to tell Debian exactly where to download it from.

First, we download the GPG signing key from the Kubernetes project's community-owned package repository (pkgs.k8s.io). apt uses this key to verify that the packages we download haven't been tampered with by a malicious third party. We store it in /etc/apt/keyrings/, the location Debian recommends for keys of third-party repositories.
Second, we add the actual URL for the v1.31 repository into Debian's list of software sources (sources.list.d) and pin it strictly to the downloaded key using signed-by=.

5. Installing the Core Components

apt-get update
apt-get install -y kubelet kubeadm kubectl

Why: This installs the holy trinity of Kubernetes cluster building:

kubelet: This is the primary "node agent" that runs on every single machine in the cluster. It talks to your container runtime (containerd from Phase 1) and makes sure your containers are actually running.
kubeadm: This is the "bootstrap" tool. You will use this to run kubeadm init on the ROG (to create the cluster) and kubeadm join on the Dell (to connect it to the ROG).
kubectl: This is the command-line interface. It's how you talk to the cluster once it's built to tell it to deploy applications, check logs, etc.

6. Pinning the Package Versions (Extremely Important)

apt-mark hold kubelet kubeadm kubectl

Why: If you ever run apt-get upgrade on your servers in the future, Debian will automatically upgrade all installed software. You do not want Debian to automatically upgrade Kubernetes. Upgrading a Kubernetes cluster must be done deliberately and carefully (one node at a time). If apt upgraded kubelet randomly in the background, it could break your cluster. apt-mark hold tells Debian: "Never upgrade these three packages unless I explicitly remove this hold."
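
As a quick sanity check after placing the holds, a script can compare the output of apt-mark showhold against the three expected package names. The helper below is a minimal sketch, not part of the original playbook: the function name missing_holds is hypothetical, and it reads the hold list from standard input so it can be exercised without touching apt.

```shell
# Print every expected package that is NOT in the hold list read from stdin.
# Intended use: apt-mark showhold | missing_holds
missing_holds() {
  held="$(cat)"
  for pkg in kubelet kubeadm kubectl; do
    # -x matches whole lines only, so "kubectl" does not match "kubectl-foo".
    printf '%s\n' "$held" | grep -qx "$pkg" || printf '%s\n' "$pkg"
  done
}
```

An empty result means all three packages are held; any name printed still needs apt-mark hold.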

7. Enabling the Kubelet Service

systemctl enable --now kubelet

Why: enable tells systemd (Debian's service manager) to start the kubelet automatically every time the server boots, and --now also starts it immediately. (Note: The kubelet will crash-loop right now if you check its status, which is normal: it's waiting for you to run kubeadm to tell it what to do!)

8. Verification

Once the script completes, you can verify that the client components were installed correctly:

$ kubectl version --client
Client Version: v1.32.3
Kustomize Version: v5.5.0

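Why: This confirms that kubectl is installed and in your system's PATH. (Note: It only checks the client version right now because the cluster control plane hasn't been initialized yet.) The minor version reported should match the repository configured in step 4: the sample output above comes from a v1.32 client, so with the v1.31 repository shown earlier you should expect a v1.31.x client version instead.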

Cluster Bootstrap (Homelab - Phase 2)

After installing the core components on the nodes, the next step is to initialize the Control Plane and prepare the cluster for workloads.

1. The `kube-vip` "Chicken and Egg" Problem

We use kube-vip to provide a highly-available Virtual IP (VIP) for the Kubernetes API server (192.168.1.50). However, starting with Kubernetes 1.29, stricter RBAC rules create a deadlock when deploying kube-vip as a static pod:

kubeadm init needs to talk to the VIP to initialize the cluster.
kube-vip needs kubeadm to finish so the RBAC super-admin rules exist before it can bind the VIP via leader election.

The Solution (init-control-plane.sh): We solve this by manually binding the VIP to the network interface before running kubeadm init.

# 1. Manually add the VIP so kubeadm can reach the API server locally during bootstrap
ip addr add "192.168.1.50/32" dev "enp4s0" || true

# 2. Initialize cluster with kubeadm using the VIP
kubeadm init --control-plane-endpoint "192.168.1.50:6443" --upload-certs --pod-network-cidr "10.244.0.0/16"

# 3. Generate the kube-vip static pod manifest AFTER kubeadm init
# (ctr run does not pull images automatically, so fetch the image first)
ctr image pull "ghcr.io/kube-vip/kube-vip:v0.8.0"
ctr run --rm --net-host "ghcr.io/kube-vip/kube-vip:v0.8.0" vip /kube-vip manifest pod \
    --interface "enp4s0" \
    --address "192.168.1.50" \
    --controlplane --services --arp --leaderElection > /etc/kubernetes/manifests/kube-vip.yaml

Once the static pod starts, kube-vip takes over management of the VIP automatically.

2. Node Labeling (`label-nodes.sh`)

Nodes should be semantically labeled so workloads can be scheduled intelligently (e.g., ensuring a database pod only runs on a node with an SSD).

The label-nodes.sh script applies labels using the --overwrite flag. This makes the script fully idempotent, meaning it can be run multiple times safely without throwing an error if the label already exists.

kubectl label node k8s-worker-01 node-role.kubernetes.io/worker=worker --overwrite
kubectl label node k8s-worker-01 disk=ssd --overwrite

3. Remote Management

You should rarely run kubectl directly from the cluster nodes. Instead, manage the cluster from your admin workstation (e.g., a MacBook).

Ensure kubectl is installed on your workstation (e.g., via brew install kubectl).
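Copy the kubeconfig from the Control Plane to your local machine. This assumes /etc/kubernetes/admin.conf was already copied to ~/.kube/config on the control plane node, as kubeadm init suggests. The file contains cluster-admin credentials, so keep its permissions tight:

scp leva@192.168.1.51:~/.kube/config ./kubeconfig
chmod 600 ./kubeconfig
export KUBECONFIG="$(pwd)/kubeconfig"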

Run kubectl get nodes from your workstation to verify connectivity to the Virtual IP.

4. Storage Prerequisites (Longhorn)

When deploying Longhorn for persistent storage, make sure all participating nodes are using SSDs. Longhorn on spinning disks is technically possible but practically painful. Settle the disk layout up front rather than starting on HDDs and planning to migrate later.

Before committing to the installation, verify your disk types on each node. For example, lsblk -d -o NAME,ROTA prints 0 in the ROTA column for non-rotational disks (SSDs) and 1 for spinning disks:

Worker-01 (Dell): SSD only — good.
CP-01 (ROG): Two disks (e.g., sda is an SSD, sdb is an HDD). Most likely sda is the OS drive and sdb is a secondary HDD for bulk storage (media, etc.). Make sure Longhorn is configured to use a path on the SSD (sda) on the ROG, not the HDD. Longhorn lets you choose the disk path per node during setup.
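
The disk check can also be scripted by reading the kernel's rotational flag for each block device. The following is a minimal sketch: the function name disk_type is hypothetical, and the optional second argument (an alternative sysfs root) exists only so the function can be tried against a fake directory tree.

```shell
# Classify a block device as "ssd" or "hdd" from the kernel's rotational flag.
# $1: device name (e.g. sda); $2: sysfs root, overridable for testing (assumption).
disk_type() {
  flag_file="${2:-/sys/block}/$1/queue/rotational"
  [ -r "$flag_file" ] || { echo "unknown"; return 1; }
  case "$(cat "$flag_file")" in
    0) echo "ssd" ;;   # non-rotational media
    1) echo "hdd" ;;   # spinning disk
    *) echo "unknown"; return 1 ;;
  esac
}
```

Running disk_type sda and disk_type sdb on the ROG would be one way to confirm which disk Longhorn should use. Note that the flag can be misreported behind hardware RAID controllers or inside virtual machines, so treat the result as a hint rather than proof.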

Cluster Baseline (Homelab - Phase 3)

1. The CNI Path Mismatch (Flannel vs Containerd)

When installing a Container Network Interface (CNI) like Flannel on a Debian system, you may encounter an issue where pods become permanently stuck in the ContainerCreating state.

If you run kubectl describe pod <name>, you will see a Sandbox error from the kubelet: failed to find plugin "flannel" in path [/usr/lib/cni]

The Root Cause: There is a strict path mismatch between the OS package manager and the upstream project.

Debian's containerd package is compiled to look for CNI plugins in /usr/lib/cni/.
The kube-flannel DaemonSet (and standard CNI networking plugins) install their binaries into /opt/cni/bin/.

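The Solution (fix-cni-paths.sh): Rather than modifying the global containerd configuration on every node (which can be overwritten during upgrades), we use a script that symlinks every plugin binary from /opt/cni/bin/ into /usr/lib/cni/ on each node. Because the glob is expanded when the script runs, run it after the Flannel DaemonSet has copied its binary into /opt/cni/bin/: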

# Example from fix-cni-paths.sh
sudo mkdir -p /usr/lib/cni
sudo ln -sf /opt/cni/bin/* /usr/lib/cni/

Once containerd can follow the symlink to find the flannel executable, it successfully provisions the network namespace and the pod transitions to Running.

2. Verifying Overlay Networking

To definitively prove that your CNI is routing packets correctly across physical nodes, you can explicitly force two pods to run on two different nodes and test the connection.

1. Schedule a targeted pod on the Control Plane: By default, standard pods aren't scheduled on the control plane. We can bypass the scheduler using a nodeName override to force a pod onto k8s-cp-01:

kubectl run test-ping --image=busybox --restart=Never --overrides='{"spec": { "nodeName": "k8s-cp-01" }}' -- sleep 3600

2. Ping a pod on the Worker Node: If you have another pod (like test-nginx) running on k8s-worker-01 with IP 10.244.1.3, you can ping it directly from the control plane's test-ping pod:

kubectl exec test-ping -- ping -c 3 10.244.1.3

If you see 0% packet loss, your Flannel overlay network is correctly encapsulating traffic, sending it out the physical enp4s0 interface, routing it over the 192.168.1.0/24 network to the worker node, and decapsulating it back to the pod.

3. SRE: Diagnosing `NotReady` and `NodeStatusUnknown`

When a node drops out of the cluster, Kubernetes reports its condition in kubectl describe node <name>. Three common conditions explain completely different failures:

`NotReady` (with `NetworkPluginNotReady`)

Symptom: The node is alive, the kubelet is running, but the node refuses to accept pods. The reason given is cni plugin not initialized.
Root Cause: The kubelet is waiting for the container runtime (containerd) to confirm that the network is ready. If containerd cannot find the CNI plugins (e.g., because of the Debian /usr/lib/cni path mismatch mentioned above), it reports NetworkReady=false.
The Gotcha: containerd caches its CNI plugin paths on startup! Even if you fix the symlink, the node will stay NotReady forever until you explicitly restart containerd (sudo systemctl restart containerd) so it rescans the directory.

`NodeStatusUnknown` (with `KubeletStopped`)

Symptom: The node was working, but suddenly the Control Plane reports NodeStatusUnknown and stops receiving heartbeat pings.
Root Cause: The kubelet agent on the worker node has completely died or crash-looped.
The Gotcha (The CRI Dependency): Often, the kubelet configuration is perfectly fine, but the container runtime (containerd) has crashed (perhaps due to a disk error or read-only filesystem lock). The kubelet strictly depends on the CRI (Container Runtime Interface) socket located at /var/run/containerd/containerd.sock. If containerd is dead, the socket disappears, and the kubelet intentionally crash-loops until containerd comes back online.

`NotReady` (with `KubeletStopped` due to Swap)

Symptom: The kubelet crashes instantly on boot with the error running with swap on is not supported.
Root Cause: Even if you previously disabled swap (swapoff -a), that only lasts until the next reboot. systemd generates swap units from /etc/fstab at boot (via systemd-fstab-generator), so any swap partition still listed there (like /dev/sda3) is re-enabled. By default the kubelet refuses to start while swap is on (failSwapOn defaults to true), because swap makes pod memory accounting and scheduling unreliable.
The Fix: You must explicitly remove or comment out the swap entry in /etc/fstab (e.g., sed -i '/swap/d' /etc/fstab, which deletes every line containing the word swap, so back up the file first) to ensure it stays dead across reboots.
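
A more conservative variant of the fix comments out the swap entries instead of deleting them and keeps a backup copy. This is a sketch under assumptions: the function name disable_swap_entries is hypothetical, and the pattern expects the usual whitespace-separated fstab fields with swap as its own field.

```shell
# Comment out active swap entries in an fstab file, keeping a .bak backup.
# $1: path to the fstab file (defaults to /etc/fstab).
disable_swap_entries() {
  fstab="${1:-/etc/fstab}"
  # Match uncommented lines that contain "swap" as a separate field,
  # then prefix them with "# " so the entry is ignored at boot.
  sed -E -i.bak '/^[[:space:]]*[^#[:space:]].*[[:space:]]swap[[:space:]]/ s/^/# /' "$fstab"
}
```

On a real node this would be run as root, followed by swapoff -a to turn swap off for the current boot as well.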

4. Bare-Metal Load Balancing (MetalLB)

In a managed cloud environment (AWS, GCP), creating a Service of type: LoadBalancer automatically triggers a cloud API call to provision an external load balancer and assign a public IP to your cluster.

On bare-metal (like a homelab), this API does not exist. Out-of-the-box Kubernetes does not provide network load balancing. If you create a LoadBalancer service, it will remain in a Pending state indefinitely.

The Solution (MetalLB): MetalLB bridges the gap between Kubernetes and your physical network.

You allocate a pool of unused IP addresses on your local subnet (e.g., 192.168.1.200-192.168.1.250) that your router's DHCP server will never assign.
MetalLB is configured with this IPAddressPool.
When a LoadBalancer service is created, MetalLB claims an IP from the pool.
Using an L2Advertisement, MetalLB answers ARP requests for that IP on the local network, announcing that one of the physical cluster nodes "owns" it. The router and other LAN devices then send traffic for that IP to the bare-metal node.
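
The pool and the advertisement described above are two small custom resources. The sketch below renders both to stdout, assuming a CRD-based MetalLB release (v0.13 or later) installed in its default metallb-system namespace. The function name render_metallb_l2 and the pool name passed to it are hypothetical.

```shell
# Render a MetalLB address pool plus an L2 advertisement to stdout.
# $1: pool name (hypothetical), $2: address range on the local subnet.
render_metallb_l2() {
  cat <<EOF
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: $1
  namespace: metallb-system
spec:
  addresses:
    - $2
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: $1-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - $1
EOF
}
```

For example, render_metallb_l2 homelab-pool 192.168.1.200-192.168.1.250 | kubectl apply -f - would create both objects in one step.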

5. Advanced Networking Traps

When deploying complex multi-pod applications (like the Media Automation Stack), you will likely encounter these two common networking traps:

Trap 1: Internal vs. External DNS

Symptom: Pod A (e.g., Radarr) tries to connect to Pod B (e.g., qBittorrent) using its external Ingress URL (http://qbittorrent.homelab.local). The connection fails with Unable to connect.
Root Cause: External domains like .homelab.local are mapped via your workstation's /etc/hosts file or external DNS router. Pods running inside the cluster do not read your workstation's host file.
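The Fix: Pods within the same cluster should always communicate using Internal Kubernetes DNS. Instead of the external domain, simply use the name of the Kubernetes Service (e.g., http://qbittorrent:80). CoreDNS automatically resolves service names to their internal ClusterIPs. The short name only resolves from pods in the same namespace; from another namespace, use <service>.<namespace> or the fully qualified <service>.<namespace>.svc.cluster.local.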
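
When the calling pod may live in a different namespace than the target, the fully qualified Service name is the safest choice. The helper below is a small sketch that builds such a URL. The function name service_url is hypothetical, and it assumes the default cluster domain cluster.local.

```shell
# Build the in-cluster URL for a Service.
# $1: service name, $2: namespace, $3: port.
# Assumes the default cluster domain "cluster.local".
service_url() {
  printf 'http://%s.%s.svc.cluster.local:%s\n' "$1" "$2" "$3"
}
```

For instance, service_url qbittorrent media 80 yields the address Radarr could use regardless of which namespace it runs in.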

Trap 2: The `externalTrafficPolicy: Local` Blackhole

Symptom: You deploy an application with a LoadBalancer service, but when you navigate to the IP address from your browser, the connection times out. However, if you check kubectl get pods, the pod is perfectly healthy.
Root Cause: When a Service is configured with externalTrafficPolicy: Local, it instructs the networking layer (kube-proxy and MetalLB) to only route traffic to a pod if it is running on the exact physical node that received the traffic. If the traffic hits worker-01, but the pod is running on worker-02, the packet is dropped immediately.
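The Fix: Change the policy to externalTrafficPolicy: Cluster. This restores default behavior, allowing the receiving node to forward the traffic across the overlay network (Flannel) to whichever node is actually hosting the pod. The trade-off is that the pod no longer sees the client's original source IP, because the forwarding node applies source NAT.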

Hardware and Storage Extensions

Out-of-the-box Kubernetes only understands CPU, RAM, and basic ephemeral disk space. To utilize advanced hardware and persistent storage, the cluster must be extended.

Hardware Accelerators (GPUs)

Kubernetes cannot natively schedule workloads onto physical GPUs. Instead, it relies on a Device Plugin architecture.

You install the vendor's driver and container toolkit (like the nvidia-container-toolkit) on the host OS and configure the container runtime (containerd) to use it.
You deploy a Device Plugin DaemonSet (e.g., nvidia-device-plugin) into the Kubernetes cluster.
The DaemonSet inspects the physical hardware on each node and advertises available resources back to the Kubernetes API server as extended resources (e.g., nvidia.com/gpu: 1).
You can then request the GPU in your pod manifests much like CPU or RAM (extended resources are requested in whole units under limits):

resources:
  limits:
    nvidia.com/gpu: 1

If a pod requests a GPU but none are available (or the plugin failed to load because the driver was missing), the pod will remain stuck in the Pending state with an Insufficient nvidia.com/gpu error.

Storage Classes and Provisioners

Kubernetes decouples storage from the pods using PersistentVolumes (PV) and PersistentVolumeClaims (PVC). A StorageClass defines how that storage is provisioned dynamically.

Local Path Provisioning

The fastest storage available is the physical SSD attached directly to the node. However, Kubernetes doesn't know how to dynamically provision folders on a node's disk out-of-the-box. Using the Rancher Local Path Provisioner, you can create a StorageClass that intercepts PVC requests and automatically creates directories on the host's /opt/local-path-provisioner/ path.

Pros: Blistering fast NVMe/SSD speeds, perfect for databases or media server config directories.
Cons: The data is physically trapped on that specific node. The volume is pinned to the node where it was created, so the pod cannot simply be rescheduled elsewhere; if that node goes down, the pod stays Pending until it returns.

Network File System (NFS)

To share data across the entire cluster so a pod can access it no matter which node it lands on, you need network-attached storage. A classic approach is deploying an NFS Server on a worker node with massive HDD capacity, and exporting it to the cluster subnet.

Pros: Pods can be scheduled anywhere. Supports ReadWriteMany (multiple pods reading/writing the same files simultaneously).
Cons: Significantly slower due to network latency and spinning disk physical limits.

Node Lifecycle Management

Managing a bare-metal Kubernetes cluster requires careful operational procedures, especially when you need to perform physical hardware maintenance.

Safely Evicting Workloads (Draining)

Before you ever pull the physical plug on a bare-metal node, you must safely evict its workloads. If you forcefully power off a node while a stateful pod (like a database) is actively writing to disk, you risk Ext4 filesystem corruption.

The kubectl drain command ensures that all pods are gracefully terminated and rescheduled onto other healthy nodes before the machine goes offline.

kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

Cordoning: The drain command first cordons the node (marking it as SchedulingDisabled), preventing new pods from being scheduled there.
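Eviction: It then evicts the remaining pods through the Eviction API (DaemonSet pods are skipped because of --ignore-daemonsets). Each evicted pod's containers receive a SIGTERM and get their termination grace period to shut down cleanly before being killed.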
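
A drained node stays cordoned until it is explicitly uncordoned, so a maintenance script usually pairs the two commands. The wrapper below is a minimal sketch. The function names and the KUBECTL override variable are hypothetical conveniences, and the 120-second timeout is an arbitrary choice.

```shell
# KUBECTL can be overridden (e.g. KUBECTL=echo) to dry-run the wrapper;
# this override is an assumption of the sketch, not an existing convention.
KUBECTL="${KUBECTL:-kubectl}"

# Cordon and drain a node, giving up after two minutes instead of waiting forever.
drain_node() {
  "$KUBECTL" drain "$1" --ignore-daemonsets --delete-emptydir-data --timeout=120s
}

# Re-enable scheduling once maintenance is finished.
restore_node() {
  "$KUBECTL" uncordon "$1"
}
```

A typical sequence would be drain_node k8s-worker-01, then the hardware work, then restore_node k8s-worker-01 once the node is back and Ready.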

Node Shutdown Procedures and Hangs

If you attempt to gracefully shut down a physical bare-metal node (e.g., using shutdown -h now or poweroff), the system may hang indefinitely and fail to power off. When you force a hard reboot, the node might boot into Emergency Mode with a corrupted or Read-Only Ext4 filesystem.

The Cause: When systemd initiates a shutdown, it aggressively terminates network services. However, Kubernetes components (containerd and kubelet) often hang while trying to cleanly detach pod overlay network namespaces or CNI plugins (like Flannel) because the underlying network is already gone. This forces systemd to wait for its 90-second or 5-minute timeout. If the node loses power during this ungraceful wait, the Ext4 journal is not cleanly flushed, leaving a "dirty" flag on the filesystem.
The Fix: To ensure a clean unmount of all container overlays and volumes, you must manually stop the Kubernetes services before issuing the halt command:

systemctl stop kubelet containerd && shutdown -h now

API Timeouts and Script Degradation

Any automation scripts that interact with your cluster rely heavily on the Kubernetes API Server (hosted on the Control Plane). If the Control Plane is offline or shutting down, kubectl commands can hang for a long time waiting for a response, because kubectl applies no request timeout by default.

To ensure your scripts degrade gracefully when the API is unreachable, always include a timeout flag on non-critical queries:

kubectl --request-timeout=5s get nodes

If the command fails, your script can catch the error and fall back to manual recovery or raw SSH commands instead of crashing completely.
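
The fallback pattern can be sketched as a pair of small functions. The names api_reachable and nodes_or_fallback and the KUBECTL override variable are hypothetical, and the fallback here only prints a hint where a real script would start its SSH-based checks.

```shell
# KUBECTL can be overridden for dry runs; this is an assumption of the sketch.
KUBECTL="${KUBECTL:-kubectl}"

# Return 0 if the API server answers within the timeout, non-zero otherwise.
# $1: optional timeout (defaults to 5s).
api_reachable() {
  "$KUBECTL" --request-timeout="${1:-5s}" get nodes >/dev/null 2>&1
}

# List nodes when the API is up; otherwise report the failure and return 1
# so the caller can switch to its manual recovery path.
nodes_or_fallback() {
  if api_reachable 5s; then
    "$KUBECTL" --request-timeout=5s get nodes
  else
    echo "API unreachable: falling back to SSH checks" >&2
    return 1
  fi
}
```

Callers can then write nodes_or_fallback || run_ssh_checks, where run_ssh_checks stands for whatever recovery routine the script provides.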

API Automation with Python

While Bash is excellent for managing the raw Linux nodes and starting/stopping Kubernetes components, it falls short when you need to configure complex applications running inside those pods. Modern applications (like Jellyseerr or Prowlarr) use REST APIs to manage their internal state.

When building a zero-touch homelab, you eventually hit a wall where Kubernetes has successfully started the pod, but the app itself still requires you to open a web browser and click through a setup wizard to connect it to other apps.

Bridging Kubernetes and REST APIs

We use Python (with the requests library) to bridge the gap between the Kubernetes infrastructure API and the Application REST APIs.

Instead of hardcoding API keys in our scripts, Python can dynamically reach into the cluster, extract secrets directly from running pods using kubectl exec, and instantly inject them into another pod's REST API.

The Workflow:

1. Extract State: Python calls subprocess.run(["kubectl", "exec", "-n", "media", "deploy/radarr", "--", "cat", "/config/config.xml"], capture_output=True, text=True) to read the auto-generated API key from Radarr. The command is passed as a list because subprocess.run does not split a single string into arguments unless shell=True is set.
2. Format Data: Python parses the XML/JSON to isolate the exact key string.
3. Inject State: Python uses requests.post() to send that API key directly to Prowlarr's REST API, instantly authenticating the two services without human intervention.

This pattern elevates the homelab from "automated deployment" to "automated configuration," allowing you to destroy and rebuild the entire media stack in minutes without ever opening a web browser.