Bare-Metal Kubernetes Provisioning

While managed cloud providers (like EKS, GKE, or AKS) hide the complexity of cluster creation behind a single button click, building a bare-metal Kubernetes cluster from scratch exposes you to the true architecture of the system.

This chapter details the exact lifecycle and architecture of bootstrapping a bare-metal cluster using tools like kubeadm and declarative automation (Ansible).

1. The OS Foundation

On bare metal, Kubernetes does not run inside an isolated virtual machine; its containers run as ordinary processes directly on the host Linux kernel. The kernel is the execution environment, so the host OS must be carefully tuned.

Disabling Swap

By default, the kubelet (the Kubernetes node agent) refuses to start if swap is enabled on the host. Kubernetes relies on precise memory accounting to schedule pods efficiently. If the Linux kernel silently moves container memory to a slow swap device, the cluster's resource metrics become inaccurate and applications can thrash unpredictably. Turn swap off for the running system (swapoff -a), then make the change persistent by removing or commenting out the swap entry in /etc/fstab and masking any systemd swap unit, so that swap is not re-enabled at the next boot.
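A minimal sketch of making the swap change persistent, assuming swap is configured through /etc/fstab (systemd swap units and zram setups need different handling). The function name comment_out_swap is illustrative, and it takes the file path as an argument so it can be tried on a copy first.

```shell
#!/usr/bin/env bash
# Comment out every active swap entry in an fstab-style file.
# A backup of the original is kept next to it with a .bak suffix.
comment_out_swap() {
    local fstab="$1"
    # Skip lines that are already comments; match lines whose third
    # field (the filesystem type) is "swap" and prefix them with "# ".
    sed -E -i.bak \
        '/^[[:space:]]*#/! s/^([^[:space:]]+[[:space:]]+[^[:space:]]+[[:space:]]+swap[[:space:]].*)$/# \1/' \
        "$fstab"
}

# Typical use on a node (as root):
#   swapoff -a                     # turn swap off for the running system
#   comment_out_swap /etc/fstab    # keep it off after a reboot
#   swapon --show                  # prints nothing when no swap is active
```

Running the function against a copy of /etc/fstab and diffing the result is a cheap way to confirm that only the swap line changed.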

Kernel Modules and Sysctls

Kubernetes requires specific capabilities from the host kernel:

  • overlay module: Required by the container runtime to use OverlayFS, which layers read-only container images with writable ephemeral layers.
  • br_netfilter module & ip_forward sysctl: br_netfilter makes bridged traffic between containers visible to iptables, which NetworkPolicy enforcement and routing depend on. The net.ipv4.ip_forward sysctl lets the kernel forward packets between interfaces.
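The modules and sysctls above are usually persisted as drop-in files so they survive a reboot. The sketch below assumes the file name k8s.conf and adds the two net.bridge.bridge-nf-call-* sysctls, which are not mentioned above but are commonly set together with br_netfilter. The function name and its root-prefix argument are illustrative; the prefix only exists so the function can be tried outside /etc.

```shell
#!/usr/bin/env bash
# Write the module-load and sysctl drop-ins under an optional root prefix.
# With an empty prefix the files land in /etc (requires root).
write_k8s_kernel_config() {
    local root="${1:-}"
    mkdir -p "$root/etc/modules-load.d" "$root/etc/sysctl.d"

    # Modules to load at boot.
    printf '%s\n' overlay br_netfilter > "$root/etc/modules-load.d/k8s.conf"

    # Sysctls applied at boot (bridge-nf-call-* are an addition, see lead-in).
    cat > "$root/etc/sysctl.d/k8s.conf" <<'EOF'
net.bridge.bridge-nf-call-iptables  = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward                 = 1
EOF
}

# Typical use on a node (as root):
#   write_k8s_kernel_config ""
#   modprobe overlay && modprobe br_netfilter   # load now, without rebooting
#   sysctl --system                             # apply all sysctl drop-ins
```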

2. The Container Runtime (CRI)

Kubernetes itself does not run containers. It delegates that job to a Container Runtime Interface (CRI) compatible engine, such as containerd.

Systemd Cgroups

On a systemd host, the init system already owns the cgroup hierarchy. If the container runtime (containerd) also manages resource limits (cgroups) on its own, the two end up with conflicting views of the same resources and the system becomes unstable under load. You must explicitly configure containerd to delegate cgroup management to systemd by setting SystemdCgroup = true in /etc/containerd/config.toml, and the kubelet must use the same (systemd) cgroup driver.
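One way to apply that setting is a small in-place edit. This is a sketch that assumes the config file already contains a line reading SystemdCgroup = false, as the output of containerd config default does; the function name is illustrative.

```shell
#!/usr/bin/env bash
# Flip "SystemdCgroup = false" to "SystemdCgroup = true" in a containerd
# config file, preserving indentation. A .bak copy of the original is kept.
enable_systemd_cgroup() {
    local cfg="$1"
    sed -E -i.bak \
        's/^([[:space:]]*SystemdCgroup[[:space:]]*=[[:space:]]*)false/\1true/' \
        "$cfg"
}

# Typical use on a node (as root):
#   containerd config default > /etc/containerd/config.toml
#   enable_systemd_cgroup /etc/containerd/config.toml
#   systemctl restart containerd
```

If the key is absent from the file, the edit is a no-op, so it is worth checking the result with grep SystemdCgroup afterwards.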

The Kubelet / CRI Dependency

The kubelet on each node talks to containerd over a Unix socket (/var/run/containerd/containerd.sock). If containerd crashes or goes offline (due to a disk error or misconfiguration), a running kubelet reports the runtime as down and the node turns NotReady; a kubelet that starts without a reachable runtime exits and is restarted by systemd in a loop. Once the kubelet stops posting status, the control plane sets the node's Ready condition to Unknown (reason NodeStatusUnknown). The kubelet cannot function without its runtime.
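When a node misbehaves, a first check is whether the runtime socket exists at all. A minimal sketch: the function name is hypothetical, and the default path is the one quoted above.

```shell
#!/usr/bin/env bash
# Report whether the CRI socket exists and is a Unix socket.
cri_socket_state() {
    local sock="${1:-/var/run/containerd/containerd.sock}"
    if [ -S "$sock" ]; then
        echo "present"
    else
        echo "missing"
    fi
}

# Typical follow-up on a node:
#   cri_socket_state
#   systemctl is-active containerd kubelet
#   journalctl -u kubelet -n 20 --no-pager    # recent kubelet log lines
```

A socket that is present only proves containerd has started; the unit status and the kubelet log show whether the two are actually talking.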

3. Cluster Bootstrap (Kubeadm)

With the OS and runtime prepared, the actual cluster is forged using kubeadm.

Control Plane Initialization

Running kubeadm init on the first node creates the control plane. It generates the cryptographic certificates (PKI), starts the API Server, etcd database, Controller Manager, and Scheduler as static pods, and prints a kubeadm join command containing a bootstrap token for adding further nodes.

High Availability (VIP)

If you plan to run multiple control plane nodes, you cannot point your worker nodes at the IP address of any single one of them (if that node dies, the cluster API becomes unreachable). Instead, you deploy a Virtual IP (VIP) using a tool like kube-vip. The VIP floats between healthy control plane nodes, so the cluster stays reachable at a single, stable IP address.
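With a VIP in place, the cluster is usually initialized against that address rather than the first node's own IP. The helper below only prints the command so it can be reviewed before running. The VIP 192.168.1.100 and the function name are hypothetical, and the sketch assumes the VIP is already answering on the first node (for example through a kube-vip static pod) before kubeadm init runs.

```shell
#!/usr/bin/env bash
# Print (not run) a kubeadm init command that pins the API endpoint to a VIP.
#   --control-plane-endpoint : stable address every node will use for the API
#   --upload-certs           : share control plane certs so other control
#                              plane nodes can join
kubeadm_init_cmd() {
    local vip="$1" port="${2:-6443}"
    printf 'kubeadm init --control-plane-endpoint %s:%s --upload-certs\n' \
        "$vip" "$port"
}

# Example:
#   kubeadm_init_cmd 192.168.1.100
```

Setting the endpoint at init time matters because it is baked into the generated certificates and kubeconfig files; changing it later is considerably more work.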

4. The Container Network Interface (CNI)

When a node successfully joins the cluster via kubeadm join, it will appear in the kubectl get nodes list as NotReady.

This is because a fresh cluster has no idea how to route IP packets between pods on different physical machines. You must install a CNI plugin (like Flannel, Cilium, or Calico).

How CNI Works

  1. You deploy the CNI plugin (usually as a DaemonSet, running one pod on every node).
  2. The CNI pod copies network binaries (like bridge, portmap, flannel) into /opt/cni/bin/ on the host machine.
  3. It writes a configuration file into /etc/cni/net.d/.
  4. containerd reads this directory, initializes the network, and tells the kubelet that the network is ready.
  5. The kubelet reports back to the API Server, and the node flips to Ready.
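Steps 2 and 3 leave visible traces on the host, which makes a stuck NotReady node easy to triage. The sketch below counts plugin binaries and network config files; the function name is hypothetical and the default paths are the ones listed above.

```shell
#!/usr/bin/env bash
# Count CNI plugin binaries and CNI network config files on this host.
cni_state() {
    local bin_dir="${1:-/opt/cni/bin}" conf_dir="${2:-/etc/cni/net.d}"
    local bins confs
    bins="$(find "$bin_dir" -maxdepth 1 -type f 2>/dev/null | wc -l)"
    confs="$(find "$conf_dir" -maxdepth 1 -type f \
        \( -name '*.conf' -o -name '*.conflist' -o -name '*.json' \) \
        2>/dev/null | wc -l)"
    # $(( )) strips the padding some wc implementations add.
    echo "binaries=$((bins)) configs=$((confs))"
}

# Example on a node:
#   cni_state
```

Zero binaries points at step 2 (the DaemonSet pod never ran or could not write to the host), while zero configs points at step 3.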

Path Mismatches

A common pitfall on Debian-based systems is a strict path mismatch. Debian's containerd package looks for CNI plugins in /usr/lib/cni/, but standard CNI installers place them in /opt/cni/bin/.

If containerd cannot find the binaries, it reports a cni plugin not initialized error, and the node stays NotReady until the paths are reconciled. This is resolved by overriding the containerd configuration or creating a symlink between the directories.
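A sketch of the symlink approach, linking each plugin individually so that binaries already present in the target directory are left alone. The function name is illustrative and the defaults are the two paths named above.

```shell
#!/usr/bin/env bash
# Symlink every plugin binary from the installer's directory into the
# directory the distribution's containerd searches.
link_cni_plugins() {
    local src="${1:-/opt/cni/bin}" dst="${2:-/usr/lib/cni}"
    local f
    mkdir -p "$dst"
    for f in "$src"/*; do
        [ -f "$f" ] || continue            # skip when the glob matched nothing
        ln -sfn "$f" "$dst/$(basename "$f")"
    done
}

# Typical use on a node (as root):
#   link_cni_plugins
#   systemctl restart containerd
```

Per-file links need to be refreshed if the CNI DaemonSet later installs additional binaries; linking the whole directory avoids that but replaces whatever the package shipped there.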

5. Power State Operations

One of the unique challenges of bare-metal Kubernetes is that the nodes do not have a hypervisor API (like AWS EC2 or VMware vSphere) that Kubernetes can call to programmatically shut them down or reboot them.

The Two-Layer Shutdown

Because Kubernetes has no control over the physical power state, powering off a cluster requires a coordinated "Two-Layer Shutdown" approach:

  1. Kubernetes Eviction (Layer 1): You must first interact with the Kubernetes API to gracefully drain the workloads from the nodes (kubectl drain). This ensures distributed applications can fail over without data corruption.
  2. OS Kernel Halt (Layer 2): Once the node is empty, you must bypass Kubernetes entirely and interact directly with the Linux OS (via SSH or Ansible) to issue the shutdown -h now command. This flushes the kernel's disk cache and cleanly unmounts the filesystems.

If you skip Layer 1, you risk corrupting your databases, because the applications are killed mid-write. If you skip Layer 2 and just rip the power cord out, you risk corrupting your Ext4 filesystem.

The Systemd Hang (Dirty Ext4 Journals)

Even if you execute the two-layer shutdown perfectly, bare-metal nodes are highly susceptible to a specific shutdown hang caused by the container runtime.

When you issue shutdown -h now, the Linux systemd process begins forcefully tearing down the network stack. However, containerd and the kubelet may still be attempting to cleanly unmount and detach the overlay network namespaces (like Flannel) for the evicted pods. Because the underlying network interfaces are already dead, containerd hangs indefinitely.

This forces systemd to wait for its 90-second or 5-minute timeout. If the node loses power during this ungraceful systemd wait period (e.g., you hard-reboot it out of frustration), the OS kernel never gets the chance to cleanly flush the Ext4 journal to disk.

When the node boots back up, the kernel finds the journal "dirty". Normally it replays the journal and carries on, but if the replay or the boot-time fsck hits errors it cannot repair automatically, the machine drops into Emergency Mode with a Read-Only filesystem, and the node is effectively out of service until someone runs fsck by hand.

Because the filesystem is locked in Read-Only mode, containerd fails to start (it crashes complaining it cannot chmod /var/lib/containerd). Because containerd is dead, the Container Runtime Interface (CRI) socket disappears. This causes the kubelet to immediately crash-loop. Finally, the Control Plane sees the kubelet stop responding and sets the node's Ready condition to Unknown (reason NodeStatusUnknown).

This demonstrates the complete cascading failure: Hard Power Loss → Ext4 Dirty Journal → Read-Only Lockout → Containerd Crash → Kubelet Crash → K8s Node Failure.
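To confirm that a node sits at the Read-Only Lockout stage of that chain, the mount options can be read from /proc/mounts. A minimal sketch: the function name is hypothetical, and the second argument exists only so the check can be tried against a sample file.

```shell
#!/usr/bin/env bash
# Succeed (exit 0) if the given mountpoint is currently mounted read-only.
# Matches "ro" only as a whole option, so "errors=remount-ro" on a
# read-write mount does not count.
mount_is_readonly() {
    local mountpoint="$1" mounts="${2:-/proc/mounts}"
    awk -v mp="$mountpoint" '
        $2 == mp { opts = $4; found = 1 }   # the last matching entry wins
        END { exit ((found && opts ~ /(^|,)ro(,|$)/) ? 0 : 1) }
    ' "$mounts"
}

# Example on a node:
#   if mount_is_readonly /; then echo "root is read-only, run fsck"; fi
```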

To prevent this on bare-metal K8s, always explicitly stop the Kubernetes services before issuing the OS halt:

systemctl stop kubelet containerd && shutdown -h now

Return to Service (Uncordon)

After maintenance is complete and the node boots back up, the kubelet will automatically rejoin the cluster. However, the node remains cordoned (unschedulable) from the earlier kubectl drain. The scheduler will not place any new pods on it until you explicitly mark it as schedulable again:

# Single node
kubectl uncordon k8s-worker-01

# Multiple nodes at once
kubectl uncordon k8s-worker-01 k8s-worker-02

This completes the full bare-metal maintenance lifecycle:

drain  →  stop services  →  shutdown  →  maintenance  →  boot  →  uncordon

Note: Uncordoning does not automatically rebalance existing pods onto the node. It only allows the scheduler to place new pods there. Existing pods on other nodes will stay where they are unless evicted or rescheduled.
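The lifecycle above (drain, stop services, shutdown, boot, uncordon) could be scripted roughly as follows. This is a sketch under several assumptions: the node name doubles as its SSH hostname, the SSH user has sudo rights, and the helper names run, node_down, and node_up are hypothetical. Setting DRY_RUN=1 prints the commands instead of executing them.

```shell
#!/usr/bin/env bash
# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Layer 1 then Layer 2: drain the node, stop the Kubernetes services,
# and halt the OS. Aborts before the shutdown if the drain fails.
node_down() {
    local node="$1"
    run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data || return 1
    run ssh "$node" "sudo systemctl stop kubelet containerd && sudo shutdown -h now"
}

# After the node has booted: wait until it reports Ready, then make it
# schedulable again.
node_up() {
    local node="$1"
    run kubectl wait --for=condition=Ready "node/$node" --timeout=300s || return 1
    run kubectl uncordon "$node"
}

# Example:
#   DRY_RUN=1 node_down k8s-worker-01    # review the commands first
#   node_down k8s-worker-01
#   ... maintenance, power the machine back on ...
#   node_up k8s-worker-01
```

Waiting for the Ready condition before uncordoning avoids handing the scheduler a node whose runtime or CNI has not come back yet.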