Skip to content

How to Safely Shut Down a Bare-Metal K8s Cluster

Unlike managed Kubernetes services (like EKS or GKE) where underlying infrastructure is abstracted, managing a bare-metal cluster means you are directly responsible for stateful workloads and hardware power states. Hard-powering off nodes can lead to corrupted etcd data or orphaned Persistent Volumes.

In our homelab setup, we use a "Two-Layer" graceful shutdown strategy automated via the safe-shutdown.sh script.

The Shutdown Strategy

The shutdown process operates in two distinct phases to ensure software state is saved before hardware state is altered.

Layer 1: Software Drain (kubectl)

Before turning off any machines, we must tell Kubernetes to safely evict running applications.

  1. Identify Workers: We query the Kubernetes API to list all nodes that are not the Control Plane.
  2. Cordon and Drain: For each worker node, we run:
    kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --force
    
    This commands the Kubelet to cleanly terminate pods and gives them time to finish processing requests, rather than just pulling the plug.

Layer 2: Hardware Power-Off (Ansible)

Once the cluster is safely drained, we issue hardware shutdown commands over SSH. Since we manage configuration via Ansible, we leverage its ad-hoc commands.

  1. Shutdown Workers: We target the workers group in our Ansible inventory:
    ansible workers -i ansible/inventory/hosts.yaml -K -b -m command -a "shutdown -h now"
    
  2. Shutdown Control Plane: We shut down the Control Plane node last. This ensures the Kubernetes API remains available to manage the worker drains up until the final moment.
    ansible control_plane -i ansible/inventory/hosts.yaml -K -b -m command -a "shutdown -h now"
    

Running the Process

To safely turn off the Homelab cluster, navigate to the homelab repository on your dev-station and run:

make shutdown-cluster
# or execute the script directly
./scripts/safe-shutdown.sh

You will be prompted to confirm the action. Once confirmed, the script checks for your kubeconfig, safely drains the workers, and issues the shutdown commands. Wait a few moments for the physical machines to halt before unplugging power.