Kubernetes Storage Architecture

Kubernetes is designed to orchestrate containers, not to manage physical hard drives. Because of this, its storage architecture is intentionally heavily abstracted.

Understanding how Kubernetes bridges the gap between a stateless container and a physical spinning disk is critical for debugging mounting failures.

The Abstraction Layers

To prevent developers from needing to know the IP addresses of storage arrays or the physical mount points of servers, Kubernetes splits storage into two distinct objects:

  1. PersistentVolume (PV): Represents the actual physical storage. This is created by the cluster administrator, or automatically by a provisioner when dynamic provisioning is used. It contains the hard technical details (e.g., the NFS server IP address, the iSCSI target, or the local path on a specific node).
  2. PersistentVolumeClaim (PVC): Represents a request for storage. This is created by the developer. It asks for generic requirements (e.g., "I need 100GB of storage that can be read by many pods at once").

When a PVC is created, Kubernetes attempts to bind it to a PV that matches its requirements. Once bound, a Pod can mount the PVC, completely oblivious to the underlying hardware.
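The matching step can be pictured with a small Python sketch. It is a simplification under stated assumptions: the Volume and Claim classes are hypothetical stand-ins rather than Kubernetes API types, only storage class, capacity and access modes are compared, and preferring the smallest fitting volume is a choice made for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Volume:            # hypothetical stand-in for a PersistentVolume
    name: str
    capacity_gi: int
    access_modes: frozenset
    storage_class: str
    bound_to: Optional[str] = None


@dataclass
class Claim:             # hypothetical stand-in for a PersistentVolumeClaim
    name: str
    request_gi: int
    access_modes: frozenset
    storage_class: str


def find_match(claim, volumes):
    """Return the smallest unbound volume that satisfies the claim, or None."""
    candidates = [
        v for v in volumes
        if v.bound_to is None
        and v.storage_class == claim.storage_class
        and v.capacity_gi >= claim.request_gi
        and claim.access_modes <= v.access_modes   # every requested mode is offered
    ]
    return min(candidates, key=lambda v: v.capacity_gi, default=None)


# Placeholder volumes and claim, for illustration only.
volumes = [
    Volume("pv-small", 50, frozenset({"ReadWriteOnce"}), "nfs-client"),
    Volume("pv-shared", 200, frozenset({"ReadWriteOnce", "ReadWriteMany"}), "nfs-client"),
]
claim = Claim("media", 100, frozenset({"ReadWriteMany"}), "nfs-client")

match = find_match(claim, volumes)
if match is not None:
    match.bound_to = claim.name        # a bound volume is no longer available
print(match.name if match else "Pending")
```

If no volume passes every check, the function returns None, which corresponds to the PVC waiting in Pending.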

The Host Dependency

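The most common misconception about Kubernetes storage is that the kubelet can mount any volume type entirely on its own. It cannot. For network filesystems such as NFS, it depends on the client utilities installed on the node.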

When a Pod is scheduled onto a node and requests an NFS volume, the kubelet does not reach out to the NFS server itself. Instead, it delegates the mounting process to the underlying host Operating System (e.g., Debian or Ubuntu).

The kubelet essentially executes a standard Linux mount command on the node. For an NFS volume, this means executing mount -t nfs <server-ip>:/path /var/lib/kubelet/pods/....
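To reproduce a mount by hand while debugging, it can help to assemble the same command yourself. The sketch below only builds and prints the command; it does not run it. The server address, export path and target directory are placeholder values, not taken from any real cluster.

```python
import shlex


def nfs_mount_argv(server, export, target):
    """Build the argv for an NFS mount; nothing is executed here."""
    return ["mount", "-t", "nfs", f"{server}:{export}", target]


# All three values are placeholders for illustration only.
argv = nfs_mount_argv("192.0.2.10", "/export/media", "/mnt/test-nfs")
print(shlex.join(argv))
```

Running the printed command as root on the affected node should surface the same error the pod is hitting, without waiting for the kubelet to retry.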

The fsconfig() failed Trap

Because the mount is delegated to the host OS, the host OS must possess the required storage client utilities.

If you attempt to mount an NFS volume on a worker node that does not have the nfs-common package installed, the host's mount command will fall back to using the raw kernel fsconfig() syscall because it cannot find the mount.nfs helper program. The kernel cannot natively parse the IP:/path syntax, resulting in a cryptic exit status 32 error. The pod's events (kubectl describe pod) report the failed mount, and the pod stays stuck in ContainerCreating until the package is installed on the node.

This principle applies to all network storage:

  • To mount NFS, the node must have nfs-common.
  • To mount Ceph/RBD, the node must have ceph-common.
  • To mount iSCSI, the node must have open-iscsi.
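A quick preflight check run on each node can catch a missing package before a pod ever lands there. The sketch below is a minimal version of that idea. The helper binary names (mount.nfs, rbd, iscsiadm) are the ones these Debian/Ubuntu packages normally ship, but treat the mapping and the search path as assumptions to verify on your own distribution.

```python
import shutil

# Helper binary each package is expected to provide (Debian/Ubuntu naming).
REQUIRED_HELPERS = {
    "nfs-common": "mount.nfs",
    "ceph-common": "rbd",
    "open-iscsi": "iscsiadm",
}

# Mount helpers often live in sbin directories that may be missing from PATH.
SEARCH_PATH = "/sbin:/usr/sbin:/bin:/usr/bin"


def missing_packages(which=shutil.which):
    """Return the packages whose helper binary cannot be found."""
    return sorted(
        pkg for pkg, helper in REQUIRED_HELPERS.items()
        if which(helper, path=SEARCH_PATH) is None
    )


missing = missing_packages()
print("missing:", ", ".join(missing) if missing else "none")
```

The lookup function is passed in as a parameter so the check can be exercised without touching the real filesystem.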

Network Storage vs. Local Storage

Local Storage (local-path)

When using local node storage (like the Rancher local-path-provisioner), the data is physically written to the SSD of a specific worker node.

  • Advantage: Fast, low-latency I/O, because reads and writes never leave the node.
  • Disadvantage: Local storage inextricably ties a Pod to a specific physical machine. The Pod cannot take its data to another node, so if that machine goes down the Pod cannot start elsewhere with its data.

Network Storage (NFS/Ceph)

When using network attached storage, the data lives on a central server and is mounted over the network.

  • Advantage: Pods can be rescheduled to any node in the cluster and still access their data. With a shared filesystem such as NFS or CephFS, multiple pods can read and write to the same volume simultaneously (ReadWriteMany).
  • Disadvantage: Network latency slows down I/O operations, which often makes it a poor fit for latency-sensitive workloads such as high-performance databases.

Static vs. Dynamic Provisioning

Static Provisioning (The Old Way)

In static provisioning, an administrator manually creates PersistentVolume objects. When a developer creates a PVC, Kubernetes tries to find an existing PV that fits. If none exists, the PVC stays in a "Pending" state until someone creates a matching PV. This requires constant manual intervention from the cluster administrator.

Dynamic Provisioning (The Modern Way)

Modern clusters use Dynamic Provisioning. Instead of creating PVs manually, the administrator installs a Storage Provisioner and defines a StorageClass.

When a developer creates a PVC requesting a specific StorageClass, the provisioner intercepts the request, automatically creates the physical storage (e.g., creates a new directory on an NFS server, or an EBS volume in AWS), and then automatically generates the PV and binds it to the PVC.
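From the developer's side, the whole interaction is a single PVC that names a StorageClass. The sketch below builds such a claim as a plain Python dict and prints it as JSON. The claim name and size are placeholder values, and nfs-client is assumed to be the name of an installed StorageClass.

```python
import json


def pvc_manifest(name, storage_class, size, access_mode="ReadWriteOnce"):
    """Build a PersistentVolumeClaim manifest as a plain dict."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "storageClassName": storage_class,
            "accessModes": [access_mode],
            "resources": {"requests": {"storage": size}},
        },
    }


# "media-data" and "100Gi" are placeholder values.
manifest = pvc_manifest("media-data", "nfs-client", "100Gi", "ReadWriteMany")
print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON as well as YAML, the printed output can be piped into kubectl apply -f - to create the claim.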

In your bare-metal homelab, you use two types of dynamic provisioners:

  1. Local Path Provisioner (local-path): Automatically creates folders on the high-speed local SSDs.
  2. NFS Subdir External Provisioner (nfs-client): Automatically creates sub-directories on a central NFS server.

Common Storage Traps

The NAS root_squash Provisioning Trap

When using the NFS Subdir External Provisioner with a dedicated NAS like TrueNAS, the provisioner itself runs as the root user to dynamically create directories on the NFS share. By default, TrueNAS maps all incoming root connections to the unprivileged nobody user for security ("Root Squash").

Because the provisioner is squashed to nobody, it receives a Permission Denied error when trying to create a directory for a new PVC, leaving the PVC permanently stuck in Pending.

To fix this, you must configure your NAS NFS export to explicitly map the root user to root (disabling Root Squash for that specific share), and ensure the underlying ZFS dataset has permissive Unix permissions or ACLs allowing root access.
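The effect of Root Squash on the provisioner can be modeled in a few lines. This is a deliberately simplified sketch: it looks only at the owner and "other" write bits, ignores groups, execute bits and ACLs, and assumes nobody is UID 65534, which is common but should be checked on your NAS.

```python
NOBODY_UID = 65534  # common UID for "nobody"; an assumption, check your NAS


def squashed_uid(client_uid, root_squash):
    """Return the UID the NFS server acts as for a given client UID."""
    if root_squash and client_uid == 0:
        return NOBODY_UID
    return client_uid


def can_create_subdir(client_uid, dir_owner_uid, dir_mode, root_squash):
    """Simplified check: owner and 'other' write bits only, no groups or ACLs."""
    uid = squashed_uid(client_uid, root_squash)
    if uid == 0:
        return True                       # unsquashed root bypasses mode bits
    if uid == dir_owner_uid:
        return bool(dir_mode & 0o200)     # owner write bit
    return bool(dir_mode & 0o002)         # "other" write bit


# Export root owned by root with mode 0755, provisioner running as UID 0.
print(can_create_subdir(0, 0, 0o755, root_squash=True))   # False
print(can_create_subdir(0, 0, 0o755, root_squash=False))  # True
```

With squashing on, the provisioner arrives as nobody and fails the write check. With squashing off, it arrives as root and succeeds.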

The Container Application Permission Trap

When using the NFS Subdir External Provisioner, the provisioner dynamically creates a new folder on the NFS server for each PVC. By default, these folders are created as root:root with standard permissions (e.g., 0755).

Many popular container images (such as the linuxserver.io ecosystem) drop root privileges at startup and run as an unprivileged internal user (often UID 1000 or user abc). When this unprivileged container attempts to write to its newly mounted NFS volume, it receives a Permission Denied error because it cannot write to a root-owned directory.

To fix this, you must either:

  1. SSH into the NFS server and manually change the directory ownership to match the container's UID (e.g., chown -R 1000:1000 /mnt/media).
  2. Use an initContainer running as root to chown the mounted volume before the main application starts.
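The initContainer approach could look roughly like the fragment built below. This is a sketch rather than a drop-in spec: the container name, the busybox image and the /data mount path are illustrative choices, and the volume name must match a volume declared in your own pod spec.

```python
import json


def chown_init_container(volume_name, uid, gid, mount_path="/data"):
    """Build an initContainer spec that chowns the mounted volume as root."""
    return {
        "name": "fix-permissions",                    # illustrative name
        "image": "busybox",                           # any image with chown works
        "command": ["chown", "-R", f"{uid}:{gid}", mount_path],
        "securityContext": {"runAsUser": 0},          # chown needs root
        "volumeMounts": [{"name": volume_name, "mountPath": mount_path}],
    }


# "media" is a placeholder volume name; 1000:1000 matches the example UID above.
init = chown_init_container("media", 1000, 1000)
print(json.dumps({"initContainers": [init]}, indent=2))
```

The printed fragment belongs under the pod's spec. On an NFS share, this chown is itself performed over the network as root, so it is likely to fail if Root Squash is still enabled for the share.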

The PV claimRef Deadlock

When a PersistentVolume is created with a ReclaimPolicy of Retain, Kubernetes intentionally refuses to delete the physical data when the PVC is deleted.

However, this creates a deadlock: once you delete the PVC, the PV moves to the Released phase, and if you then create a new PVC with the exact same name, the PV will refuse to bind to it. The PV remembers the internal UID of the original PVC in its claimRef field, so it treats the new PVC as an imposter trying to hijack the old data.

To reuse a Retained PV, you must forcefully wipe its memory of the old PVC: kubectl patch pv <pv-name> -p '{"spec":{"claimRef":null}}'
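When several retained PVs need the same treatment, the patch command can be generated instead of typed. The sketch below only builds and prints the command; pv-media is a placeholder PV name.

```python
import json
import shlex


def clear_claim_ref_argv(pv_name):
    """Build the kubectl argv that removes claimRef from a PV."""
    patch = json.dumps({"spec": {"claimRef": None}}, separators=(",", ":"))
    return ["kubectl", "patch", "pv", pv_name, "-p", patch]


# "pv-media" is a placeholder name.
argv = clear_claim_ref_argv("pv-media")
print(shlex.join(argv))
```

Python's None serializes to JSON null, which is what removes the field, and shlex.join adds the shell quoting around the patch.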

Orphaned Node-Bound PVCs

When using local-path provisioner (or any host-bound storage), the PV is hard-coded with a nodeAffinity rule specifying the exact physical server where the data lives.

If that physical server dies permanently, Kubernetes will try to reschedule the pod onto a healthy node. However, the pod will stay in Pending with a volume node affinity conflict error. Its PVC is still bound to the old PV, and that PV is permanently pinned to the dead node, so no healthy node can satisfy it.
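The conflict boils down to an empty intersection between the nodes the PV allows and the nodes that are still alive. The sketch below illustrates that reasoning only; the node names are hypothetical and the real scheduler evaluates full nodeAffinity selector terms rather than a plain list of hostnames.

```python
def schedulable_nodes(pv_allowed_hostnames, ready_nodes):
    """Nodes that are both Ready and permitted by the PV's nodeAffinity."""
    return sorted(set(pv_allowed_hostnames) & set(ready_nodes))


# Hypothetical node names: the PV was provisioned on worker-2, which has died.
pv_allowed = ["worker-2"]
ready = ["worker-1", "worker-3"]

nodes = schedulable_nodes(pv_allowed, ready)
print(nodes if nodes else "Pending: volume node affinity conflict")
```

As long as the PV lists only the dead node, the result stays empty no matter how many healthy nodes join the cluster.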

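You must delete the old PVC (abandoning the lost data), recreate it, and restart the pod so the provisioner generates a brand new PV on a healthy node. If the pod belongs to a StatefulSet, the PVC is recreated from its volumeClaimTemplates when the pod is recreated.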