Ansible

Ansible is a declarative Configuration Management and Infrastructure as Code (IaC) tool. Unlike a Bash script, which is imperative, an Ansible playbook declares the desired state of your servers, and Ansible works out how to get them there.

Core Concepts

Declarative vs. Imperative

Imperative (Bash): You write the exact commands to execute (apt-get install -y containerd). Whether a second run is safe depends on each command: apt-get install happens to tolerate it, but many commands (mkdir, useradd, appending a line to a file) fail or cause unintended side effects when repeated.
Declarative (Ansible): You declare the state you want (containerd must be installed). Ansible checks the current state and does nothing if the package is already installed. This property is called idempotency.

Inventory

The Inventory is a file (typically hosts.yaml or hosts.ini) that defines the servers Ansible will manage. It groups servers by role, allowing you to target specific machines easily.

Example hosts.yaml:

all:
  children:
    control_plane:
      hosts:
        k8s-cp-01:
          ansible_host: 192.168.1.51
    workers:
      hosts:
        k8s-worker-01:
          ansible_host: 192.168.1.52
  vars:
    ansible_user: leva

Playbooks & Tasks

A Playbook is a YAML file containing one or more plays. Each play maps a group of hosts from your inventory to an ordered list of Tasks, and each task calls an Ansible module (like apt, systemd, or file).

Example Task:

- name: Ensure containerd is installed
  ansible.builtin.apt:
    name: containerd
    state: present
    update_cache: yes
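
A task only runs as part of a play, which supplies the target hosts. As a minimal sketch, the task above could sit in a playbook like the following. The filename site.yaml and the play name are assumptions for illustration; the workers group comes from the example inventory.

```yaml
# site.yaml (hypothetical filename)
- name: Prepare Kubernetes worker nodes
  hosts: workers            # group defined in the example hosts.yaml
  become: true              # installing packages needs root
  tasks:
    - name: Ensure containerd is installed
      ansible.builtin.apt:
        name: containerd
        state: present
        update_cache: true
```

It could then be run with ansible-playbook -i hosts.yaml site.yaml.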

Handling Modern Ansible Warnings

When running Ansible, you might encounter warnings due to changes in the ecosystem or Python environment.

Python Interpreter Discovery

Ansible connects via SSH and attempts to find the Python interpreter on the target host. If it's unsure which version to use, it warns you.

Fix: Explicitly define ansible_python_interpreter: /usr/bin/python3 in your inventory variables.
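
As a sketch, the variable can be added to the vars block of the example hosts.yaml so that it applies to every host. The path /usr/bin/python3 is the one named above; adjust it if your nodes keep Python elsewhere.

```yaml
# hosts.yaml (fragment): variables applied to all hosts
all:
  vars:
    ansible_user: leva
    ansible_python_interpreter: /usr/bin/python3   # skips interpreter discovery
```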

Legacy Facts

Ansible gathers "facts" (system information) about the target host before running tasks. Older versions of Ansible injected these facts directly as variables (e.g., ansible_distribution). Modern Ansible namespaces them under the ansible_facts dictionary (e.g., ansible_facts['distribution']).

Fix: Set inject_facts_as_vars = False in the [defaults] section of your ansible.cfg to disable legacy fact injection and remove the warning. You must then update your playbooks to read facts from the ansible_facts dictionary.
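
A minimal sketch of what the playbook update looks like. The task itself is only an illustration; the point is the when condition, which reads the fact from the ansible_facts dictionary instead of the injected variable.

```yaml
# Legacy form (relies on fact injection):
#   when: ansible_os_family == "Debian"
- name: Install containerd on Debian-family hosts only
  ansible.builtin.apt:
    name: containerd
    state: present
  when: ansible_facts['os_family'] == "Debian"   # namespaced form
```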

Practical Playbook Patterns

The SSH "Chicken and Egg" Problem

By default, Ansible connects to remote Linux nodes over SSH. However, a freshly installed bare-metal server often has only password authentication available, and Ansible doesn't know your password unless you tell it.

The Problem: You want to automate SSH key distribution so Ansible can run securely without passwords, but you can't run Ansible until the keys are distributed!
The Solution (sshpass): You can install the sshpass utility on your control machine, and pass --ask-pass (or -k) to ansible-playbook. This allows Ansible to temporarily use password authentication to connect to the fresh node, run the ansible.posix.authorized_key module to inject your public key, and then exit. All future playbooks can then run passwordless!
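
A bootstrap playbook along these lines is one way to do it. This is a sketch under a few assumptions: the ansible.posix collection is installed on the control machine, your public key lives at ~/.ssh/id_ed25519.pub, and the filename bootstrap-ssh.yaml is hypothetical.

```yaml
# bootstrap-ssh.yaml (hypothetical filename)
- name: Bootstrap SSH key access on fresh nodes
  hosts: all
  gather_facts: false
  tasks:
    - name: Authorize the control machine's public key
      ansible.posix.authorized_key:
        user: "{{ ansible_user }}"      # leva in the example inventory
        state: present
        # Assumed key location on the control machine
        key: "{{ lookup('ansible.builtin.file', lookup('ansible.builtin.env', 'HOME') + '/.ssh/id_ed25519.pub') }}"
```

Run it once with password authentication, for example ansible-playbook -i hosts.yaml bootstrap-ssh.yaml --ask-pass. Later runs can drop --ask-pass.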

Forcing Shell Idempotency

While Ansible's native modules (like apt or file) are designed to be idempotent, you sometimes have to fall back to the ansible.builtin.shell or ansible.builtin.command modules to run custom commands or scripts. By default, Ansible reports such a task as "changed" every time it runs, because it cannot know what the command did. That breaks idempotency reporting and can trigger handlers needlessly.

Fix 1 (creates): If your shell command generates a file, pass the creates: /path/to/file argument. Ansible will skip the task if the file already exists.
Fix 2 (changed_when: false): If the command is safe to repeat and its effect does not need to be reported (like re-creating a symlink with ln -sf), you can append changed_when: false to the task. This tells Ansible "run this command, but don't report it as a system change." Keep in mind that such a task will never notify a handler.
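
Both fixes can be sketched as follows. The commands and paths are illustrative assumptions: kubeadm init is used because it writes /etc/kubernetes/admin.conf on success, and the /opt/app paths are hypothetical.

```yaml
# Fix 1: skipped entirely once the file exists
- name: Initialize the control plane once
  ansible.builtin.command:
    cmd: kubeadm init
    creates: /etc/kubernetes/admin.conf

# Fix 2: runs every time, but is always reported as "ok"
- name: Point the "current" symlink at the active release
  ansible.builtin.command:
    cmd: ln -sfn /opt/app/releases/v2 /opt/app/current   # hypothetical paths
  changed_when: false
```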

Handling Handlers and State

Ansible Handlers are special tasks that only run when notified by another task that has resulted in a "changed" state. This is highly efficient for restarting services (like containerd or nginx) only when their configuration file is actually modified.

The Problem: If you use the ansible.builtin.command or shell modules to modify a configuration (e.g., nvidia-ctk runtime configure) and control its change state with changed_when, a condition that never matches the command's real output evaluates to false on every run. A misspelled string, or a test such as changed_when: "'Configuring' in my_command.stdout" when the tool prints that message to stderr, is enough. Ansible then assumes the task made no changes and silently skips notifying your handler, leaving the service running with stale configuration.
The Solution: Always test your changed_when conditions against the command's actual output. If you are unsure of the exact string, it is often safer to register the result (register: my_command) and rely on the return code: changed_when: my_command.rc == 0. This reports a change on every successful run, so the handler always fires.
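
Put together, the return-code approach might look like the sketch below. The --runtime=containerd argument, the play name, and the handler name are assumptions for illustration.

```yaml
- name: Configure the NVIDIA runtime for containerd
  hosts: workers
  become: true
  tasks:
    - name: Run nvidia-ctk runtime configure
      ansible.builtin.command:
        cmd: nvidia-ctk runtime configure --runtime=containerd
      register: my_command
      changed_when: my_command.rc == 0   # every successful run counts as a change
      notify: Restart containerd         # must match the handler name exactly

  handlers:
    - name: Restart containerd
      ansible.builtin.systemd:
        name: containerd
        state: restarted
```

The trade-off is that containerd restarts on every run, which is usually preferable to a restart that is silently skipped.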

Privilege Escalation (`become` and `-K`)

By default, Ansible runs tasks as the SSH user you connected with (e.g., leva). If a task requires root privileges (like installing a package), you must add become: yes to the task or playbook.

However, if the remote user's /etc/sudoers configuration requires a password to execute sudo (i.e., they do not have NOPASSWD:ALL), the play fails on the first privileged task with a Missing sudo password error.

The Fix: You must pass the -K (or --ask-become-pass) flag to your ansible-playbook or ansible ad-hoc command. This will prompt you in your terminal to securely type the sudo password before execution begins, allowing Ansible to successfully elevate privileges.
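
A short sketch of task-level escalation: the first task runs as the SSH user, the second as root. The kubelet service is only an example.

```yaml
- name: Show which user the connection runs as
  ansible.builtin.command:
    cmd: whoami
  changed_when: false        # read-only command

- name: Restart kubelet as root
  ansible.builtin.systemd:
    name: kubelet
    state: restarted
  become: true               # only this task is elevated
```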

Ansible Ad-Hoc Commands

While playbooks are great for defining state, sometimes you just need to execute a one-off task across a fleet of servers immediately, such as initiating a cluster-wide shutdown. Instead of writing a full playbook, you can use an Ad-Hoc command:

ansible workers -i hosts.yaml -b -K -m command -a "shutdown -h now"

-b means "become" (run with sudo privileges).
-K prompts you for the sudo password.
-m command specifies the module to run.
-a "..." provides the arguments to the module.

Wrapping Ad-Hoc Commands in Shell Scripts

While Ad-Hoc commands are powerful, typing them out repeatedly for complex recovery scenarios (like fixing a locked filesystem) is error-prone. A common DevOps pattern is to wrap Ansible Ad-Hoc commands inside a Bash script.

This gives you the best of both worlds: the remote-execution power and inventory targeting of Ansible, with the dynamic variable parsing (like reading .env files) and conditional logic of Bash.

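#!/usr/bin/env bash
# Abort on errors, unset variables, and failed pipeline stages.
set -euo pipefail

# The inventory hostname is the first argument.
NODE="${1:?usage: $0 <node>}"
# Read the sudo password from a .env line of the form PASS_<node>=<password> to bypass the -K prompt.
# The anchored pattern avoids partial matches; -f2- keeps any '=' characters inside the password.
PASS="$(grep "^PASS_${NODE}=" .env | head -n1 | cut -d '=' -f2-)"

echo "Recovering node ${NODE}..."
# Note: values passed with -e are visible in the local process list while the command runs.
ansible "$NODE" -i inventory.yaml -b -m shell -a "systemctl restart kubelet" -e "ansible_become_pass=${PASS}"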

Asynchronous Execution

Sometimes an ad-hoc command needs to be "fire-and-forget", such as initiating a cluster shutdown sequence where you don't want Ansible hanging while waiting for the node to fully power off. You can run tasks asynchronously using -B (background) and -P (polling).

ansible workers -i hosts.yaml -b -m shell -a "sleep 2 && shutdown -h now" -B 1 -P 0

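-B 1 tells Ansible to run the job in the background with a time limit of 1 second. The limit applies to the job's total runtime, so for longer commands choose a value that comfortably exceeds the expected duration.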
-P 0 tells Ansible to never poll for the result. It fires the command and immediately disconnects, letting the command run to completion on its own.
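
Inside a playbook, the same fire-and-forget behaviour is expressed with the async and poll task keywords. A minimal sketch, with a 30-second limit chosen here as an assumption to leave headroom for the sleep:

```yaml
- name: Shut down the node without waiting for the result
  ansible.builtin.shell:
    cmd: sleep 2 && shutdown -h now
  async: 30     # maximum runtime in seconds (like -B)
  poll: 0       # do not wait for or poll the result (like -P 0)
  become: true
```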

Safe Playbook Testing (Sandboxing)

Developing Ansible playbooks directly against production or bare-metal nodes is risky. A poorly written task could easily corrupt a production OS or bring down a cluster.

A widely followed practice is to test Infrastructure-as-Code (IaC) against disposable sandboxes before running it on real hardware.

The Solution (Vagrant + VirtualBox): You can scaffold a local, disposable environment (like a labs/ directory) containing a Vagrantfile. This allows you to quickly spin up a local VM (vagrant up) that closely mirrors your production OS (e.g., Debian 13).
You can write, test, and break your playbook against the local VM. Once a second run reports no changes and the VM ends up in the state you expect, you point your inventory file at the real bare-metal cluster and deploy with far more confidence.
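
A sandbox inventory could look like the sketch below. Every value is an assumption based on common Vagrant defaults with the VirtualBox provider (forwarded port 2222, user vagrant, a machine named default); running vagrant ssh-config shows the real values for your VM. The host name and file path are hypothetical.

```yaml
# labs/hosts.yaml (hypothetical path)
all:
  children:
    workers:                       # same group name as production, so playbooks need no changes
      hosts:
        sandbox-worker-01:
          ansible_host: 127.0.0.1
          ansible_port: 2222       # assumed forwarded SSH port
          ansible_user: vagrant
          ansible_ssh_private_key_file: .vagrant/machines/default/virtualbox/private_key
```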

Handling Disconnects

When performing operations that modify the network or power state (like issuing a shutdown command or restarting the SSH daemon), you may encounter an edge case where Ansible throws an UNREACHABLE or Connection Refused error.

This happens because the remote host successfully processed the command and immediately severed the SSH socket before Ansible could receive the "success" return code. In bash scripts, you can safely catch these expected disconnects by appending || true to your ad-hoc commands, ensuring your automation doesn't crash from a false-positive failure. Because || true also hides genuine failures, use it only on commands where a disconnect is the expected outcome.
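
Within a playbook, a rough analogue of || true is to mark the individual task as tolerant of failure. This is a sketch rather than a recommendation for every case, since both keywords suppress real errors on that task too.

```yaml
- name: Power off the node (the SSH connection is expected to drop)
  ansible.builtin.command:
    cmd: shutdown -h now
  become: true
  ignore_unreachable: true   # an UNREACHABLE result does not remove the host from the play
  ignore_errors: true        # a failed result is reported but does not stop the play
```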

Handling Package Manager Overwrites (The `sqv` Bug)

When you deploy custom wrappers over system binaries (e.g., bypassing Debian 13's strict sqv Sequoia OpenPGP checks by wrapping the binary), standard package manager upgrades (like apt-get upgrade) will often overwrite your wrapper with the newly compiled upstream binary. This immediately breaks future Ansible apt tasks.

The Solution (pre_tasks): In your playbooks, define a pre_tasks block that explicitly checks for your wrapper's existence and forces its re-application before any roles or tasks execute that rely on the package manager.
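
A minimal sketch of the pattern. The wrapper source files/sqv-wrapper.sh and the destination /usr/bin/sqv are hypothetical placeholders for however your wrapper is actually installed. The copy module compares checksums, so it rewrites the file only when an upgrade has replaced it.

```yaml
- name: Configure nodes
  hosts: all
  become: true
  pre_tasks:
    - name: Re-apply the sqv wrapper if a package upgrade replaced it
      ansible.builtin.copy:
        src: files/sqv-wrapper.sh     # hypothetical wrapper script in the project
        dest: /usr/bin/sqv            # assumed location of the wrapped binary
        owner: root
        group: root
        mode: "0755"
  tasks:
    - name: Ensure containerd is installed (apt now works again)
      ansible.builtin.apt:
        name: containerd
        state: present
        update_cache: true
```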

Bypassing Interactive Prompts (`vars_prompt`)

While vars_prompt is a good way to enter passwords interactively without storing them in the playbook, it breaks CI/CD and fully automated workflows like Makefiles, because it halts execution waiting for human input.

The Solution (--extra-vars + .env): Remove vars_prompt entirely. Instead, store your secrets in a local .env file that is listed in .gitignore. Use a Makefile or wrapper script to source the .env file and pass the secrets as command-line arguments. Be aware that command-line arguments are visible in the local process list while the playbook runs.

ansible-playbook playbook.yaml -e "tailscale_auth_key=${TAILSCALE_AUTH_KEY}"
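
On the playbook side, it helps to fail early when the variable was not supplied and to keep the secret out of the logs. A sketch, assuming the key is consumed by tailscale up; the play and task names are illustrative.

```yaml
- name: Join the tailnet
  hosts: all
  become: true
  tasks:
    - name: Fail early if the auth key was not passed with -e
      ansible.builtin.assert:
        that:
          - tailscale_auth_key is defined
          - tailscale_auth_key | length > 0
        fail_msg: "tailscale_auth_key is missing; pass it with --extra-vars"

    - name: Authenticate with Tailscale
      ansible.builtin.command:
        cmd: "tailscale up --auth-key={{ tailscale_auth_key }}"
      no_log: true             # keep the secret out of task output
```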

Troubleshooting Common Errors

`UNREACHABLE` does not always mean network failure

A classic Ansible error is: UNREACHABLE! => {"msg": "Failed to create temporary directory... exited with result 1"}

While this often implies the remote node is powered off or behind a firewall, if you can still SSH into the node manually, the error means Ansible connected but could not write to the remote disk. Typical causes:

The disk is 100% full: mkdir fails because there are zero bytes remaining.
Read-Only Lockdown: The remote Linux kernel detected a hardware fault (like a corrupted SSD) and forcefully mounted the root partition as Read-Only to prevent data loss. Ansible cannot write its standard modules to ~/.ansible/tmp.
Permissions: The ~/.ansible directory on the remote node is accidentally owned by root.
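
The three causes can be checked without uploading a module by using the raw module, which runs a command directly over SSH and needs no temporary directory. A diagnostic sketch; the commands chosen are assumptions about what is useful to inspect.

```yaml
- name: Diagnose "Failed to create temporary directory"
  hosts: all
  gather_facts: false          # fact gathering would hit the same temp-directory error
  tasks:
    - name: Check free space, root mount flags, and ownership of ~/.ansible
      ansible.builtin.raw: df -h / ; grep ' / ' /proc/mounts ; ls -ld ~/.ansible
      register: diag
      changed_when: false
      failed_when: false       # a missing ~/.ansible should not abort the diagnosis

    - name: Show the output
      ansible.builtin.debug:
        var: diag.stdout_lines
```

A root mount line containing ro instead of rw points at the read-only lockdown, and 100% in the df output points at a full disk.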

Asynchronous Tasks on Read-Only Filesystems

When you run an asynchronous task (-B 1), Ansible also writes job tracking files to a separate directory, ~/.ansible_async (e.g., /root/.ansible_async if running with become: true), which is not covered by the regular temp directory settings.

If you are attempting to run an emergency shutdown script on a node that has fallen into a Read-Only lockdown, the async task will crash with [Errno 30] Read-only file system because it cannot create the tracking directory on the locked root drive.

The Fix: Explicitly pass the ansible_async_dir variable and point it to the /tmp directory. Since /tmp is often mounted entirely in RAM (tmpfs), it remains writable even when the underlying SSD locks itself.

ansible all -b -m shell -e ansible_async_dir=/tmp/.ansible_async -a "shutdown -h now" -B 1 -P 0

macOS Sandbox Permissions (Local Temp Error)

If you run Ansible from a macOS control node (like a MacBook), you may suddenly encounter: Unhandled exception when retrieving 'DEFAULT_LOCAL_TMP'

The Cause: macOS implements aggressive sandboxing and SIP (System Integrity Protection). In some environments or after OS updates, Python/Ansible is blocked from creating temporary script files in the default ~/.ansible/tmp directory on your local machine.
The Fix: Create an ansible.cfg file at the root of your project to override both the local and remote temp directories with paths under the world-writable /tmp/ directory:

[defaults]
local_tmp = /tmp/ansible-local
remote_tmp = /tmp/ansible-remote