Ansible
Ansible is a declarative Configuration Management and Infrastructure as Code (IaC) tool. Unlike Bash scripting, which is imperative, Ansible lets you declare the desired state of your servers and works out how to get them there.
Core Concepts
Declarative vs. Imperative
- Imperative (Bash): You write the exact commands to execute (apt-get install -y containerd). If you run the script twice, it might fail or create unintended side effects.
- Declarative (Ansible): You declare the state you want (containerd must be installed). Ansible checks the current state; if it's already installed, it does nothing. This is called Idempotency.
Inventory
The Inventory is a file (typically hosts.yaml or hosts.ini) that defines the servers Ansible will manage. It groups servers by role, allowing you to target specific machines easily.
Example hosts.yaml:
all:
  children:
    control_plane:
      hosts:
        k8s-cp-01:
          ansible_host: 192.168.1.51
    workers:
      hosts:
        k8s-worker-01:
          ansible_host: 192.168.1.52
  vars:
    ansible_user: leva
Playbooks & Tasks
A Playbook is a YAML file containing one or more plays; each play lists the Tasks to execute on a set of hosts defined in your inventory. Each task calls an Ansible module (like apt, systemd, or file).
Example Task:
- name: Ensure containerd is installed
  ansible.builtin.apt:
    name: containerd
    state: present
    update_cache: yes
Handling Modern Ansible Warnings
When running Ansible, you might encounter warnings due to changes in the ecosystem or Python environment.
Python Interpreter Discovery
Ansible connects via SSH and attempts to find the Python interpreter on the target host. If it's unsure which version to use, it warns you.
- Fix: Explicitly define ansible_python_interpreter: /usr/bin/python3 in your inventory variables.
Legacy Facts
Ansible gathers "facts" (system information) about the target host before running tasks. Older versions of Ansible injected these facts directly as variables (e.g., ansible_distribution). Modern Ansible namespaces them under the ansible_facts dictionary (e.g., ansible_facts['distribution']).
- Fix: Set inject_facts_as_vars = False in your ansible.cfg to disable legacy fact injection and remove the warning. You must then update your playbooks to use the ansible_facts dictionary.
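Before disabling fact injection, it can help to find playbooks that still reference top-level fact variables. The following is a minimal sketch rather than a complete audit: the function name find_legacy_facts is hypothetical, and the list of fact names is a small, non-exhaustive sample.

```shell
# Hypothetical helper: list lines that still use injected fact variables
# such as ansible_distribution instead of ansible_facts['distribution'].
# The fact names below are a sample only; extend the list for your playbooks.
find_legacy_facts() {
  # $1: directory containing playbooks and roles (defaults to the current directory)
  grep -rnwE 'ansible_(distribution|distribution_version|distribution_release|os_family|hostname|fqdn|architecture|kernel|pkg_mgr|default_ipv4|memtotal_mb|processor_vcpus)' "${1:-.}"
}

# Usage (prints file:line:content for each hit; exits non-zero when nothing is found):
#   find_legacy_facts playbooks/
```

The -w flag keeps the match to whole words, so connection variables such as ansible_host and the ansible_facts dictionary itself are not reported.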
Practical Playbook Patterns
The SSH "Chicken and Egg" Problem
By default, Ansible connects to remote Linux nodes over SSH. However, a freshly installed bare-metal server typically has only password authentication available, and Ansible does not know your password unless you supply it.
- The Problem: You want to automate SSH key distribution so Ansible can run securely without passwords, but you can't run Ansible until the keys are distributed!
- The Solution (sshpass): You can install the sshpass utility on your control machine, and pass --ask-pass (or -k) to ansible-playbook. This allows Ansible to temporarily use password authentication to connect to the fresh node, run the ansible.posix.authorized_key module to inject your public key, and then exit. All future playbooks can then run passwordless!
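The bootstrap step can also be sketched as a one-off ad-hoc call instead of a playbook. This is an illustration under assumptions: the function name bootstrap_ssh_key and its arguments are hypothetical, sshpass and the ansible.posix collection are assumed to be installed on the control machine, and the node's host key is assumed to be accepted already, since password logins can be refused while host key checking is still pending.

```shell
# Sketch: push a public key to fresh nodes using password authentication once.
bootstrap_ssh_key() {
  # $1: inventory file, $2: remote user, $3: path to the public key on the
  # control machine (all three are illustrative parameters)
  ansible all -i "$1" --ask-pass \
    -m ansible.posix.authorized_key \
    -a "user=$2 state=present key=\"{{ lookup('file', '$3') }}\""
}

# Usage (prompts for the SSH password, then installs the key):
#   bootstrap_ssh_key hosts.yaml leva "$HOME/.ssh/id_ed25519.pub"
```

The file lookup runs on the control machine, so the key path refers to your local disk, not to the remote node.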
Forcing Shell Idempotency
While Ansible's native modules (like apt or file) are idempotent, you sometimes have to fall back to the ansible.builtin.shell or ansible.builtin.command modules to run custom bash scripts. By default, Ansible cannot tell whether such a command changed anything, so it reports the task as changed on every run, which defeats your idempotency checks.
- Fix 1 (creates): If your shell command generates a file, pass the creates: /path/to/file argument. Ansible will skip the task if the file already exists.
- Fix 2 (changed_when: false): If your command is purely structural (like creating a symlink with ln -sf), you can append changed_when: false to the task. This tells Ansible "run this command, but don't report it as a system change."
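Both fixes can be seen side by side in a small playbook. The sketch below writes one to a scratch directory so nothing in your project is overwritten; the play name, the commands, and every path inside the playbook are hypothetical.

```shell
# Write an illustrative playbook to a scratch directory.
demo_dir="$(mktemp -d)"
cat > "$demo_dir/shell_idempotency.yaml" <<'EOF'
- name: Shell idempotency sketch
  hosts: all
  become: true
  tasks:
    # Fix 1: skipped on later runs because the file already exists.
    - name: Generate a node report once
      ansible.builtin.shell: uname -a > /var/tmp/node_report.txt
      args:
        creates: /var/tmp/node_report.txt

    # Fix 2: always runs, but is never reported as a change.
    - name: Point a symlink at the current release
      ansible.builtin.command: ln -sfn /opt/app/releases/current /opt/app/live
      changed_when: false
EOF
echo "Wrote $demo_dir/shell_idempotency.yaml"

# Usage (assumes an inventory named hosts.yaml):
#   ansible-playbook -i hosts.yaml "$demo_dir/shell_idempotency.yaml"
```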
Handling Handlers and State
Ansible Handlers are special tasks that only run when notified by another task that has resulted in a "changed" state. This is highly efficient for restarting services (like containerd or nginx) only when their configuration file is actually modified.
- The Problem: If you use the ansible.builtin.command or shell modules to modify a configuration (e.g., nvidia-ctk runtime configure) and control the change state with changed_when, a small mistake in the condition can make it evaluate to false on every run. For example, changed_when: "'Configuring' in stdout" never matches if the command prints a different string. Ansible then assumes the task made no changes and silently skips notifying your handler, leaving the service running with stale configuration.
- The Solution: Always test your changed_when conditions thoroughly. If you are unsure of the exact stdout string, it is often safer to register the result (register: my_command) and rely on the return code: changed_when: my_command.rc == 0. The trade-off is that every successful run is reported as changed, so the handler fires each time.
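The return-code approach might look like the playbook below, written to a scratch directory by a short shell snippet. Treat it as a sketch: the play and handler names are hypothetical, the workers group is taken from the example inventory, and the --runtime=containerd argument is an assumption about how nvidia-ctk is invoked in your setup.

```shell
# Write an illustrative playbook that notifies a handler based on the return code.
handler_dir="$(mktemp -d)"
cat > "$handler_dir/nvidia_runtime.yaml" <<'EOF'
- name: Configure the NVIDIA runtime and restart containerd
  hosts: workers
  become: true
  tasks:
    - name: Run nvidia-ctk runtime configure
      ansible.builtin.command: nvidia-ctk runtime configure --runtime=containerd
      register: my_command
      # Reported as changed on every successful run, so the handler always fires.
      changed_when: my_command.rc == 0
      notify: Restart containerd

  handlers:
    - name: Restart containerd
      ansible.builtin.systemd:
        name: containerd
        state: restarted
EOF
echo "Wrote $handler_dir/nvidia_runtime.yaml"
```

The notify value must match the handler name exactly; a mismatch is another way for the restart to be skipped.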
Privilege Escalation (become and -K)
By default, Ansible runs tasks as the SSH user you connected with (e.g., leva). If a task requires root privileges (like installing a package), you must add become: yes to the task or playbook.
However, if the remote user's /etc/sudoers configuration requires a password for sudo (i.e., they do not have NOPASSWD:ALL), the play fails with a Missing sudo password error.
- The Fix: Pass the -K (or --ask-become-pass) flag to your ansible-playbook or ansible ad-hoc command. It prompts you in your terminal for the sudo password before execution begins, allowing Ansible to elevate privileges.
Ansible Ad-Hoc Commands
While playbooks are great for defining state, sometimes you just need to execute a one-off task across a fleet of servers immediately, like initiating a cluster-wide shutdown. Instead of writing a full playbook, you can run an Ad-Hoc command with the ansible CLI. The flags used most often are:
- -b means "become" (run with sudo privileges).
- -K prompts you for the sudo password.
- -m command specifies the module to run.
- -a "..." provides the arguments to the module.
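Put together, the four flags form a command like the one wrapped in the function below. This is a sketch: the function name fleet_shutdown, the hosts.yaml default, and the shutdown command are illustrative choices, not taken from a real cluster.

```shell
# Sketch of an ad-hoc command that shuts down every host in the inventory.
fleet_shutdown() {
  # $1: inventory file (defaults to hosts.yaml)
  ansible all -i "${1:-hosts.yaml}" -b -K -m command -a "shutdown -h now"
}

# Usage (prompts once for the sudo password, then runs on every host):
#   fleet_shutdown hosts.yaml
```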
Wrapping Ad-Hoc Commands in Shell Scripts
While Ad-Hoc commands are powerful, typing them out repeatedly for complex recovery scenarios (like fixing a locked filesystem) is error-prone. A common DevOps pattern is to wrap Ansible Ad-Hoc commands inside a Bash script.
This gives you the best of both worlds: the declarative, remote-execution power of Ansible, with the dynamic variable parsing (like reading .env files) and conditional logic of Bash.
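#!/usr/bin/env bash
set -euo pipefail
NODE="${1:?usage: $0 <node>}"
# Parse the sudo password from a .env file (lines like PASS_<node>=<secret>) to bypass the -K prompt.
# The anchored pattern avoids matching keys that merely contain PASS_<node>; -f2- keeps any '=' in the secret.
PASS="$(grep -m1 "^PASS_${NODE}=" .env | cut -d '=' -f2-)"
echo "Recovering node $NODE..."
# Note: passing the password with -e exposes it in the local process list while the command runs.
ansible "$NODE" -i inventory.yaml -b -m shell -a "systemctl restart kubelet" \
  -e "ansible_become_pass=$PASS"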
Asynchronous Execution
Sometimes an ad-hoc command needs to be "fire-and-forget", such as initiating a cluster shutdown sequence where you don't want Ansible hanging while waiting for the node to fully power off. You can run tasks asynchronously using -B (background) and -P (polling).
- -B 1 tells Ansible to run the job in the background with a maximum timeout of 1 second.
- -P 0 tells Ansible to never poll for the result. It fires the command and immediately disconnects, letting the command run to completion on its own.
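A fire-and-forget shutdown using these two flags could be sketched as follows. The function name and its parameters are hypothetical, and the shutdown command stands in for whatever long-running action you want to detach from.

```shell
# Sketch: start a shutdown in the background and return without polling.
fire_and_forget_shutdown() {
  # $1: host or group pattern, $2: inventory file
  ansible "$1" -i "$2" -b -K -B 1 -P 0 -m command -a "shutdown -h now"
}

# Usage:
#   fire_and_forget_shutdown workers hosts.yaml
```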
Safe Playbook Testing (Sandboxing)
Developing Ansible playbooks directly against production or bare-metal nodes is risky. A poorly written task could easily corrupt a production OS or bring down a cluster.
The industry best practice is to test Infrastructure-as-Code (IaC) against disposable sandboxes before ever running them on real hardware.
- The Solution (Vagrant + VirtualBox): You can scaffold a local, disposable environment (like a labs/ directory) containing a Vagrantfile. This allows you to quickly spin up a local VM (vagrant up) that closely mirrors your production OS (e.g., Debian 13).
- You can write, test, and break your playbook against the local VM. Once it runs cleanly and a second run reports no changes, you point your inventory file at the real bare-metal cluster and deploy with more confidence.
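One way to check idempotency in the sandbox is to run the playbook twice and inspect the second PLAY RECAP. The helper below is a sketch under assumptions: check_idempotent is a hypothetical name, and it relies on the recap printing a changed=N counter for each host.

```shell
# Sketch: run a playbook twice and fail if the second run still reports changes.
# All arguments are passed straight through to ansible-playbook.
check_idempotent() {
  ansible-playbook "$@" || return 1             # first run converges the VM
  recap="$(ansible-playbook "$@")" || return 1  # second run should be a no-op
  # Any changed=N with N > 0 means at least one task is not idempotent.
  if printf '%s\n' "$recap" | grep -Eq 'changed=[1-9]'; then
    echo "Not idempotent: second run reported changes" >&2
    return 1
  fi
  echo "Idempotent: second run reported no changes"
}

# Usage (inventory and playbook names are placeholders):
#   check_idempotent -i labs/inventory.yaml playbook.yaml
```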
Handling Disconnects
When performing operations that modify the network or power state (like issuing a shutdown command or restarting the SSH daemon), you may encounter an edge case where Ansible throws an UNREACHABLE or Connection Refused error.
This happens because the remote host successfully processed the command and immediately severed the SSH socket before Ansible could receive the "success" return code. In bash scripts, you can catch these expected disconnects by appending || true to your ad-hoc commands, so your automation doesn't abort on a false-positive failure. Keep in mind that || true hides every non-zero exit, so limit it to commands where a disconnect is expected.
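Inside a wrapper script, the pattern might look like the sketch below. The function name shutdown_node and its parameters are hypothetical; the point is only that the function returns success even when Ansible reports the host as unreachable.

```shell
# Sketch: tolerate the disconnect that follows a shutdown command.
shutdown_node() {
  # $1: host or group pattern, $2: inventory file
  # "|| true" masks every failure, not only the expected disconnect.
  ansible "$1" -i "$2" -b -m command -a "shutdown -h now" || true
}

# Usage in a script running under "set -e":
#   shutdown_node k8s-worker-01 hosts.yaml
#   echo "Shutdown requested; continuing with the next step"
```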
Handling Package Manager Overwrites (The sqv Bug)
When you deploy custom wrappers over system binaries (e.g., bypassing Debian 13's strict sqv Sequoia OpenPGP checks by wrapping the binary), standard package manager upgrades (like apt-get upgrade) will often overwrite your wrapper with the newly compiled upstream binary. This immediately breaks future Ansible apt tasks.
- The Solution (pre_tasks): In your playbooks, define a pre_tasks block that explicitly checks for your wrapper's existence and forces its re-application before any roles or tasks execute that rely on the package manager.
Bypassing Interactive Prompts (vars_prompt)
While vars_prompt is great for securing passwords manually, it fundamentally breaks CI/CD and fully automated workflows like Makefiles, because it halts execution waiting for human input.
- The Solution (--extra-vars + .env): Remove vars_prompt entirely. Instead, store your secrets in a local .env file (listed in .gitignore). Use a Makefile or wrapper script to source the .env file and pass the secrets as command-line arguments with --extra-vars. Be aware that command-line arguments are visible in the local process list while the command runs.
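A wrapper along these lines could replace the prompt. It is a sketch, not a hardened implementation: BECOME_PASS, hosts.yaml, and site.yaml are assumed names, the .env file is assumed to contain plain KEY=value lines, and values containing spaces or quotes may need the JSON form of --extra-vars instead.

```shell
# Sketch: load secrets from a git-ignored .env file and pass them to ansible-playbook.
run_with_env() (
  # The parentheses run the body in a subshell, so the secrets do not leak
  # into the calling shell.
  # $1: path to the env file; use a path containing a slash (defaults to ./.env)
  env_file="${1:-./.env}"
  . "$env_file"
  ansible-playbook -i hosts.yaml site.yaml \
    -e "ansible_become_pass=${BECOME_PASS:?BECOME_PASS is not set in $env_file}"
)

# Usage:
#   run_with_env ./.env
```

The same function can be called from a Makefile target, which keeps the secret out of the Makefile itself.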
Troubleshooting Common Errors
UNREACHABLE does not always mean network failure
A classic Ansible error is:
UNREACHABLE! => {"msg": "Failed to create temporary directory... exited with result 1"}
While this often implies the remote node is powered off or behind a firewall, if you can still SSH into the node manually, the error means Ansible could not write to the remote disk. Common causes:
- The disk is 100% full: mkdir fails because there are zero bytes remaining.
- Read-Only Lockdown: The remote Linux kernel detected a hardware fault (like a corrupted SSD) and forcefully mounted the root partition as Read-Only to prevent data loss. Ansible cannot write its standard modules to ~/.ansible/tmp.
- Permissions: The ~/.ansible directory on the remote node is accidentally owned by root.
Asynchronous Tasks on Read-Only Filesystems
When you run an asynchronous task (-B 1), Ansible does not use the standard ~/.ansible/tmp directory. Instead, it writes job tracking files to ~/.ansible_async (e.g., /root/.ansible_async if running with become: true).
If you are attempting to run an emergency shutdown script on a node that has fallen into a Read-Only lockdown, the async task will crash with [Errno 30] Read-only file system because it cannot create the tracking directory on the locked root drive.
- The Fix: Explicitly pass the ansible_async_dir variable and point it to the /tmp directory. Since /tmp is often mounted entirely in RAM (tmpfs), it remains writable even when the underlying SSD locks itself.
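Applied to an emergency shutdown, the override might be passed as an extra variable, as in this sketch. The function name and the /tmp/.ansible_async path are illustrative; any writable location on the node should work.

```shell
# Sketch: async shutdown that keeps its job tracking files under /tmp.
emergency_shutdown() {
  # $1: host or group pattern, $2: inventory file
  ansible "$1" -i "$2" -b -K -B 1 -P 0 \
    -m command -a "shutdown -h now" \
    -e "ansible_async_dir=/tmp/.ansible_async"
}

# Usage:
#   emergency_shutdown k8s-worker-01 hosts.yaml
```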
macOS Sandbox Permissions (Local Temp Error)
If you run Ansible from a macOS control node (like a MacBook), you may suddenly encounter:
Unhandled exception when retrieving 'DEFAULT_LOCAL_TMP'
- The Cause: macOS implements aggressive sandboxing and SIP (System Integrity Protection). In some environments or after OS updates, Python/Ansible is blocked from creating temporary script files in the default ~/.ansible/tmp directory on your local machine.
- The Fix: Create an ansible.cfg file at the root of your project that overrides both the local and remote temp directories to use the globally writable /tmp/ directory.
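Such an ansible.cfg could be generated as shown below. This is a sketch: the helper name write_tmp_overrides and the /tmp/.ansible/tmp subdirectory are assumptions, while local_tmp and remote_tmp are the settings in the [defaults] section that control the two directories.

```shell
# Sketch: write a project-level ansible.cfg that redirects both temp directories.
write_tmp_overrides() {
  # $1: target file (defaults to ./ansible.cfg); an existing file is overwritten
  cat > "${1:-./ansible.cfg}" <<'EOF'
[defaults]
local_tmp = /tmp/.ansible/tmp
remote_tmp = /tmp/.ansible/tmp
EOF
}

# Usage (run from the project root):
#   write_tmp_overrides
```

If the project already has an ansible.cfg, add the two keys to its existing [defaults] section by hand instead of overwriting the file.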