Embracing the False Positive: Handling SSH Drops in Ansible
In the world of Infrastructure as Code, we are conditioned to treat every red line of text as a catastrophic failure. If Ansible prints FAILED!, something is broken, right?
Not always. Sometimes, an automation tool fails precisely because it succeeded too well.
The Shutdown Anomaly
I was writing a safe-shutdown.sh script to gracefully power off my bare-metal Kubernetes cluster. After safely draining the workloads via the Kube API, the script triggered an Ansible ad-hoc command to issue a hardware power-off over SSH:
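The exact command is not preserved here. A minimal sketch of such an ad-hoc call might look like the following; the inventory file inventory.ini, the host group k8s_workers, and the function name are assumptions for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical inventory file and host group; adjust to your environment.
INVENTORY="inventory.ini"
GROUP="k8s_workers"

# Issue an immediate power-off on every host in the group via an
# Ansible ad-hoc command (command module, with privilege escalation).
power_off_workers() {
  ansible "$GROUP" -i "$INVENTORY" --become \
    -m ansible.builtin.command -a "shutdown -h now"
}

# Usage (after the workloads have been drained):
#   power_off_workers
```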
Every time I ran it, the workers successfully powered down. But my terminal was filled with terrifying red errors:
fatal: [k8s-worker-01]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh... Connection refused"}
The Race Condition
Why was Ansible reporting an UNREACHABLE error for a command that I knew worked perfectly?
It comes down to a race between the shutdown sequence and Ansible's wait for a result. When Ansible executes a command, it keeps the SSH connection open and waits for the task's result, including its exit code (e.g., 0 for success).
However, shutdown -h now tells the system to begin halting immediately. The SSH daemon and the network go down before Ansible has received that result, so the connection is severed mid-task. Ansible sees the connection drop before any result arrives and marks the host UNREACHABLE instead of reporting the task as successful.
The Fix
In automation, context is everything. Because we expect the connection to drop, this error is actually a false positive. We need to tell our scripts to embrace it.
If you are wrapping Ansible commands in Bash, you can simply append || true to suppress the failure and allow the script to continue to the next machine:
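As a rough sketch of that pattern, the wrapper below powers off hosts one at a time and keeps going even under set -e. The function names and the inventory path are hypothetical, not taken from the original script:

```shell
#!/usr/bin/env bash
# Hypothetical inventory file; adjust to your environment.
INVENTORY="inventory.ini"

# Power off a single host. The trailing "|| true" turns the expected
# non-zero exit (the dropped SSH connection) into success, so a script
# running under "set -e" does not abort here.
power_off_host() {
  ansible "$1" -i "$INVENTORY" --become \
    -m ansible.builtin.command -a "shutdown -h now" || true
}

# Power off each host passed as an argument, one after another.
power_off_all() {
  for host in "$@"; do
    echo "Powering off $host"
    power_off_host "$host"
  done
}

# Usage:
#   power_off_all k8s-worker-01 k8s-worker-02
```

Keep in mind that || true swallows every failure, including real ones such as a wrong inventory path or rejected credentials, so it is best limited to the one command where a dropped connection is expected.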
If you are using Ansible Playbooks, you have two cleaner options. You can explicitly ignore the error:
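The original snippet is not preserved, so the task below is only a sketch of what ignoring the error could look like. The play name and the k8s_workers group are assumptions. One detail matters here: ignore_errors only covers tasks that ran and returned a failed result. An UNREACHABLE host is covered by ignore_unreachable, which is available from Ansible 2.7 onward.

```yaml
- name: Power off Kubernetes workers
  hosts: k8s_workers          # hypothetical inventory group
  become: true
  gather_facts: false
  tasks:
    - name: Halt the machine and tolerate the dropped SSH connection
      ansible.builtin.command: shutdown -h now
      ignore_unreachable: true   # ignore the UNREACHABLE result
      ignore_errors: true        # ignore a FAILED result as well
```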
Or, even better, you can use the dedicated modules that are specifically programmed to expect and handle SSH disconnections gracefully:
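The original example is missing, so which modules were meant is a guess. Two likely candidates are community.general.shutdown for powering a machine off and ansible.builtin.reboot for restarting it. A minimal sketch of the power-off case, assuming the community.general collection is installed and using a hypothetical k8s_workers group:

```yaml
- name: Power off Kubernetes workers
  hosts: k8s_workers          # hypothetical inventory group
  become: true
  gather_facts: false
  tasks:
    - name: Shut down the machine
      community.general.shutdown:
        msg: "Powered off by safe-shutdown.sh"
```

If the collection is not already present, it can be installed with ansible-galaxy collection install community.general.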
Key Takeaway
Not all errors are bugs. When writing infrastructure automation that alters the state of the network or the power of the machine itself, remember that "Connection refused" might just mean your command worked perfectly.