Embracing the False Positive: Handling SSH Drops in Ansible

In the world of Infrastructure as Code, we are conditioned to treat every red line of text as a catastrophic failure. If Ansible prints FAILED!, something is broken, right?

Not always. Sometimes, an automation tool fails precisely because it succeeded too well.

The Shutdown Anomaly

I was writing a safe-shutdown.sh script to gracefully power off my bare-metal Kubernetes cluster. After safely draining the workloads via the Kube API, the script triggered an Ansible ad-hoc command to issue a hardware power-off over SSH:

ansible workers -i hosts.yaml -m command -a "shutdown -h now"

Every time I ran it, the workers successfully powered down. But my terminal was filled with terrifying red errors:

fatal: [k8s-worker-01]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh... Connection refused"}

The Race Condition

Why was Ansible reporting an UNREACHABLE error for a command that I knew worked perfectly?

It comes down to a race condition between the OS and the SSH daemon. When Ansible executes a command, it waits over the open SSH socket for an exit code (e.g., 0 for success).

However, shutdown -h now starts the power-off sequence immediately. On a typical Linux system that sequence stops the SSH daemon and the network, so the connection Ansible is using is torn down before the task result can travel back. Ansible sees an unexpectedly dropped connection and classifies it as a total failure.

The Fix

In automation, context is everything. Because we expect the connection to drop, this error is actually a false positive. We need to tell our scripts to embrace it.

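If you are wrapping Ansible commands in Bash, you can simply append || true so that the non-zero exit status does not abort the rest of the script (for example, when it runs under set -e). Be aware that this hides every failure of that command, not only the expected disconnect: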

ansible workers -i hosts.yaml -m command -a "shutdown -h now" || true
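If a blanket || true feels too quiet, one alternative is a small wrapper that still lets the script continue but records the exit status, so an unexpected failure remains visible in the logs. The following is a minimal sketch; run_expecting_drop is a hypothetical helper name and not part of Ansible.

```shell
#!/usr/bin/env bash
# Sketch: tolerate the expected failure of one command without hiding it.
# run_expecting_drop is a hypothetical helper name, not an Ansible feature.
run_expecting_drop() {
    local rc=0
    # Run the command passed as arguments and capture its exit status.
    "$@" || rc=$?
    if [ "$rc" -ne 0 ]; then
        # Report the status on stderr so it still shows up in the script's logs.
        echo "note: '$1' exited with status $rc (expected while hosts power off)" >&2
    fi
    # Always succeed so the calling script carries on.
    return 0
}

# Usage with the same ad-hoc command as above (commented out on purpose):
# run_expecting_drop ansible workers -i hosts.yaml -m command -a "shutdown -h now"
```

The wrapper behaves like || true from the script's point of view, but the logged status makes it easier to notice when the command failed for a reason other than the power-off.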

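If you are using Ansible playbooks, you have two cleaner options. You can explicitly ignore the error. Because the host is reported as UNREACHABLE rather than FAILED, ignore_errors alone does not cover it; ignore_unreachable (available since Ansible 2.7) is the keyword for this case:

- name: Shut down the node
  command: shutdown -h now
  ignore_unreachable: true  # the dropped SSH session is reported as UNREACHABLE
  ignore_errors: true       # also tolerate the task being reported as FAILED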

Or, even better, you can use a dedicated module that is written to expect and handle the SSH disconnection gracefully, such as shutdown from the community.general collection:

- name: Shut down the node natively
  community.general.shutdown:
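Whichever variant you use, the error message alone cannot tell a clean power-off from a genuine failure, so it can help to confirm that the nodes really went down. Here is a minimal sketch: wait_until_down and PROBE_INTERVAL are hypothetical names, and the usage line assumes a netcat build that supports the -z and -w options.

```shell
#!/usr/bin/env bash
# Sketch: poll until a probe command starts failing, i.e. the host stopped answering.
# wait_until_down and PROBE_INTERVAL are hypothetical names chosen for this example.
wait_until_down() {
    local max_tries attempt
    max_tries=$1
    shift
    attempt=1
    while [ "$attempt" -le "$max_tries" ]; do
        # The remaining arguments are the probe command supplied by the caller.
        if ! "$@" >/dev/null 2>&1; then
            return 0    # probe failed: the host no longer answers
        fi
        sleep "${PROBE_INTERVAL:-2}"
        attempt=$((attempt + 1))
    done
    return 1            # probe still succeeds after max_tries attempts
}

# Usage, assuming nc supports -z (scan only) and -w (timeout in seconds):
# wait_until_down 30 nc -z -w 2 k8s-worker-01 22 && echo "k8s-worker-01 is down"
```

Passing the probe as arguments keeps the helper independent of any particular tool; a ping or an SSH connection attempt would work the same way.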

Key Takeaway

Not all errors are bugs. When writing infrastructure automation that alters the state of the network or the power of the machine itself, remember that "Connection refused" might just mean your command worked perfectly.