Python, signal handlers, and exceptions

Never ever, ever raise a regular exception in a Python signal handler.

This is probably the best advice that I had never heard.  But after looking into an initial analysis of a timeout decorator bug it’s advice that I wish was prominently advertised.  So I’m publishing this little tale to let others know about this hidden gotcha so that, just maybe, when you have an opportunity to do this, a shiver will run down your spine, your memory will be tickled, and you’ll be able to avoid the pitfalls that I’ve recorded here.

When will I need to be reminded of this?

Signals are a means provided by an operating system to inform programs of  events that aren’t part of their normal program flow.  If you’re on a UNIX-like operating system and hit Control-C to cancel a program running in the shell, you’ve used a signal.  Control-C in the shell sends a SIGINT (Interrupt) signal to the program.  The program receives the SIGINT and a signal handler (a function which is written to take appropriate action when a signal is received) is invoked to handle it.  In most cases, for SIGINT, the signal handler tries to complete any pressing work (for instance, to finish writing data to a file so that the file isn’t left in a half-written state) and then causes the program to exit.

Python provides a signal library which is very similar to the C API underneath Python.  Python does a little bit of behind the scenes work to make sure that signals appear to be interacting with Python code rather than interfering with the interpreter’s implementation (meaning that signals will appear to happen between Python’s byte code instructions rather than leaving a Python instruction half-completed).

Your mission: exit when a signal occurs

If you search the internet for how to implement a timeout in Python, you’ll find tons of examples using signals, including one from the standard library documentation and one which is probably where the design for our original decorator came from.  So let’s create some quick test code to show how the signal works to implement a timeout

import signal
import time

def handler(signum, frame):
    print('Signal handler called with signal',
          signum)
    raise OSError("timeout exceeded!")

def long_function():
    time.sleep(10)

# Set the signal handler and a 1-second alarm
old_handler = signal.signal(signal.SIGALRM,
                            handler)
signal.alarm(1)
# This sleeps for longer than the alarm
start = time.time()
try:
    long_function()
except OSError as e:
    duration = time.time() - start
    print('Duration: %.2f' % duration)
    raise
finally:
    signal.signal(signal.SIGALRM, old_handler)
    signal.alarm(0) # Disable the alarm

This code is adapted from the example of implementing a timeout in the standard library documentation.  We first define a signal handler named handler() which will raise an OSError when it’s invoked. We then define a function, long_function() which is designed to take longer to run than our timeout. The we hook everything together:

  • On lines 13-14 we register the handler to be invoked when a SIGALRM occurs.
  • On line 15 we tell Python to emit SIGALRM after 1 second.
  • On line 19 we call long_function()
  • On line 20-23, we specify what happens when a timeout occurs
  • In the finally block on lines 24-26, we undo our signal handler by first re-registering the prior SIGALARM handler as the function to invoke if a new SIGALRM occurs and then disabling the alarm that we had previously set.

When we run the code we see that the signal handler raises the OSError as expected:

$ /bin/time -p python3 test.py
Signal handler called with signal 14
Duration: 1.00
Traceback (most recent call last):
  File "test.py", line 18, in
    long_function()
  File "test.py", line 11, in long_function
    time.sleep(2)
  File "test.py", line 8, in handler
    raise OSError("timeout exceeded!")
OSError: timeout exceeded!
real 1.04

Although long_function() takes 10 seconds to complete, the SIGALRM fires after 1 second.  That causes handler() to run which raises the OSError with the message timeout exceeded!.  The exception propagates to our toplevel where it is caught and prints Duration: 1.00 before re-raising so we can see the traceback.  We see that the output of /bin/time roughly agrees with the duration we calculated within the program… just a tad over 1 second to run the whole thing.

Exceptions seem to be working…?

It’s time to make our long_function code a little less trivial.

import signal
import time

def handler(signum, frame):
    print('Signal handler called with signal',
signum)
    raise OSError("timeout exceeded!")

def long_function():
    try:
        with open('/etc/passwd', 'r') as f:
            data = f.read()
            # Simulate reading a lot of data
            time.sleep(10)
    except OSError:
        # retry once:
        with open('/etc/passwd', 'r') as f:
            data = f.read()
            time.sleep(10)

# Set the signal handler and a 5-second alarm
old_handler = signal.signal(signal.SIGALRM,
handler)
signal.alarm(1)
start = time.time()
# This sleeps for longer than the alarm
try:
    long_function()
except OSError as e:
    duration = time.time() - start
    print('Duration: %.2f' % duration)
    raise
finally:
    signal.signal(signal.SIGALRM, old_handler)
    signal.alarm(0)        # Disable the alarm

We’ve changed long_function(), lines 9-20, to do two new things:

  1. It now reads from a file.  We still have time.sleep(), but that’s just to simulate a long read operation which will trigger our timeout.
  2. Open() and read() can raise a variety of exceptions.  If open() or read() raise an OSError exception, we want to retry reading the file once before we give up.

So what happens when we run this version?

$ /bin/time -p python3 test.py
Signal handler called with signal 14
real 11.07

That was unexpected…

As you can see from the output, our program still fired the SIGALRM.  And the signal handler still ran.  But after it ran, everything else seems to be different.  Apparently, the OSError didn’t propagate up the stack to our toplevel so we didn’t print the duration.  Furthermore, we ended up waiting for approximately 10 seconds in addition to the 1 second we waited for the timeout. What happened?

The key to understanding this is to understand that when the signal handler is invoked, Python adds the call to the signal handler onto its current call stack (The call stack roughly represents the nesting of functions that lead up to where the code is currently executing.  So, in our example, inside of the signal handler, the call stack would start with the module toplevel, then long_function(), and finally handler().  If you look at the traceback from the first example, you can see exactly that call stack leading you through all the function calls to the point where the exception was raised.)  When the signal handler raises its exception, Python unwinds the call stack one function at a time to find an exception handler (not to be confused with our signal handler) which can handle the exception.

Where was the program waiting when SIGALRM was emitted?  It was on line 14, inside of long_function().  So Python acts as though handler() was invoked directly after that line.  And when handler() raises its OSError exception, Python then unwinds the call stack from handler() to long_function() and sees that on line 15, there’s an except OSError: and so Python lets it catch the signal handler’s OSError instead of propagating it up the stack further.  And in our code, that exception handler decides to retry reading the file which is where there is a second 10 second delay as we read the file.  Since the SIGALRM was already used up, the timeout doesn’t fire this time.  So the rest of the bug progresses from there: long_function() now waits the full 10 seconds before returning because there’s no timeout to stop it.  It then returns normally to its caller.  The caller doesn’t receive an OSError exception.  So it doesn’t fire its own OSError exception handler which would have printed the Duration.

Okay, this isn’t good. Can it get worse?

There are even less intuitive ways that this bug can be provoked. For instance:

def long_function():
    time.sleep(0.8)
    try:
        time.sleep(0.1)
        time.sleep(0.1)
        time.sleep(0.1)
    except Exception:
        pass

In this version, we don’t have the context clue that we’re raising OSError in the signal handler and mistakenly catching it in long_function(). An experienced Python coder will realize that except Exception is sufficiently broad to be catching OSError, but in the clutter of other function calls and without the same named exception to provoke their suspicions initially, they might not realize that the occasional problems with the timeout not working could be triggered by this late exception handler.

This is also problematic if the programmer is having to orient themselves off of a traceback. That would lead them to look inside of long_function() for the source of a OSError. They won’t find it there because it’s inside of the signal handler which is outside of the function.

import zipfile
def long_function():
    time.sleep(0.9)
    zipfile.is_zipfile('my_file.zip')

In this case, there’s no exception handling in our code to clue us in. If you look at the implementation of zipfile.is_zipfile(), though, you’ll see that there’s an except OSError: inside of there. If the timeout’s alarm ends up firing while you are inside of that code, is_zipfile() will just as happily swallow the OSError as an exception handler inside of long_function() would have.

So I think I can solve this if I….

There are ways to make the timeout functionality more robust. For instance, let’s define our handler like this:

import functools

class OurTimeout(BaseException):
    pass

def handler(timeout, signum, frame):
    print('Signal handler called with signal',
          signum)
    signal.alarm(timeout)
    raise OurTimeout("timeout exceeded!")

# toplevel code:
one_second_timeout = functools.partial(handler, 1)
old_handler = signal.signal(signal.SIGALRM, one_second_timeout)
signal.alarm(1)
try:
    long_function()
except OurTimeout:
    print('Timeout from our timeout function)

The first thing you can see is that this now raises a custom timeout which inherits from BaseException. This is more robust than using standard timeouts which are rooted at Exception because well written Python code won’t catch exceptions inheriting directly from BaseException. It’s not foolproof, however, because it’s only convention that prevents someone from catching BaseException. And no matter how good a Python programmer you are, I’m sure you’ve been lazy enough to write code like this at least once:

def long_function():
    try:
        with open('/etc/passwd', 'r') as f:
            f.read()
    except:
        # I'll fix the possible exception classes later
        rollback()

Bare except: catches any exception, including BaseException so using a bare except will still break this timeout implementation.

The second thing this implementation changes is to reset the alarm inside of the signal handler. This is helpful as it means that the code will be able to timeout multiple times to get back to the toplevel exception handler. In our retry example, Both the initial attempt and the retry attempt would timeout so long_function() would end up taking only two seconds and fail due to our timeout in the end. However, there are still problems. For instance, this code ends up taking timeout * 3 seconds instead of just timeout seconds because the exception handler prevents us from hitting the break statement so we’ll have to keep hitting the timeout:

def long_function():
    for i in range(0, 3):
        try:
            time.sleep(10)
        except OSError:
            pass
        else:
            break

The following code, (arguably buggy because you *should* disable the alarm as the first thing in the recovery code instead of the last) can end up aborting the toplevel code’s recovery effort if timeout is too short:

try:
    long_function()
except OSError:
   rollback_partial_long_function()
   signal.alarm(0)

So even though this code is better than before, it is still fragile, error prone, and can do the wrong thing even with code that isn’t obviously wrong.

Doctor, is there still hope?

When I was looking at the problem with the timeout decorator in Ansible what struck me was that when using the decorator, the decorator was added outside of a function telling it to timeout but the timeout was occurring and potentially being processed inside of the function. That meant that it would always be unintuitive and error prone to someone trying to use the decorator:

@timeout.timeout(1)
def long_function():
    try:
        time.sleep(10)
    except:
        # Try an alternative
        pass

try:
    long_function()
except timeout.TimeoutError:
    print('long_function did not complete')

When looking at the code, the assumption is that on the inside there’s long_function, then outside of it, the timeout code, and outside of that the caller. So the expectation is that an exception raised by the timeout code should only be processed by the caller since exceptions in Python only propagate up the call stack. Since the decorator’s functionality was implemented via a signal handler, though, that expectation was disappointed.

To solve this, I realized that the way signals and exceptions interact would never allow exceptions to propagate correctly. So I switched from using a signal to using one thread for the timeout and one thread for running the function. Simplified, that flow looks like this (You can look at the code for the new decorator in Ansible if you’re okay with the GPLv3+ license. The following code is all mine in case you want to re-implement the idea without the GPLv3+ restrictions:

import multiprocessing
import multiprocessing.pool

def timeout(seconds=10):
    def decorator(func):
        def wrapper(*args, **kwargs):
            pool = multiprocessing.pool.ThreadPool(processes=1)
            results = pool.apply_async(func, args, kwargs)
            pool.close()
            try:
                return results.get(seconds)
            except multiprocessing.TimeoutError:
                raise OSError('Timeout expired after: %s' % seconds)
            finally:
                pool.terminate()
        return wrapper
    return decorator

@timeout(1)
def long_func():
    try:
        time.sleep(10)
    except OSError:
        print('Failure!')
    print('end of long_func')

try:
    long_func()
except OSError as e:
    print('Success!')
    raise

Edit: Davi Aguiar pointed out that the snippet was leaving the thread running after timeout. The example has been updated to add a pool.terminate() call inside of a finally: to take care of reaping the thread after the timeout expires.

As you can see, I create a multiprocessing.pool.ThreadPool with a single thread in it. Then I run the decorated function in that thread. I use res.get(timeout) with the timeout we set to get the results or raise an exception if the timeout is exceeded. If the timeout was exceeded, then I throw the OSError to be like our first example.

If all goes well, the exception handling inside of long_func won’t have any effect on what happens. Let’s see:

$ python2 test.py
Success!
Traceback (most recent call last):
  File "test.py", line 49, in
    long_func()
  File "test.py", line 35, in wrapper
    raise OSError('Timeout expired after: %s' % seconds)
OSError: Timeout expired after: 1

Yep, as you can see from the stack trace, the OSError is now being thrown from the decorator, not from within long_func(). So only the toplevel exception handler has a chance to handle the exception. This leads to more intuitive code and hopefully less bugs down the road.

So are there ever times when a signal handler is the right choice?

Signal handlers can be used for more than just timeouts. And they can do other things besides trying to cancel an ongoing function. For instance, the Apache web server has a signal handler for the HUP (HangUP) signal. When it receives that, it reloads its configuration from disk. If you take care to catch any potential exceptions during that process, you shouldn’t run across these caveats because these problems only apply to raising an exception from the signal handler.

When you do want to exit the function via a signal handler, I would be a little hesitant because you can never entirely escape from the drawbacks above and threading provides a more intuitive interface for the called and calling code.  I think that a few practices make it more robust, however.  As mentioned above:

  • Use a custom exception inheriting directly from BaseException to reduce the likelihood that unexpected code will catch it.
  • If you are doing it for a timeout, remember to reset the alarm inside of the signal handler in case something does catch your custom exception.  At least then you’ll continue to hit timeouts until you finally escape the code.

And one that wasn’t mentioned before:

It’s better to make an ad hoc exception-raising signal handler that handles a specific problem inside of a function rather than attempting to place it outside of the function.  For instance:

def timeout_handler():
    raise BaseException('Timeout')

def long_function():
    try:
        old_handler = signal.signal(signal.SIGALRM,
                                    timeout_handler)
        signal.alarm(1)
        time.sleep(10)   # Placeholder for the real,
                         # potentially blocking code
    except BaseException:
        print('Timed out!')
    finally:
        signal.signal(signal.SIGALRM, old_handler)
        signal.alarm(0)

long_function()

Why is it better to do it this way? If you do this, you localize the exception catching due to the signal handler with the code that could potentially raise the exception. It isn’t foolproof as you are likely still going to call helper functions (even in the stdlib) which could handle a conflicting exception but at least you are more clearly aware of the fact that the exception could potentially originate from any of your code inside this function rather than obfuscating all of that behind a function call.

Advertisements

Flock Ansible Talk

Hey Fedorans, I’m trying to come up with an Ansible Talk Proposal for FLOCK in Rochester. What would you like to hear about?

Some ideas:

* An intro to using ansible
* An intro to ansible-playbook
* Managing docker containers via ansible
* Using Ansible to do X ( help me choose a value for X 😉
* How to write your own ansible module
* How does ansible transform a playbook task into a python script on the remote system

Let me know what you’re interested in 🙂

Hacky, hacky playbook to upgrade several Fedora machines

Normally I like to polish my code a bit before publishing but seeing as Fedora 21 is relatively new and a vacation is coming up which people might use as an opportunity to upgrade their home networks, I thought I’d throw this extremely *unpolished and kludgey* ansible playbook out there for others to experiment with:

https://bitbucket.org/toshio/local-playbooks/src/3d3ae76a56034784ab60fcfa1129221c59a40f3b/provisioning/fedora-upgrade.yml?at=master

When I recently looked at updating all of my systems to Fedora 21 I decided to try to be a little lighter on network bandwidth than I usually am (I’m on slow DSL and I have three kids all trying to stream video at the same time as I’m downloading packages). So I decided that I’d use a squid caching proxy to cache the packages that I was going to be installing since many of the packages would end up on all of my machines. I found a page on caching packages for use with mock and saw that there were a bunch of steps that I probably wouldn’t remember the next time I wanted to do this. So I opened up vim, translated the steps into an ansible playbook, and tried to run it.

First several times, it failed because there were unsolved dependencies in my packageset (packages I’d built locally with outdated dependencies, packages that were no longer available in Fedora 21, etc). Eventually I set the fedup steps to ignore errors so that the playbook would clean up all the configuration and I could fix the package dependency problems and then re-run the playbook immediately afterwards. I’ve now got it to the point where it will successfully run fedup in my environment and will cache many packages (something’s still using mirrors instead of the baseurl I specified sometimes but I haven’t tracked that down yet. Those packages are getting cached more than once).

Anyhow, feel free to take a look, modify it to suit your needs, and let me know of any “bugs” that you find 🙂

Things I’ll probably do to it for when I update to F22:

Tales of an Ansible Newb: Ch.1 – Modules

In my last post I used the ansible command line to run an arbitrary shell command on several remote systems. Just being able to do that makes Ansible a useful tool in a sysadmin’s toolchest. However, that’s just the tip of what Ansible can do. Let’s go one step further today and explore some Ansible modules.

The command from the last post looked like this:

ansible '*' -i  'host1,host2' -a 'yum -y install tmux' -K --sudo

That let me run an ad hoc command on a set of hosts. You can run any command you want by specifying a different command line to -a:

$ ansible '*' -i 'host1,' -a 'hostname'
host1 | success | rc=0 >>
host1.lan

$ ansible '*' -i 'fedora20-test,' -a 'grep fedora /etc/os-release'
fedora20-test | success | rc=0 >>
ID=fedora
CPE_NAME="cpe:/o:fedoraproject:fedora:20"
HOME_URL="https://fedoraproject.org/"

Definitely useful. But what if you need to combine several commands together? For instance, let’s say we want to run one command if we’re talking to Fedora and a different one if we’re talking to Ubuntu? Well, we know how to use if-then-else in a shell script, maybe that will work here? Let’s give it a try!

$ ansible '*' -i 'fedora20-test,ubuntu14' -a \
> 'if test -f /etc/fedora-release ; then echo "Fedora!" ; else echo "Ubuntu!" ; fi'
ubuntu14 | FAILED | rc=2 >>
[Errno 2] No such file or directory

fedora20-test | FAILED | rc=2 >>
[Errno 2] No such file or directory

Nope that didn’t work. For a moment we might be puzzled why we get the error “No such file or directory” but then we realize that if is a shell builtin. If ansible isn’t creating a shell on the remote system then if won’t be found. So how can we do this?

Until now all of the commands I’ve given Ansible have been single programs and their command line parameters specified as a string to ansible’s -a parameter. But just what is the -a? Ansible’s man page has this information for us:

-a ‘ARGUMENTS’, –args=’ARGUMENTS’
The ARGUMENTS to pass to the module.

So what’s this module it’s talking about? Modules are small pieces of code that Ansible is able to run on a remote machine. By default, if no module is specified, Ansible runs the command module. The command module takes a single argument, a command line string which it then runs on the remote machine… with the important note that it directly invokes it so that we don’t have to worry about a remote shell interpreting any special characters . It does not pass it through the shell. But if we do need to use shell constructs (shell variables, redirection, builtins, etc) we can use the shell module instead. Woo hoo! Let’s try that:

$ ansible '*' -i 'fedora20-test,ubuntu14' -m shell -a \
'if test -f /etc/fedora-release ; then echo "Fedora!" ; else echo "Ubuntu!"
; fi'
ubuntu14 | success | rc=0 >>
Ubuntu!

fedora20-test | success | rc=0 >>
Fedora!

Excellent. So now we know how to use the full range of shell features if we need to. But I wonder what other modules Ansible ships with.

To find the answer to that, you can run:

$ ansible-doc -l

Which prints out a list of 240 or so modules which range from niche applications like managing specific brands of network devices to installing OS packages on various types of Linux distribution and copying files between the local system and remote hosts. You can find information about any one of these by running ansible-doc MODULE-NAME.

So let’s try out one of these modules. Let’s take our example of installing a package on several remote computers that we used last time and make it use the yum module instead of invoking yum via the command module. ansible-doc yum tells us that the yum module takes a few arguments of which name is mandatory and specifies a package name. state let’s us tell the module what state we want the package to be in when ansible exits. Let’s give it a try:

$ ansible '*' -i 'fedora20-test,' -m yum -a 'name=tmux,zsh state=latest' --sudo -K
sudo password: 
fedora20-test | success >> {
    "changed": true, 
    "msg": "", 
    "rc": 0, 
    "results": [
        "All packages providing tmux are up to date",.
        "Resolving Dependencies\nRunning transaction check\n
Package zsh.x86_64 0:5.0.7-1.fc20 will be updated\n
Package zsh.x86_64 0:5.0.7-4.fc20 will be an update\n
Finished Dependency Resolution\n\nDependencies Resolved\n\n
================================================================================\n
Package       Arch             Version                 Repository         Size\n
================================================================================\n
Updating:\n zsh           x86_64           5.0.7-4.fc20            updates           2.5 M\n\n
      Transaction Summary\n     
================================================================================\n
Upgrade  1 Package\n\nTotal download size: 2.5 M\nDownloading packages:\n
Not downloading deltainfo for updates, MD is 2.9 M and rpms are 2.5 M\n
Running transaction check\nRunning transaction test\nTransaction test succeeded\n
Running transaction (shutdown inhibited)\n
Updating   : zsh-5.0.7-4.fc20.x86_64                                      1/2 \n
Cleanup    : zsh-5.0.7-1.fc20.x86_64                                      2/2 \n
Verifying  : zsh-5.0.7-4.fc20.x86_64                                      1/2 \n  
Verifying  : zsh-5.0.7-1.fc20.x86_64                                      2/2 \n\n
Updated:\n  zsh.x86_64 0:5.0.7-4.fc20                                                     \n\n
Complete!\n"
    ]
}

Yep, as we expect telling ansible to use the yum module and specifying that the arguments to the module are name=tmux,zsh state=latest makes sure that the tmux and zsh packages are installed and at their latest available versions. But has using the yum module gained us anything? It’s about as long to type -m yum -a 'name=tmux,zsh state=latest as it is to type -a 'yum install -y tmux zsh and we already know the syntax for the latter. Is there a difference? For the yum module there isn’t much difference. The module and the single command do about the same thing. But there are other modules that do more. For instance, the git module.

Let’s say that you have a web application in a git repository and you want to deploy it directly from git. You need to clone the repository remotely and checkout a specific revision to the working tree because that’s what should be running in production, not the current HEAD of the master branch. You also need to checkout a few git submodules for projects that are hosted in different repositories. If you did this via the shell module, you’d have to use several different calls to git:

$ ansible '*' -i 'host1,' -a \
  'git clone http://git.example.com/project /var/www/webapp'
$ ansible '*' -i 'host1,' -m shell -a \
  'cd /var/www/webapp && git checkout fad30ad8'
$ ansible '*' -i 'host1,' -m shell -a \
  'cd /var/www/webapp && git submodule update --init'

The git module encapsulates git’s features in such a way that you can do all of this in one ansible call:

$ ansible '*' -i 'host1,' -m git -a \
  'repo=http://git.example.com/project state=present version=fad30ad8 recursive=yes dest=/var/www/webapp'

Kinda handy, right? Where modules really start to shine, though, is when we stop trying to run everything in a single ansible command line and start using playbooks to perform multiple steps in a single run. More on playbooks next time.

Tales of an Ansible Newb: Ch0.1 – What are these blog posts?

Last time I wrote about my first baby steps with Ansible, using it for ad hoc commands on multiple machines. When I first started working on Ansible three months ago, that was what 90% of my Ansible knowledge consisted of.

At work we were using Ansible to replace puppet as our configuration management system in Fedora Infrastructure but somehow I never understood the bigger picture of how it all fit together. Host_vars and group_vars and roles and inventory and playbooks that were included in other playbooks and playbooks that worked in conjunction with RHEL kickstart files to create a new host and playbooks that configured the hosts once they were created. And roles… where did roles fit into all of this?

Basically, the system at work was large and I had a hard time finding small enough pieces to master one at a time.

Now that I’m working on Ansible, I have an even larger need to understand how to use it. Code doesn’t exist in a vacuum; I need to know how people might use the code in order to better write the behaviours that they may want. So, on the theory that the best way to learn is by doing and having learned from the mistake of trying to understand too much all at once, I’ve started writing playbooks for my home network. Now, I’m not yet ambitious enough to try to replicate the production-quality or production goals that we had in an infrastructure like Fedora. I’m not (yet?) interested in tearing down my home network and provisioning it from scratch with Ansible. At the moment my ambition is simply to automate some repetitive tasks. As time goes on we’ll see what else happens 🙂

Tales of an Ansible Newb: Ch.0 – The Hook

The main thing I like about Ansible is that it has a very easy onramp. If you’ve done even a little bit of administration of a *nix system you’ve been exposed to the shell to interact with the system and ssh to connect you to a remote computer where you can interact with the remote system’s shell. At its most basic level, Ansible is just building on those pieces of knowledge.

For instance, let’s say you have a few computers at your house for you and your wife and your kids and the neighbors’ kids and the cat to use as a warm bed when the sun isn’t shining. You want to install a new piece of software on all of them. How do you do that?

Pre-Ansible it might look like this:

$ ssh host1
$ sudo yum -y install tmux
$ exit
$ ssh host2
$ sudo yum -y install tmux
$ exit
[...]

If you were feeling luc^Wconfident in your shell scripting you might try to automate some of that:

$ for host in host1 host2 [...] ; do
for> ssh -t $host1 sudo yum -y install tmux
for> done
[sudo] password for badger: [types password]
[waits while stuff gets done]
Connection to host1 closed.
[sudo] password for badger: [types password again]
[waits while stuff gets done]
Connection to host2 closed.
[...]

A little better but we still have to type your sudo password prompt once for every host. You still have to wait between typing in your password. You still have the tasks running one-by-one on the remote hosts even though they could be done in parallel.

At this point, depending on what type of person you are you’ll be thinking one of three things:

  • Yay! Automation!
  • I bet I can do better… let me just crack open the expect manual, open up
    my text editor, and maybe rewrite this in perl….
  • You know, someone else must have written an application for this…

I’m a programmer by nature so once upon a time I probably would have found myself running down path 2 with nary a backwards glance. But if time, grey hairs, and the prodding of your more experienced sysadmin friends teaches you anything, it’s that you shouldn’t invent your own security sensitive wheel when someone else’s is perfectly serviceable. So let’s reach for Ansible and see what it can do here:

ansible '*' -i 'host1,host2,[...]' -a 'yum -y install tmux' -K --sudo
sudo password:
fedora20 | success | rc=0 >>
Resolving Dependencies
--> Running transaction check
[...]
Installed:
tmux.x86_64 0:1.9a-2.fc20

Complete!

katahdin | success | rc=0 >>
Loaded plugins: langpacks
Resolving Dependencies
--> Running transaction check
[...]
Installed:
tmux.x86_64 0:1.9a-2.fc20

Complete!

Nice! So now I can run ad hoc commands on multiple ad hoc hosts, using sudo if I want to. I type my sudo password just once at the very beginning and then go do other things while Ansible runs my task on all the hosts I specified! That certainly makes managing my home network easier. I think I’ll be using this tool more often….