We've been running into issues in the ceph-ansible project around restarting services in serial, in a way that is consistent and doesn't do unnecessary restarts. This goes back further than the above linked PR, and it's a pretty tricky thing to do! I've come up with a solution, and I'd like to run through the problem and changes with you.

Simple service restarts

In the simplest case, you would restart a service like this:

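- name: Some change happens to host
  debug:
    msg: "Some change happens to host"
  changed_when: true  # debug never reports "changed" on its own, so force it for this demo
  notify:
    - restart my service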

And then the handler would look like this:

- name: restart my service
  service:
    name: my_service
    state: restarted

This is fine for:

  • One host
  • Many hosts running in serial
  • Many hosts where you don't care if my_service is restarted at the same time on all the hosts.

Restarting in serial

OK, so we need to ensure my_service restarts in serial and doesn't restart at the same time on every host. How can we do that? A pretty common and simple way is to change the handler to this:

- name: restart my service
  service:
    name: my_service
    state: restarted
  with_items: "{{ ansible_play_batch }}"
  run_once: True
  delegate_to: "{{ item }}"

This takes the first host that calls the handler and uses it to initiate a restart of my_service on every host in the current batch (ansible_play_batch), one host at a time, because the loop works through its items sequentially. Without serial, the batch is every host included in the play. With serial: 1, it is only the current host.

If you were to use ansible_play_hosts instead of ansible_play_batch, it would ignore the state of serial and cause a restart on all hosts, so ansible_play_batch makes more sense here.
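
To see the difference between the two variables for yourself, a throwaway play along these lines should do. This is only a sketch: the group name my_group is hypothetical, and it assumes an Ansible version that provides both ansible_play_batch and ansible_play_hosts.

```yaml
# Hypothetical group name; replace it with a group from your own inventory.
- hosts: my_group
  serial: 2              # work through the play two hosts at a time
  gather_facts: false
  tasks:
    - name: Compare the current batch with all hosts in the play
      debug:
        msg: "batch={{ ansible_play_batch }} play_hosts={{ ansible_play_hosts }}"
      run_once: True     # with serial, this prints once per batch
```

With more than two hosts in the group, the batch list should change from one serial run to the next, while play_hosts keeps listing every active host in the play.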

This solves the key problem from before, and keeps the same benefits:

  • We can run this without restarting my_service on all hosts at the same time.
  • This will still work in serial, and only restart the host in the current serial run.

However, we still have one case that is causing us issues...

What happens if changes don't all happen to all hosts?

There are a few cases that can cause a situation where a change only happens to one of the hosts in the group:

  • host_vars change on one specific host initiates the handler on only one host.
  • A host is in multiple groups, and the change has already happened on the host, as part of the original group's run, meaning it has already initiated the handler and had the service restarted.

Essentially any situation where a change is not being applied to all hosts in the group, for whatever reason.

In this situation, the handler is skipped on the hosts that have not initiated the change, but run_once ensures that it still runs once. So even if only the last host in the run has a change, that host will initiate a restart on ALL other hosts in the group, even though they do not need it and haven't asked for it.

The reason is, we are relying on one host's call of the handler to mean that all hosts should be restarted. In a run where a change happens uniformly this is great. But that isn't always the case! We have no way to tell if a host has called the handler itself.
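
A minimal way to reproduce this might look like the play below. It is a sketch under assumptions: the group my_group and the host name host3 are hypothetical, and the change is faked with changed_when rather than coming from a real host_vars difference.

```yaml
# Hypothetical inventory: group "my_group" containing host1, host2 and host3.
- hosts: my_group
  gather_facts: false
  tasks:
    - name: Change that only applies to one host
      debug:
        msg: "Only host3 reports a change"
      # Report "changed" on a single host to mimic a host-specific change.
      changed_when: inventory_hostname == 'host3'
      notify:
        - restart my service

  handlers:
    - name: restart my service
      service:
        name: my_service
        state: restarted
      with_items: "{{ ansible_play_batch }}"
      run_once: True
      delegate_to: "{{ item }}"
```

Only host3 notifies the handler here, yet the loop still delegates a restart to every host in the batch, which is exactly the unnecessary restart described above.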

The delegate_to fix works fine for a lot of situations, as long as you can agree to the following assumptions:

  • You don't care that services are restarted unnecessarily.
  • Changes will always happen to ALL hosts in a group.
  • There is no situation where a change will be initiated on hosts in separate groups, in different orders. E.g. the same change happens to groupA and groupB, but groupA gets run first. This causes a situation where hosts in both groups will get restarted on both occasions.

Unfortunately, there is no fix for this in Ansible. Ideally, a serial mode on tasks would do the trick, because then only the hosts calling the handler would restart the service, and they would do so in serial. That doesn't exist, though; there is a long issue report about it on GitHub.

So what can we do about it?

How to solve this

As I mentioned, the problem with delegate_to is that we are using one host to decide whether other hosts need a restart, based only on its own state. We have no idea what the state of the other hosts is, or whether they called the handler or not.

Fortunately, we can work around this. Ansible's listen: directive lets several handlers subscribe to the same notification, and they run in the order they are defined when that notification is sent. Using this, we can manually set and then test the required state on the hosts themselves, by doing the following:

- name: Set _restart_required var on before restart
  set_fact:
    _restart_required: True
  listen: "restart my service"

- name: Restart the service in serial if it needs it
  service:
    name: my_service
    state: restarted
  when: hostvars[item]['_restart_required'] | default(False)
  with_items: "{{ ansible_play_batch }}"
  delegate_to: "{{ item }}"
  run_once: True
  listen: "restart my service"

- name: Set _restart_required var off after restart
  set_fact:
    _restart_required: False
  listen: "restart my service"

This will execute the three handlers in order. The first sets a fact to say the host needs a restart; it is only set if the host has called the handler. Next, we restart my_service on each host in the batch that has the fact set to True, implying it needs a restart. Finally, to ensure we don't restart services multiple times, we set the fact back to False. If the host calls the handler again later, the fact will be set to True again and the service will restart.
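
While testing this, it can help to see which hosts actually raised the flag. One option is a temporary debug handler like the sketch below. It is not part of the solution itself, and its name is made up for illustration. Since handlers run in the order they are defined, it would need to sit after the handler that sets the fact and before the one that restarts the service.

```yaml
# Hypothetical debugging aid: define it between the "set var on" handler
# and the restart handler, then remove it once you are happy with the result.
- name: Show which hosts requested a restart
  debug:
    msg: "{{ item }}: {{ hostvars[item]['_restart_required'] | default(False) }}"
  with_items: "{{ ansible_play_batch }}"
  run_once: True
  listen: "restart my service"
```

Hosts that never called the handler have no _restart_required fact at all, which is why the default(False) filter is needed both here and in the restart handler.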


I personally believe this should be a capability in Ansible itself, which is something I'd like to work on at some point!

I believe the above solution works well, and is about as good as I can think of since it will determine the restart based on whether the restart handler was called by the individual hosts.

If anybody knows a better and simpler way, please do let me know; I would love to try it out.