Iustin Pop authored
This patch implements a few changes to the instance handling. First, old
instances which no longer exist on the cluster are removed from the
state file, to keep things clean.

Second, the instance restart counters are reset every 8 hours, since
some error cases might be transient (e.g. networking issues, or a machine
being temporarily down). Without the reset, if such a problem outlasts 5
restarts but is not permanent, watcher will never restart the instance
again. The value of 8 hours is, I think, both conservative (so as not to
hammer the cluster too often with restarts) and fast enough to clear
semi-transient problems.

And last, if an instance is not restarted due to exhausted retries, a
warning should be logged; otherwise it's hard to understand why watcher
doesn't want to restart an ERROR_down instance.
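
The three behaviours described above can be sketched roughly as follows.
This is only an illustration, not the code from this patch: the state
layout, the function names and the constant names are all hypothetical,
and the choice to measure the 8 hours from the last recorded restart
attempt is an assumption. Only the values (5 restarts, 8 hours) come
from the message.

```python
import logging
import time

# Values taken from the commit message; the names are hypothetical.
MAX_RESTARTS = 5
RETRY_EXPIRATION = 8 * 60 * 60  # 8 hours, in seconds


def maintain_state(state, cluster_instances, now=None):
    """Clean up a hypothetical watcher state dict in place.

    state maps instance name -> {"restart_count": int, "restart_when": float}.
    """
    now = time.time() if now is None else now
    for name in list(state):
        record = state[name]
        if name not in cluster_instances:
            # Instance no longer exists on the cluster: drop its record.
            del state[name]
        elif now - record["restart_when"] >= RETRY_EXPIRATION:
            # Assumption: the counter expires 8 hours after the last attempt.
            del state[name]
    return state


def should_restart(state, name, now=None):
    """Return True and record the attempt, or warn and return False."""
    now = time.time() if now is None else now
    count = state.get(name, {}).get("restart_count", 0)
    if count >= MAX_RESTARTS:
        logging.warning(
            "Not restarting instance %s: retries exhausted (%d attempts)",
            name, count)
        return False
    state[name] = {"restart_count": count + 1, "restart_when": now}
    return True
```

With a state like this, an instance that has used up its 5 attempts is
skipped with a warning until its record expires, after which it gets a
fresh set of attempts.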

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
f5116c87