    watcher: smarter handling of instance records · f5116c87
    Iustin Pop authored
    
    This patch implements a few changes to the instance handling. First, old
    instances which no longer exist on the cluster are removed from the
    state file, to keep things clean.
    
    Second, the instance restart counters are now reset every 8 hours. Some
    error cases are transient (e.g. networking issues, or a machine that is
    temporarily down); if such a problem takes more than 5 restarts to clear
    but is not permanent, watcher would otherwise stop restarting the
    instance for good. The value of 8 hours is, I think, both conservative
    (so as not to hammer the cluster too often with restarts) and fast
    enough to clear semi-transient problems.
    
    And last, if an instance is not restarted because its retries are
    exhausted, a warning is now emitted; otherwise it's hard to understand
    why watcher doesn't want to restart an ERROR_down instance.
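
The three changes above can be illustrated with a minimal sketch. This is not the code from the patch: the state layout (a dict of per-instance records) and the names `prune_missing`, `restart_attempts`, `should_restart`, `MAX_RESTARTS` and `RETRY_EXPIRATION` are hypothetical, and the exact threshold semantics (here, 5 attempts are allowed) are an assumption.

```python
import logging

# Assumed values taken from the commit message; the names are hypothetical.
MAX_RESTARTS = 5
RETRY_EXPIRATION = 8 * 3600  # 8 hours, in seconds


def prune_missing(state, known_instances):
    """Drop records of instances that no longer exist on the cluster."""
    for name in list(state):
        if name not in known_instances:
            del state[name]


def restart_attempts(state, name, now):
    """Return the attempt count, resetting it once the record has expired."""
    record = state.get(name)
    if record is None:
        return 0
    if now - record["last_attempt"] > RETRY_EXPIRATION:
        # The failure may have been transient: forget the old attempts.
        del state[name]
        return 0
    return record["attempts"]


def should_restart(state, name, now, log=logging.getLogger("watcher-sketch")):
    """Decide whether to restart `name`, warning when retries are exhausted."""
    attempts = restart_attempts(state, name, now)
    if attempts >= MAX_RESTARTS:
        log.warning("Not restarting %s: %d restart attempts exhausted",
                    name, attempts)
        return False
    state[name] = {"attempts": attempts + 1, "last_attempt": now}
    return True
```

In this sketch an exhausted record is left untouched, so its timestamp stays at the last real attempt and the counter clears 8 hours after that attempt rather than 8 hours after the last check.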
    
    Signed-off-by: Iustin Pop <iustin@google.com>
    Reviewed-by: René Nussbaumer <rn@google.com>