    Fix bootstrap.MasterFailover race with watcher · 21004460
    Iustin Pop authored
    
    
    This fixes a recently diagnosed race condition between master failover
    and the watcher.
    
    Currently, the master failover first stops the master daemon, checks
    that the IP is no longer reachable, and then distributes the updated
    configuration. Between the stop and the distribution, the watcher can
    start the master daemon on the old node again, since ssconf still
    names that node as the master (and all nodes vote accordingly).
    
    In even stranger cases, the master daemon starts and, before it manages
    to open the configuration file, the file is updated, which means the
    master will respond to QueryClusterInfo with another node as the real
    master.
    
    This patch reorders the actions during master failover:
    
    - first, we redistribute a fixed config; this means the old master will
      refuse to update its own config file and ssconf, and that most jobs
      that change state will fail to finish
    - we then immediately kill the old master daemon; after this step, the
      watcher will be unable to restart it, since the master will refuse
      startup
    - only then do we check for IP reachability, etc.
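
    The effect of the reordering can be sketched with a toy model. This is
    an illustration only, not Ganeti code: ToyCluster, watcher_run,
    failover_old_order and failover_new_order are hypothetical names, and
    the startup vote is reduced to a simple majority over per-node ssconf
    entries. The watcher is assumed to fire at the worst moment, right
    after the daemon has been stopped.

```python
from collections import Counter


class ToyCluster:
    """Toy stand-in for a cluster; not Ganeti's real data model."""

    def __init__(self, nodes, master):
        # Each node keeps its own ssconf copy naming the master it believes in.
        self.ssconf = {node: master for node in nodes}
        # Node on which the master daemon currently runs, or None.
        self.daemon_on = master

    def voted_master(self):
        """Return the node that most ssconf copies name as master."""
        return Counter(self.ssconf.values()).most_common(1)[0][0]

    def start_daemon(self, node):
        """Start the daemon on `node` only if the vote agrees; else refuse."""
        if self.voted_master() != node:
            return False
        self.daemon_on = node
        return True

    def stop_daemon(self):
        self.daemon_on = None

    def distribute_config(self, new_master):
        # Overwrite every node's ssconf entry with the new master name.
        for node in self.ssconf:
            self.ssconf[node] = new_master

    def watcher_run(self, node):
        """Watcher on `node`: if no daemon is running, try to start one."""
        if self.daemon_on is None:
            return self.start_daemon(node)
        return False


def failover_old_order(cluster, old, new):
    """Old order: stop, (IP check), distribute. Returns True if the watcher revived the daemon."""
    cluster.stop_daemon()
    revived = cluster.watcher_run(old)  # watcher fires inside the race window
    # The IP reachability check would happen here.
    cluster.distribute_config(new)
    return revived


def failover_new_order(cluster, old, new):
    """New order: distribute, kill, (IP check). Returns True if the watcher revived the daemon."""
    cluster.distribute_config(new)
    cluster.stop_daemon()
    revived = cluster.watcher_run(old)  # same timing, but the vote now disagrees
    # The IP reachability check would happen here.
    return revived


if __name__ == "__main__":
    nodes = ["node1", "node2", "node3"]
    for failover in (failover_old_order, failover_new_order):
        cluster = ToyCluster(nodes, master="node1")
        revived = failover(cluster, old="node1", new="node2")
        print(failover.__name__, "revived:", revived, "daemon on:", cluster.daemon_on)
```

    In this model the old order leaves a daemon running on the old node
    while every ssconf copy already names the new master; with the new
    order the restart attempt is refused and the daemon can then be
    started on the new master.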
    
    I've tested the new version against a concurrent launch of the watcher;
    while my tests are not exhaustive, two things can happen: the watcher
    sees the daemons as dead and tries to restart them, which also fails; or
    it simply gets an error while reading from the master daemon. Both of
    these should be OK.
    
    Signed-off-by: Iustin Pop <iustin@google.com>
    Reviewed-by: Michael Hanselmann <hansmi@google.com>
bootstrap.py 27 KB