1. 28 Oct, 2010 2 commits
  2. 27 Oct, 2010 17 commits
  3. 26 Oct, 2010 16 commits
  4. 25 Oct, 2010 3 commits
  5. 22 Oct, 2010 2 commits
    • ConfigWriter: prevent using a foreign config · eb180fe2
      Iustin Pop authored
      If the configuration file doesn't denote this node as master, we prevent
      startup. This would have detected our previous race condition more
      easily, hence we add it as a permanent check.
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>
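      The startup check described in this commit can be illustrated with a
      minimal sketch. This is not the actual ConfigWriter code: the names
      ForeignConfigError and verify_config_owner are hypothetical, and the
      sketch assumes the master name from the configuration and the local
      hostname are both available as plain strings.

```python
class ForeignConfigError(Exception):
    """Hypothetical error: the loaded config names another node as master."""


def verify_config_owner(config_master: str, my_hostname: str) -> None:
    """Refuse to continue if the configuration belongs to a different master.

    Both arguments are assumptions for illustration; the real check may
    obtain and compare these values differently.
    """
    if config_master != my_hostname:
        raise ForeignConfigError(
            "configuration names %r as master, but this node is %r; "
            "refusing to start" % (config_master, my_hostname)
        )


# Usage: the check passes silently on the real master and raises elsewhere.
verify_config_owner("node1.example.com", "node1.example.com")
try:
    verify_config_owner("node2.example.com", "node1.example.com")
except ForeignConfigError as err:
    print("startup prevented:", err)
```

      Failing at startup rather than later means a stale master daemon never
      gets the chance to answer queries from a configuration it does not own.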
    • Fix bootstrap.MasterFailover race with watcher · 21004460
      Iustin Pop authored
      This fixes a recently diagnosed race condition between master failover
      and the watcher.
      Currently, the master failover first stops the master daemon, checks
      that the IP is no longer reachable, and then distributes the updated
      configuration. Between the stop and the distribution, it can happen that
      the watcher starts the master daemon on the old node again, since ssconf
      still points the master to it (and all nodes vote so).
      In even weirder cases, the master daemon starts, but the configuration
      file is updated before the daemon manages to open it, which means the
      master will respond to QueryClusterInfo with another node as the real
      master.
      This patch reorders the actions during master failover:
      - first, we redistribute a fixed config; this means the old master will
        refuse to update its own config file and ssconf, and that most jobs
        that change state will fail to finish
      - we then immediately kill it; after this step, the watcher will be
        unable to start it, since the master will refuse startup
      - and only then we check for IP reachability, etc.
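      The reordered steps above can be sketched as a small simulation. This
      is only a toy model, not the bootstrap.MasterFailover code: the Node
      class, its fields, and the failover function are hypothetical, and the
      sketch assumes that a node refuses to start the master daemon whenever
      its local config copy names another node as master. Starting the daemon
      on the new master at the end is likewise an assumption added so the
      simulation has a visible result.

```python
class Node:
    """Toy model of a cluster node (hypothetical, for illustration only)."""

    def __init__(self, name, config_master):
        self.name = name
        self.config_master = config_master  # master named in the local config
        self.master_running = False

    def start_master(self):
        """Start the master daemon only if the local config names this node."""
        if self.config_master != self.name:
            return False  # foreign config: refuse startup
        self.master_running = True
        return True

    def stop_master(self):
        self.master_running = False


def failover(old, new, nodes, watcher=None):
    """Apply the reordered failover steps to the toy cluster."""
    # Step 1: redistribute a config that already names the new master.
    for node in nodes:
        node.config_master = new.name
    # Step 2: stop the old master daemon right away.
    old.stop_master()
    # A watcher run landing here can no longer restart the old master.
    if watcher is not None:
        watcher(old)
    # Step 3: IP reachability checks etc. would go here (omitted).
    new.start_master()


old = Node("node1", config_master="node1")
new = Node("node2", config_master="node1")
old.start_master()
failover(old, new, [old, new], watcher=lambda node: node.start_master())
print(old.master_running, new.master_running)
```

      In this model, the simulated watcher's restart attempt on the old node
      fails because the config was redistributed before the daemon was
      stopped, which is the window the old ordering left open.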
      I've tested the new version against a concurrent launch of the watcher;
      while my tests are not very exhaustive, two things can happen: the
      watcher sees the daemons as dead and tries to restart them, which also
      fails; or it simply gets an error while reading from the master daemon.
      Both of these should be OK.
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>