1. 26 Jul, 2010 4 commits
    • Change the meaning of call_node_start_master · 91492e57
      Iustin Pop authored
      Currently, backend.StartMaster (the function behind this RPC call) will
      activate the master IP and then, if the start_daemons parameter is true,
      it will also activate the master role.
      While this works, it has two issues:
      - first, it will activate the master IP unconditionally, even if this
        node will not start the master daemon due to missing votes
      - second, the activation of the IP is done twice if start_daemons is
        true, because the master daemon does its own activation too
      This behaviour seems to be unmodified since Summer 2008, so probably any
      rationale on why this is done in two places is forgotten.
      The patch changes this so that the function does *either* IP activation or
      master role activation but not both. So the IP will be activated only
      once (from the master daemon or from LURenameCluster), and it will only
      be done if the masterd got enough votes for startup.
      I can see only one downside to this change: if masterd won't actually
      start (due to missing votes), RAPI will still start, and without the
      master IP activated. But this is no worse than before, when both RAPI
      was running and the IP was activated.
      Note that the behaviour of StopMaster remains the same, as no one else
      does the IP removal.
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: René Nussbaumer <rn@google.com>
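The either/or behaviour described in this commit can be pictured with a minimal sketch. This is not the actual backend.StartMaster code: the function name `start_master` and the two callables passed in are hypothetical stand-ins, and only the branching follows the commit message.

```python
def start_master(start_daemons, activate_ip, start_master_daemon):
    """Do either master-role activation or IP activation, never both.

    All names here are hypothetical stand-ins for this sketch.
    """
    if start_daemons:
        # The master daemon activates the IP itself, and only after it
        # has collected enough votes, so the IP is not touched here.
        start_master_daemon()
        return "daemons"
    # Callers that only need the address brought up (for example a
    # cluster rename) take this branch.
    activate_ip()
    return "ip"


calls = []
start_master(True, lambda: calls.append("ip"), lambda: calls.append("masterd"))
start_master(False, lambda: calls.append("ip"), lambda: calls.append("masterd"))
print(calls)
```

Each call triggers exactly one of the two actions, which is what removes the double activation described above.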
    • masterd: move the IP activation from Exec to Check · 340f4757
      Iustin Pop authored
      Currently, the master IP activation is done in the Exec function. Since
      the original masterd process returns after forking, and Exec is run in
      the (grand)child process, this means that after 'ganeti-masterd' has
      returned there are still initialization tasks running.
      Normally this is not a problem, but in cases where one does quick master
      failovers, this creates a race condition which hits the QA scripts
      especially hard.
      To solve this, and make the startup process cleaner (the system is in
      steady state after the command has returned, even though masterd startup
      could still fail), we move the IP activation to Check(). This also
      allows error messages about the IP activation to be seen on the console.
      With this patch enabled, I can no longer reproduce the double-failover
      errors, which were occurring before in 4/5 cases.
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: René Nussbaumer <rn@google.com>
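The ordering argument in this commit can be illustrated with a small simulation. It does not fork and is not the masterd code: `run_daemon`, `check` and `exec_fn` are hypothetical names, and the "child" is modelled as a queued callable that runs only after the command has returned.

```python
def run_daemon(check, exec_fn, log):
    """Simulate a daemon launcher: check runs before the fork, exec after."""
    pending = []
    # Foreground phase: anything done (or raised) here happens before the
    # command returns, so errors are visible on the console.
    check()
    # Simulated fork: the long-running part is handed to the "child".
    pending.append(exec_fn)
    log.append("command returned")
    # The child continues after the parent has already returned.
    for task in pending:
        task()


events = []
run_daemon(
    check=lambda: events.append("ip activated"),
    exec_fn=lambda: events.append("main loop started"),
    log=events,
)
print(events)
```

With the activation placed in the check phase, it is complete by the time the command returns, so a failover started right afterwards no longer races against it.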
    • Move the UsesRPC decorator from cli to rpc · e0e916fe
      Iustin Pop authored
      This is needed because not just the cli scripts need this decorator, but
      the master daemon too (and it already duplicated the code once).
      In cli.py we just leave a stub, so that we don't have to modify all the
      scripts to import rpc.py.
      We then change the master daemon code to reuse this decorator, instead
      of duplicating it.
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: René Nussbaumer <rn@google.com>
    • watcher: smarter handling of instance records · f5116c87
      Iustin Pop authored
      This patch implements a few changes to the instance handling. First, old
      instances which no longer exist on the cluster are removed from the
      state file, to keep things clean.
      Second, the instance restart counters are reset every 8 hours, since
      some error cases might be transient (e.g. networking issues, or machine
      temporarily down), and if the problem takes more than 5 restarts but is
      not permanent, watcher will not restart the instance. The value of 8
      hours is, I think, both conservative (so as not to hammer the cluster too
      often with restarts) and fast enough to clear semi-transient problems.
      And last, if an instance is not restarted due to exhausted retries, a
      warning should be logged, otherwise it's hard to understand why watcher
      doesn't want to restart an ERROR_down instance.
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: René Nussbaumer <rn@google.com>
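The three changes to the instance records can be sketched as follows. This is an illustration under assumptions, not the watcher's code: the record layout, the function names, and the choice to measure the 8-hour window from the last restart attempt are all hypothetical. Only the limit of 5 restarts and the 8-hour expiry come from the commit message.

```python
MAX_TRIES = 5                # restarts allowed before watcher gives up
RETRY_EXPIRATION = 8 * 3600  # seconds after which a counter is reset


def update_records(records, cluster_instances, now):
    """Drop records of vanished instances and expire old counters."""
    for name in list(records):
        if name not in cluster_instances:
            # The instance no longer exists on the cluster.
            del records[name]
        elif now - records[name]["last_attempt"] > RETRY_EXPIRATION:
            # Assumption: expiry is measured from the last attempt.
            del records[name]


def should_restart(records, name, now, warnings):
    """Return True if a restart may be attempted; warn when exhausted."""
    rec = records.setdefault(name, {"restarts": 0, "last_attempt": now})
    if rec["restarts"] >= MAX_TRIES:
        warnings.append("%s: not restarting, retries exhausted" % name)
        return False
    rec["restarts"] += 1
    rec["last_attempt"] = now
    return True


records = {
    "gone": {"restarts": 2, "last_attempt": 0},
    "web1": {"restarts": 5, "last_attempt": 0},
}
warnings = []

update_records(records, {"web1"}, now=3600)
first = should_restart(records, "web1", 3600, warnings)

update_records(records, {"web1"}, now=9 * 3600)
second = should_restart(records, "web1", 9 * 3600, warnings)
print(first, second, warnings)
```

In the usage above, the exhausted instance is refused with a warning one hour in, then becomes eligible again once its record has expired.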
  2. 23 Jul, 2010 6 commits
  3. 22 Jul, 2010 2 commits
  4. 21 Jul, 2010 3 commits
  5. 20 Jul, 2010 4 commits
  6. 19 Jul, 2010 3 commits
  7. 16 Jul, 2010 10 commits
  8. 15 Jul, 2010 6 commits
  9. 13 Jul, 2010 2 commits