  1. Jul 26, 2010
    • masterd: move the IP activation from Exec to Check · 340f4757
      Iustin Pop authored
      
      Currently, the master IP activation is done in the Exec function. Since
      the original masterd process returns after forking, and Exec is run in
      the (grand)child process, this means that after 'ganeti-masterd' has
      returned there are still initialization tasks running.
      
      Normally this is not a problem, but in cases where one does quick master
      failovers, this creates a race condition which hits the QA scripts
      especially hard.
      
      To solve this, and make the startup process cleaner (the system is in
      steady state after the command has returned, even though masterd startup
      could still fail), we move the IP activation to Check(). This also
      allows error messages about the IP activation to be seen on the console.
      
      With this patch enabled, I can no longer reproduce the double-failover
      errors, which were occurring before in 4/5 cases.
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: René Nussbaumer <rn@google.com>
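The ordering argument in this commit can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the masterd code: `start_daemon`, `check_fn`, `exec_fn`, `fake_detach` and `activate_master_ip` are hypothetical names, and the fork is replaced by a deferred call so the example can run anywhere.

```python
from typing import Callable, List

events: List[str] = []
pending: List[Callable[[], None]] = []


def activate_master_ip() -> None:
    # Hypothetical stand-in for the real master IP activation.
    events.append("ip-activated")


def check_fn() -> None:
    # Runs in the foreground, before the launcher returns, so failures
    # here can be reported on the console.
    activate_master_ip()


def exec_fn() -> None:
    # Runs in the detached (grand)child, possibly after the launcher
    # has already returned to its caller.
    events.append("serving")


def fake_detach(fn: Callable[[], None]) -> None:
    # Stand-in for fork/daemonize: defer the work instead of forking.
    pending.append(fn)


def start_daemon(check: Callable[[], None],
                 exec_: Callable[[], None],
                 detach: Callable[[Callable[[], None]], None]) -> None:
    check()        # synchronous initialization
    detach(exec_)  # everything else happens in the background


start_daemon(check_fn, exec_fn, fake_detach)
events_at_return = list(events)  # what the caller can rely on

# Later, the "child" gets to run.
for fn in pending:
    fn()
```

With the activation placed in `check_fn`, it has already happened by the time `start_daemon` returns. Had it been placed in `exec_fn`, a quick failover issued right after the launcher returned could race with it.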
    • Move the UsesRPC decorator from cli to rpc · e0e916fe
      Iustin Pop authored
      
      This is needed because not just the cli scripts need this decorator, but
      the master daemon too (and it already duplicated the code once).
      
      In cli.py we just leave a stub, so that we don't have to modify all the
      scripts to import rpc.py.
      
      We then change the master daemon code to reuse this decorator, instead
      of duplicating it.
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: René Nussbaumer <rn@google.com>
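The "shared decorator plus stub" arrangement described in this commit might look roughly like the sketch below. It assumes the decorator brackets the wrapped function with RPC setup and teardown, which the commit message does not state; `init_rpc`, `shutdown_rpc`, `run_with_rpc` and `uses_rpc` are hypothetical names, and both "modules" are collapsed into one file for the sake of a runnable example.

```python
import functools
from typing import Any, Callable, List

calls: List[str] = []


# --- what would live in the rpc module (hypothetical names) ---

def init_rpc() -> None:
    calls.append("init")


def shutdown_rpc() -> None:
    calls.append("shutdown")


def run_with_rpc(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Decorator: set up RPC before the call and tear it down afterwards."""
    @functools.wraps(fn)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        init_rpc()
        try:
            return fn(*args, **kwargs)
        finally:
            # Teardown runs even if fn raises.
            shutdown_rpc()
    return wrapper


# --- what would stay behind in the cli module: a thin stub ---

def uses_rpc(fn: Callable[..., Any]) -> Callable[..., Any]:
    # Existing scripts keep using the old name; the logic lives elsewhere.
    return run_with_rpc(fn)


@uses_rpc
def list_nodes(prefix: str) -> str:
    calls.append("body")
    return prefix + "node1"


result = list_nodes("x-")
```

The stub keeps the old name working for existing callers, while the daemon side can apply `run_with_rpc` directly instead of carrying its own copy.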
    • watcher: smarter handling of instance records · f5116c87
      Iustin Pop authored
      
      This patch implements a few changes to the instance handling. First, old
      instances which no longer exist on the cluster are removed from the
      state file, to keep things clean.
      
      Second, the instance restart counters are reset every 8 hours, since
      some error cases might be transient (e.g. networking issues, or a machine
      temporarily down); if such a problem takes more than 5 restarts to clear
      but is not permanent, watcher would otherwise never restart the instance
      again. The value of 8 hours is, I think, both conservative (so as not to
      hammer the cluster too often with restarts) and fast enough to clear
      semi-transient problems.
      
      And last, if an instance is not restarted due to exhausted retries, a
      warning should be logged; otherwise it's hard to understand why watcher
      doesn't want to restart an ERROR_down instance.
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: René Nussbaumer <rn@google.com>
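The three behaviours listed in this commit (pruning stale records, expiring restart counters, warning on exhausted retries) can be sketched as follows. This is a minimal illustration, not the watcher's actual code: the record layout, the function names and the choice to measure the 8 hours from the last restart attempt are all assumptions. Only the limit of 5 restarts and the 8-hour window come from the commit message.

```python
import logging
from typing import Dict, Iterable

MAX_RESTARTS = 5             # from the commit message
RETRY_EXPIRATION = 8 * 3600  # 8 hours, in seconds

# Hypothetical record layout: {"restarts": int, "last_attempt": float}
State = Dict[str, Dict[str, float]]


def prune_and_reset(state: State, existing: Iterable[str], now: float) -> None:
    """Drop records of vanished instances and expire old restart counters."""
    known = set(existing)
    for name in list(state):
        if name not in known:
            # The instance no longer exists on the cluster.
            del state[name]
        elif now - state[name]["last_attempt"] > RETRY_EXPIRATION:
            # Assumption: the window is measured from the last attempt.
            state[name]["restarts"] = 0


def should_restart(state: State, name: str, now: float) -> bool:
    """Return True and record the attempt, or warn and return False."""
    record = state.setdefault(name, {"restarts": 0, "last_attempt": now})
    if record["restarts"] >= MAX_RESTARTS:
        logging.warning("Not restarting %s: retries exhausted (%d)",
                        name, MAX_RESTARTS)
        return False
    record["restarts"] += 1
    record["last_attempt"] = now
    return True


now = 9 * 3600.0
state: State = {
    "gone": {"restarts": 2, "last_attempt": 0.0},
    "old": {"restarts": 5, "last_attempt": 0.0},
    "recent": {"restarts": 5, "last_attempt": now - 60},
}
prune_and_reset(state, ["old", "recent"], now)
old_ok = should_restart(state, "old", now)
recent_ok = should_restart(state, "recent", now)
```

Here `old` becomes eligible again because its last attempt is more than 8 hours in the past, while `recent` is refused and a warning is logged explaining why.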