  1. Sep 02, 2010
  2. Aug 18, 2010
  3. Jul 26, 2010
    • watcher: smarter handling of instance records · f5116c87
      Iustin Pop authored
      
      This patch implements a few changes to the instance handling. First, old
      instances that no longer exist on the cluster are removed from the
      state file, to keep things clean.
      
      Second, the instance restart counters are reset every 8 hours, since
      some error cases might be transient (e.g. networking issues, or a
      machine temporarily down); if a problem takes more than 5 restarts but
      is not permanent, the watcher would otherwise never restart the
      instance again. The value of 8 hours is, I think, both conservative (so
      as not to hammer the cluster too often with restarts) and fast enough
      to clear semi-transient problems.
      
      And last, if an instance is not restarted due to exhausted retries, a
      warning should be logged, otherwise it's hard to understand why the
      watcher doesn't want to restart an ERROR_down instance.
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: René Nussbaumer <rn@google.com>
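The counter-reset and stale-record cleanup described above could look roughly like the sketch below. This is only an illustration of the stated policy; the function and constant names (MaintainInstanceState, RETRY_EXPIRATION, MAX_RESTARTS) are assumptions, not the watcher's actual code.

```python
# Illustrative only: assumed names, not the real ganeti-watcher implementation.
import logging
import time

RETRY_EXPIRATION = 8 * 3600   # reset restart counters after 8 hours
MAX_RESTARTS = 5              # give up after this many failed restarts


def MaintainInstanceState(state, current_instances):
  """Prune stale records and expire old restart counters.

  state: dict mapping instance name to {"restarts": int, "when": float}
  current_instances: names of instances still defined on the cluster

  """
  # drop records for instances that no longer exist on the cluster
  for name in list(state):
    if name not in current_instances:
      logging.info("Removing record for vanished instance %s", name)
      del state[name]

  now = time.time()
  for name, record in state.items():
    if record["restarts"] and now - record["when"] > RETRY_EXPIRATION:
      # transient failures (network, node briefly down) should not keep an
      # instance blacklisted forever, so the counter expires after 8 hours
      record["restarts"] = 0
    elif record["restarts"] >= MAX_RESTARTS:
      # make the "won't restart" decision visible in the logs
      logging.warning("Not restarting instance %s: %d failed attempts",
                      name, record["restarts"])
```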
  4. Jul 09, 2010
  5. Jul 01, 2010
    • RAPI client: Switch to pycURL · 2a7c3583
      Michael Hanselmann authored
      
      Currently the RAPI client uses the urllib2 and httplib modules from
      Python's standard library. They're used with pyOpenSSL in a very fragile
      way, and there are known issues when receiving large responses from a RAPI
      server.
      
      By switching to pycURL we leverage the power and stability of the
      widely-used curl library (libcurl). This brings us much more flexibility
      than before, and timeouts were easy to implement (something that would
      have involved a lot of work with the built-in modules).
      
      There's one small drawback: programs using libcurl have to call
      curl_global_init(3) (available as pycurl.global_init) while exactly one
      thread is running (i.e. before any other threads are started) and are
      supposed to call curl_global_cleanup(3) (available as
      pycurl.global_cleanup) upon exiting. See the manpages for details. A
      decorator is provided to simplify this.
      
      Unittests for the new code are provided, increasing the test coverage of
      the RAPI client from 74% to 89%.
      
      Signed-off-by: Michael Hanselmann <hansmi@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
      Reviewed-by: Iustin Pop <iustin@google.com>
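A minimal sketch of such a decorator follows. The pycurl.global_init and pycurl.global_cleanup bindings are the real libcurl wrappers; the decorator name and structure here are assumptions rather than the exact helper added by this patch.

```python
# Sketch only: the decorator name and shape are assumptions, but
# pycurl.global_init/global_cleanup are the real libcurl bindings.
import functools
import pycurl


def uses_curl(fn):
  """Run fn between curl_global_init() and curl_global_cleanup().

  curl_global_init() must be called while exactly one thread is running,
  so the decorated function should be the program's entry point.

  """
  @functools.wraps(fn)
  def wrapper(*args, **kwargs):
    pycurl.global_init(pycurl.GLOBAL_ALL)
    try:
      return fn(*args, **kwargs)
    finally:
      pycurl.global_cleanup()
  return wrapper


@uses_curl
def main():
  # create RAPI client objects, start worker threads, etc.
  pass
```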
  6. Jun 30, 2010
  7. Jun 03, 2010
  8. Apr 09, 2010
  9. Apr 08, 2010
  10. Mar 23, 2010
  11. Mar 08, 2010
  12. Feb 26, 2010
  13. Feb 23, 2010
  14. Jan 28, 2010
  15. Jan 04, 2010
  16. Nov 25, 2009
  17. Nov 05, 2009
    • Add new “daemon-util” script to start/stop Ganeti daemons · f154a7a3
      Michael Hanselmann authored
      
      Until now, Ganeti started and stopped its own daemons using custom
      functions: to start a daemon it was simply executed, and to stop it
      again the appropriate signals were sent. Init scripts had to pay
      attention to the PID file and other details.
      
      With this patch, a new script is added (“daemon-util”, installed in
      $prefix/lib/ganeti/), centralizing the starting and stopping of daemons. The
      provided example init script is adjusted to use this new script. Ganeti's code
      no longer calls its own init script.
      
      Signed-off-by: Michael Hanselmann <hansmi@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
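For illustration, calling the new script from Python instead of sending signals directly might look like this; the installation path and the start/stop sub-commands are inferred from the commit message and should be treated as assumptions.

```python
# Illustration only: path and sub-commands are assumptions, not a spec.
import subprocess

DAEMON_UTIL = "/usr/lib/ganeti/daemon-util"   # i.e. $prefix/lib/ganeti/daemon-util


def start_daemon(name):
  """Start a Ganeti daemon, e.g. start_daemon("ganeti-noded")."""
  subprocess.check_call([DAEMON_UTIL, "start", name])


def stop_daemon(name):
  """Stop a previously started daemon via the same central script."""
  subprocess.check_call([DAEMON_UTIL, "stop", name])
```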
  18. Sep 18, 2009
  19. Aug 26, 2009
  20. Jul 24, 2009
  21. May 25, 2009
    • watcher: automatically restart noded/rapi · c4f0219c
      Iustin Pop authored
      
      This patch makes the watcher automatically restart the node and rapi
      daemons if they are not running (as judged by the PID file).
      
      This is not an exhaustive test; a better one would be a TCP connect to
      the port, and an even better one a simple protocol ping (e.g. GET / for
      rapi and an rpc_call_alive for noded), but since we don't know how the
      daemons have been started, we can't implement that today. rapi would
      need to write its SSL settings and port to a file, and noded something
      similar, so that we know how to connect.
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>
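A rough sketch of an "alive according to the PID file" check is shown below; the PID-file paths and the restart hook are placeholders, not Ganeti's actual ones.

```python
# Sketch with placeholder paths and a placeholder restart hook.
import errno
import logging
import os


def is_daemon_alive(pidfile):
  """Return True if the PID recorded in pidfile points to a live process."""
  try:
    with open(pidfile) as fh:
      pid = int(fh.read().strip())
  except (IOError, OSError, ValueError):
    return False
  try:
    os.kill(pid, 0)          # signal 0: existence/permission check only
  except OSError as err:
    return err.errno == errno.EPERM   # alive, but owned by another user
  return True


def ensure_daemons(restart_daemon):
  """Check noded/rapi and invoke the given restart callback when needed."""
  checks = [("ganeti-noded", "/var/run/ganeti/ganeti-noded.pid"),
            ("ganeti-rapi", "/var/run/ganeti/ganeti-rapi.pid")]
  for daemon, pidfile in checks:
    if not is_daemon_alive(pidfile):
      logging.warning("%s is not running, restarting it", daemon)
      restart_daemon(daemon)
```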
    • watcher: handle full and drained queue cases · 24edc6d4
      Iustin Pop authored
      
      Currently the watcher is broken when the queue is full, thus not
      fulfilling its job as a queue cleaner. It also doesn't handle the
      queue-drained status nicely.
      
      This patch makes a few changes:
        - archive jobs first, and only afterwards submit new jobs; this fixes
          the case where the queue is already full and there are jobs suited
          for archiving (but not the case where all the jobs are too young to
          be archived)
        - handle the job-queue-full and drained cases gracefully: instead of
          tracebacks, log such cases cleanly
        - reverse the initial value and special cases for update_file; we now
          whitelist instead of blacklist cases, since we have many more
          blacklist cases than whitelist ones, and we set the flag to True
          only after the run is successful
      
      The last change, especially, is a significant one: errors during the
      watcher run will no longer update the status file, and thus they won't
      be silently lost in the logs.
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>
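The reordered run could be structured like the sketch below; the helper callbacks and the exception names are stand-ins for the watcher's real symbols, and only the archive-first ordering and the whitelist flag are the point.

```python
# Stand-in names throughout; only the ordering and the whitelist idea matter.
import logging


class QueueFullError(Exception):
  pass


class QueueDrainedError(Exception):
  pass


def run_watcher_cycle(archive_jobs, do_checks, write_state_file):
  """One watcher cycle: archive first, then submit, then save state."""
  update_file = False
  try:
    archive_jobs()        # archive old jobs first, so a full queue can drain
    do_checks()           # only then submit verify-disks / restart jobs
    update_file = True    # whitelist: set only after a fully successful run
  except QueueFullError:
    logging.error("Job queue is full, skipping this watcher run")
  except QueueDrainedError:
    logging.error("Job queue is drained, skipping this watcher run")

  if update_file:
    write_state_file()    # a failed run keeps the old state-file timestamp
```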
  22. May 20, 2009
  23. May 19, 2009
    • watcher: try to restart the master if down · 7dfb83c2
      Iustin Pop authored
      
      Bugs in either our code or in associated libraries can bring the master daemon
      down, and this (due to the 2.0 architecture) stops all work on the cluster.
      
      Since the watcher already does periodic checks on the cluster, we modify
      it to try to start the master automatically when it fails to connect.
      This is attempted only once per cycle.
      
      Also, in this case, the watcher status file is not updated; its
      timestamp will thus reflect the time of the last successful connection
      to the master.
      
      Side note: the except errors.ConfigurationError part could be cleaned
      up, since in 2.0 we don't usually get that directly, and if we do it's
      an error and we shouldn't touch the file anyway; but that is not an rc5
      change.
      
      Signed-off-by: Iustin Pop <iustin@google.com>
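A simplified sketch of the "start the master once per cycle" logic; both helper callables here are hypothetical, not part of the watcher's API.

```python
# Hypothetical helpers; only the retry-once-per-cycle shape is the point.
import logging


def connect_or_start_master(get_luxi_client, start_master):
  """Connect to the master daemon, trying to start it once if that fails."""
  try:
    return get_luxi_client()
  except Exception:
    logging.warning("Cannot reach the master daemon, trying to start it")
    start_master()          # attempted at most once per watcher cycle
  # second and final attempt; if this also fails, the caller should leave the
  # status file untouched so its timestamp marks the last good connection
  return get_luxi_client()
```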
  24. Apr 06, 2009
    • Fix the output of watcher on non-master nodes · 2c404217
      Iustin Pop authored
      Currently the watcher spews error messages on non-master nodes. This
      patch cleans that up.
      
      Reviewed-by: imsnah
    • Change the watcher to use jobs instead of queries · 6dfcc47b
      Iustin Pop authored
      As per the mailing list discussion, this patch changes the watcher to
      use a single job (two opcodes) for getting the cluster state (node list
      and instance list); it will then compute the needed actions based on
      this data.
      
      The patch also archives this job and the verify-disks job.
      
      Reviewed-by: imsnah
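In rough terms, the watcher now submits one job carrying both query opcodes and waits for its result. The field lists and the polling helper below approximate the 2.0-era API and may not match the actual watcher code.

```python
# Approximation of the 2.0-era API; field names and helpers may differ.
from ganeti import cli, luxi, opcodes

client = luxi.Client()
job_id = client.SubmitJob([
  opcodes.OpQueryInstances(names=[], output_fields=["name", "status"]),
  opcodes.OpQueryNodes(names=[], output_fields=["name", "bootid", "offline"]),
  ])
# one round-trip: both opcode results arrive together when the job finishes
instance_data, node_data = cli.PollJob(job_id, cl=client)
```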
  25. Mar 09, 2009
    • watcher: fix startup sequence locking the master · cc962d58
      Iustin Pop authored
      Currently, the watcher startup sequence does:
        - open a luxi client
        - get the instance list
        - get the node boot ids
        - open and lock the status file, and:
          - archive jobs
          - restart the down instances
          - check disks
      
      This, of course, can lead to problems when a node is (genuinely or not)
      locked for more than (watcher interval * maximum query clients) time. At
      that point the master is completely unresponsive until the node is
      unlocked, and all the queued watchers then exit with an error because
      the state file is locked by the first instance.
      
      This patch reworks the startup sequence to first open/lock the status
      file, and only then open a luxi client. This should prevent the above
      case.
      
      Reviewed-by: ultrotter
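A simplified sketch of the new ordering, using plain fcntl locking; the state-file path and the two callbacks are placeholders rather than the watcher's real names.

```python
# Placeholder path/callbacks; the point is the lock-before-connect ordering.
import fcntl


def open_statefile_locked(path):
  """Open and lock the watcher state file, failing fast if already locked."""
  statefile = open(path, "a+")
  fcntl.flock(statefile.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
  return statefile


def run_watcher(connect_to_master, run_checks,
                state_path="/var/lib/ganeti/watcher.data"):
  # 1. take the state-file lock first: a concurrent watcher fails right here
  #    instead of piling up as yet another blocked query client on the master
  statefile = open_statefile_locked(state_path)
  # 2. only now open a luxi client and do the actual work
  client = connect_to_master()
  run_checks(client, statefile)
```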
  26. Feb 24, 2009
    • Remove the extra_args parameter in instance start · 07813a9e
      Iustin Pop authored
      This patch removes the extra_args parameter and instead switches
      instance start to the HV_KERNEL_ARGS hypervisor option.
      
      This is a big change, but it's a needed cleanup: this extra parameter on
      all RPC calls is not generic, and we also need to have a persistent
      value here.
      
      Reviewed-by: imsnah
  27. Feb 16, 2009
    • watcher: fix checking of boot IDs · 3448aa22
      Iustin Pop authored
      The recent change to the watcher (commit 2151) that made it handle
      offline nodes also saves the offline attribute to the state file, but
      this is not needed and it breaks the checking of the boot ID. This patch
      simply removes it, restoring the correct behaviour.
      
      Reviewed-by: imsnah
    • watcher: autoarchive old jobs · f07521e5
      Iustin Pop authored
      This patch adds auto-archiving of jobs older than 6 hours to the
      watcher.
      
      Reviewed-by: imsnah
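The 6-hour policy amounts to something like the following; the job-listing and archive calls are placeholder callbacks rather than the luxi client's real methods.

```python
# Placeholder callbacks; only the age-based cutoff is the point.
import time

ARCHIVE_AGE = 6 * 3600   # seconds


def archive_old_jobs(list_finished_jobs, archive_job):
  """Archive finished jobs whose end timestamp is older than six hours."""
  cutoff = time.time() - ARCHIVE_AGE
  archived = 0
  for job_id, end_timestamp in list_finished_jobs():
    if end_timestamp < cutoff:
      archive_job(job_id)
      archived += 1
  return archived
```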
  28. Feb 04, 2009
    • Implement lockless query operations · ec79568d
      Iustin Pop authored
      This patch adds the framework for, and enables, lockless
      OpQueryInstances. This means that instances will be shown in ERROR_up or
      ERROR_down state, even though this is not an error (but just an
      in-progress job).
      
      The framework is implemented as follows:
        - the OpQueryInstances, OpQueryNodes and OpQueryExports opcodes take
          an additional “use_locking” flag which will denote whether to lock
          or not; this patch only implements this for LUQueryInstances
        - the luxi query functions take an additional argument use_locking
          which is passed to the master daemon, and then passed to the above
          opcodes
        - cli.py exports a new SYNC_OPT command line option which implements
          setting this flag to true
        - except for gnt-instance list, which uses this option, and for
          name-only queries (e.g. QueryNodes(fields=["names"])), all other
          callers set this flag to True
        - RAPI also sets the flag to True
      
      The patch was tested with a continuous (0.2s sleep in-between)
      gnt-instance list during a burnin, and no problems were observed.
      
      Reviewed-by: ultrotter
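At the luxi level the change boils down to an extra use_locking argument on the query calls, roughly as below; the exact argument names and order may differ from the final client API.

```python
# Rough illustration; argument order/names may not match the client exactly.
from ganeti import luxi

client = luxi.Client()

# the conservative, lock-based query most internal callers keep using
locked = client.QueryInstances([], ["name", "status"], True)

# the new lockless variant used by gnt-instance list: instances touched by an
# in-progress job may show up as ERROR_up/ERROR_down, which is not an error
unlocked = client.QueryInstances([], ["name", "status"], False)
```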