1. 25 May, 2009 1 commit
    • watcher: handle full and drained queue cases · 24edc6d4
      Iustin Pop authored
      Currently the watcher is broken when the queue is full, thus not
      fulfilling its job as a queue cleaner. It also doesn't handle the
      queue drained status gracefully.
      This patch makes a few changes:
        - first archive jobs, and only afterwards submit jobs; this fixes the
          case where the queue is already full and there are jobs suited for
          archiving (but not the case where the jobs are all too young to be
          archived)
        - handle the job queue full and drained cases gracefully: instead of
          tracebacks, log such cases cleanly
        - reverse the initial value and special cases for update_file; we now
          whitelist instead of blacklist cases, since we have many more
          blacklist cases than whitelist ones, and we set the flag to True
          only after the run is successful
      The last change, especially, is a significant one: now errors during the
      watcher run will not update the status file, and thus they won't be lost
      again in the logs.
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>
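The ordering and the whitelist-style flag described in this commit can be sketched roughly as follows. This is a minimal illustration in Python 3, not the watcher's actual code: the exception classes and the three callables are hypothetical stand-ins, and only the name update_file comes from the commit message.

```python
import logging


class QueueFullError(Exception):
    """Hypothetical stand-in for a 'job queue full' error."""


class QueueDrainedError(Exception):
    """Hypothetical stand-in for a 'job queue drained' error."""


def run_watcher(archive_jobs, submit_jobs, save_status):
    """Run one watcher cycle; return True if the status file was updated."""
    update_file = False  # whitelist: only a fully successful run flips this
    try:
        archive_jobs()  # archive first, so a full queue can free up slots
        submit_jobs()   # only afterwards submit new jobs
        update_file = True
    except QueueFullError:
        logging.error("Job queue is full, cannot submit jobs")
    except QueueDrainedError:
        logging.error("Job queue is drained, cannot submit jobs")
    finally:
        # Any other exception propagates, and the status file stays untouched.
        if update_file:
            save_status()
    return update_file
```

With this shape, a new failure mode needs no extra handling to keep the status file from being refreshed, which is the point of whitelisting.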
  2. 20 May, 2009 1 commit
  3. 19 May, 2009 1 commit
    • watcher: try to restart the master if down · 7dfb83c2
      Iustin Pop authored
      Bugs in either our code or in associated libraries can bring the master daemon
      down, and this (due to the 2.0 architecture) stops all work on the cluster.
      Since the watcher already does periodic checks on the cluster, we modify
      it to try to start the master automatically in case of failures to
      connect. This will be tried only once per cycle.
      Also, in this case, we modify the code so that the watcher status file
      is not updated; its timestamp will thus reflect the time of the last
      successful connection to the master.
      Side note: the except errors.ConfigurationError part could be cleaned
      up, since in 2.0 we don't usually get that directly, and if we do it's
      an error and we shouldn't touch the file anyway; but that is not a rc5
      Signed-off-by: Iustin Pop <iustin@google.com>
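A rough sketch of the "try once per cycle" behaviour described above, in Python 3. The connect and start_master callables are hypothetical placeholders for whatever the watcher uses to reach and launch the master daemon, and the use of ConnectionError is an assumption made for illustration.

```python
def connect_with_restart(connect, start_master):
    """Try to connect; on failure start the master once and retry.

    Returns the client, or None if the master is still unreachable, in
    which case the caller should leave the status file untouched.
    """
    try:
        return connect()
    except ConnectionError:
        start_master()  # attempted only once per watcher cycle
    try:
        return connect()
    except ConnectionError:
        return None
```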
  4. 06 Apr, 2009 2 commits
    • Fix the output of watcher on non-master nodes · 2c404217
      Iustin Pop authored
      Currently the watcher spews error messages on non-master nodes. This
      cleans it up.
      Reviewed-by: imsnah
    • Change the watcher to use jobs instead of queries · 6dfcc47b
      Iustin Pop authored
      As per the mailing list discussion, this patch changes the watcher to
      use a single job (two opcodes) for getting the cluster state (node list
      and instance list); it will then compute the needed actions based on
      this data.
      The patch also archives this job and the verify-disks job.
      Reviewed-by: imsnah
  5. 09 Mar, 2009 1 commit
    • watcher: fix startup sequence locking the master · cc962d58
      Iustin Pop authored
      Currently, the watcher startup sequence does:
        - open a luxi client
        - get the instance list
        - get the node boot ids
        - open and lock the status file, and:
          - archive jobs
          - restart the down instances
          - check disks
      This, of course, can lead to problems when a node is (genuinely or not)
      locked for more than (watcher interval * maximum query clients) time. At
      that point, the master is completely unresponsive until the node is
      unlocked and all the watchers exit with an error due to the state file
      being locked by the first instance.
      This patch reworks the startup sequence to first open/lock the status
      file, and only then open a luxi client. This should prevent the above
      problem.
      Reviewed-by: ultrotter
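The reworked ordering (lock the status file first, contact the master only afterwards) might look like the sketch below. It is Python 3 and Unix-only. Using fcntl.flock with a non-blocking exclusive lock is an assumption, since the commit message does not say which locking call the watcher uses. The function names and the open_client callable are hypothetical.

```python
import fcntl
import os


def open_and_lock(path):
    """Open the status file and take a non-blocking exclusive lock on it."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        os.close(fd)  # another watcher holds the lock: give up immediately
        raise
    return fd


def run_cycle(status_path, open_client):
    """Lock the status file first; only then open a client to the master."""
    fd = open_and_lock(status_path)
    try:
        return open_client()  # no master connection exists before this point
    finally:
        os.close(fd)  # closing the descriptor releases the lock
```

A second watcher started while the first is stuck fails on the lock and exits without ever occupying one of the master's client slots.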
  6. 24 Feb, 2009 1 commit
    • Remove the extra_args parameter in instance start · 07813a9e
      Iustin Pop authored
      This patch removes the extra_args parameter and instead switches the
      instance to the HV_KERNEL_ARGS hypervisor option.
      This is a big change, but it's a needed cleanup: this extra parameter on
      all RPC calls is not generic, and we also need to have a persistent value
      Reviewed-by: imsnah
  7. 16 Feb, 2009 2 commits
    • watcher: fix checking of boot IDs · 3448aa22
      Iustin Pop authored
      The recent change (commit 2151) to the watcher to make it handle offline
      nodes also saves the offline attribute to the state file, but this is
      not needed and also breaks the checking of the boot ID. This patch
      simply removes it, restoring the correct behaviour.
      Reviewed-by: imsnah
    • watcher: autoarchive old jobs · f07521e5
      Iustin Pop authored
      This patch adds auto-archiving of jobs older than 6 hours to the
      watcher.
      Reviewed-by: imsnah
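As an illustration of the six-hour cutoff, the selection of jobs to archive could be sketched like this in Python 3. The data layout (a dict of job id to end timestamp) and measuring age from the end timestamp are assumptions. The commit message only states the threshold.

```python
import time

MAX_JOB_AGE = 6 * 3600  # six hours, in seconds


def jobs_to_archive(jobs, now=None):
    """Return the ids of jobs that ended more than MAX_JOB_AGE seconds ago.

    `jobs` maps a job id to its end timestamp, or None if still running.
    """
    if now is None:
        now = time.time()
    return [job_id for job_id, end_ts in sorted(jobs.items())
            if end_ts is not None and now - end_ts > MAX_JOB_AGE]
```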
  8. 04 Feb, 2009 1 commit
    • Implement lockless query operations · ec79568d
      Iustin Pop authored
      This patch adds the framework for, and enables, lockless OpQueryInstances. This
      means that instances will be shown in ERROR_up or ERROR_down state, even though
      this is not an error (but just an in-progress job).
      The framework is implemented as follows:
        - the OpQueryInstances, OpQueryNodes and OpQueryExports opcodes take
          an additional “use_locking” flag which will denote whether to lock
          or not; this patch only implements this for LUQueryInstances
        - the luxi query functions take an additional argument use_locking
          which is passed to the master daemon, and then passed to the above
          opcodes
        - cli.py exports a new SYNC_OPT command line option which sets this
          flag to True
        - except for gnt-instance list, which uses this option, and for
          name-only queries (e.g. QueryNodes(fields=["names"])), all other
          callers are setting this flag to True
        - RAPI also sets the flag to True
      The patch was tested with a continuous (0.2s sleep in-between)
      gnt-instance list during a burnin, and no problems were observed.
      Reviewed-by: ultrotter
  9. 13 Jan, 2009 1 commit
  10. 11 Dec, 2008 1 commit
    • Fix epydoc format warnings · c41eea6e
      Iustin Pop authored
      This patch should fix all outstanding epydoc parsing errors; as such, we
      switch epydoc into verbose mode so that any new errors will be visible.
      Reviewed-by: imsnah
  11. 05 Dec, 2008 1 commit
    • watcher: handle offline nodes better · cbfc4681
      Iustin Pop authored
      This patch changes the LUQueryInstances to show a different state for
      offline nodes and also modifies the watcher to understand the offline
      state in its checks.
      Reviewed-by: ultrotter
  12. 20 Oct, 2008 1 commit
    • Remove the logger.py module · 82d9caef
      Iustin Pop authored
      Since we now use only one function from the logger module
      (SetupLogging), we move it to utils.py (which is already imported by all
      users of this function), and we remove the module.
      Reviewed-by: imsnah
  13. 01 Oct, 2008 4 commits
    • Convert ganeti-watcher · 2859b87b
      Michael Hanselmann authored
      Use RPC calls instead of ssconf.
      Reviewed-by: iustinp
    • Fix the watcher with down nodes · 37b77b18
      Iustin Pop authored
      The watcher didn't handle down nodes; fix this by ignoring (in
      secondary node reboot checks) any node that doesn't return a boot id.
      Reviewed-by: imsnah
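The boot id check with down nodes skipped can be sketched as below (Python 3). The function name and the dict-based inputs are hypothetical. The sketch treats a missing or empty boot id as "node is down", which is how the commit message describes it.

```python
def rebooted_nodes(old_boot_ids, new_boot_ids):
    """Return the names of nodes whose boot id changed since the last run.

    Nodes that returned no boot id are assumed to be down and are skipped,
    so they never trigger the secondary node reboot handling. In this sketch
    a node with no previously recorded boot id also counts as changed.
    """
    rebooted = []
    for name, new_id in sorted(new_boot_ids.items()):
        if not new_id:
            continue  # down node: nothing to compare against
        if old_boot_ids.get(name) != new_id:
            rebooted.append(name)
    return rebooted
```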
    • Fix the watcher not restarting instance bug · b7309a0d
      Iustin Pop authored
      The watcher was using conflicting attributes of the instance:
        - it queried the admin_/oper_state, which are booleans
        - but it compared those to the status (which is a text field)
      The code was changed to query the aggregated 'status' field, as that
      will also return an indication of node problems, and we can use this
      single field for all decisions. We still ask for the admin_state field as
      that is needed for the activate disks check (in secondary node restart).
      The patch also touches the watcher in some other parts:
        - log exceptions more nicely
        - convert a method to @staticmethod
        - remove unused imports
      Reviewed-by: imsnah
    • Remove last use of utils.RunCmd from the watcher · 5188ab37
      Iustin Pop authored
      The watcher has one last use of ganeti commands as opposed to sending
      requests via luxi. The patch changes this to use the cli functions.
      The patch also has two other changes:
        - fix the docstring for OpVerifyDisks (found out while converting
        - enable stderr logging on the watcher when “-d” is passed
      Reviewed-by: imsnah
  14. 07 Aug, 2008 1 commit
  15. 30 Jul, 2008 1 commit
    • Unify SetupDaemon/SetupLogging · 59f187eb
      Iustin Pop authored
      The 'old-style' info, error, debug logs do not make much sense. This
      patch unifies the SetupLogging and SetupDaemon functions. As a result,
      all the commands log to a 'commands.log' file.
      The patch also changes the log setup to keep going if there's an error
      in setting up the file logging but we're logging to stderr.
      Also, burnin now logs to its own file (burnin.log).
      Reviewed-by: ultrotter
  16. 10 Jul, 2008 1 commit
  17. 04 Jul, 2008 1 commit
    • Fix some issues with the watcher · 26517d45
      Iustin Pop authored
      This patch fixes two bugs:
        - the state file is not saved because we use the method for checking
          for updated data
        - in two places 'Error' was used instead of 'Exception', which breaks
          error handling
        - the unused 're' import has been removed
        - a variable named 'id' which collides with a builtin function has
          been renamed
      Note that comparing the serialized forms might create false negatives
      (due to the dicts being reordered) but that will just cause an extra
      write of the file, which is sub-optimal but harmless.
      Reviewed-by: ultrotter
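The note about comparing serialized forms can be illustrated with a small Python 3 sketch. The standard json module stands in for whatever serializer the watcher actually uses, and maybe_save and its write callable are hypothetical names.

```python
import json


def maybe_save(state, last_serialized, write):
    """Write `state` only if its serialized form changed; return that form."""
    serialized = json.dumps(state)
    if serialized != last_serialized:
        write(serialized)
    return serialized


writes = []
first = maybe_save({"a": 1, "b": 2}, None, writes.append)
maybe_save({"a": 1, "b": 2}, first, writes.append)  # identical: no write
maybe_save({"b": 2, "a": 1}, first, writes.append)  # equal dict, new order
```

The third call writes again even though the two dicts compare equal, because the key order differs in the serialized text. That is the harmless extra write the commit message mentions. Passing sort_keys=True to json.dumps would be one way to avoid it in this sketch.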
  18. 03 Jul, 2008 1 commit
    • Add custom logging setup for daemons · 3b316acb
      Iustin Pop authored
      It's better for daemons if:
        - they log only to one log file
        - the log level is included
        - for debug runs, the filename/line number is included
      This patch moves the custom formatter from the watcher to the logging
      module and generalizes it; then it changes the master daemon to use this
      function instead of the generic logging (which might be deprecated
      anyway in the future).
      Reviewed-by: imsnah
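The three requirements listed in this commit (a single log file, the level in each line, file and line number for debug runs) map onto Python's standard logging module roughly as follows. This is a Python 3 sketch with a hypothetical setup_logging function. It is not the project's actual helper, and the exact format string is an assumption.

```python
import logging


def setup_logging(logfile, debug=False):
    """Send all log output to one file, always including the log level."""
    fmt = "%(asctime)s %(levelname)s"
    if debug:
        fmt += " %(module)s:%(lineno)d"  # source location for debug runs only
    fmt += " %(message)s"

    handler = logging.FileHandler(logfile)
    handler.setFormatter(logging.Formatter(fmt))

    root = logging.getLogger()
    for old in list(root.handlers):
        root.removeHandler(old)  # keep exactly one destination
    root.addHandler(handler)
    root.setLevel(logging.DEBUG if debug else logging.INFO)
    return handler
```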
  19. 18 Jun, 2008 8 commits
  20. 13 May, 2008 2 commits
    • Watcher: do not activate disks for started instances · eee1fa2d
      Iustin Pop authored
      Currently the watcher first runs the instance startup and then the
      boot-id method of disk reactivation. However, regardless of whether
      a node has rebooted or not, if we just started an instance, there's
      no need for its disks to be activated again, since the start instance
      operation has done that (if it is at all possible).
      The patch modifies the watcher to remember all started instances and not
      run activate-disks for them.
      Reviewed-by: ultrotter
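A minimal Python 3 sketch of "remember what was started, then skip it during disk activation". The function and parameter names are hypothetical, and the two callables stand in for the real start and activate-disks operations.

```python
def restart_and_activate(down_instances, instances_on_rebooted_nodes,
                         start, activate_disks):
    """Start down instances, then activate disks only where still needed."""
    started = set()
    for name in down_instances:
        start(name)
        started.add(name)  # starting an instance also activates its disks

    for name in instances_on_rebooted_nodes:
        if name in started:
            continue  # already handled by the start above
        activate_disks(name)
    return started
```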
    • Watcher: do not activate disks for admin_down · 0c0f834d
      Iustin Pop authored
      Currently the watcher does activate disks (via bootid mechanisms) even
      for admin_down instances. This patch logs and skips over these
      instances.
      Reviewed-by: ultrotter
  21. 12 Dec, 2007 1 commit
  22. 03 Dec, 2007 1 commit
  23. 13 Nov, 2007 1 commit
  24. 10 Oct, 2007 2 commits
  25. 21 Sep, 2007 1 commit
    • Remove requirement that host names are FQDN · 89e1fc26
      Iustin Pop authored
      We currently require that hostnames are FQDNs, not short names
      (node1.example.com instead of node1). We can allow short names as long
      as:
        - we always resolve the names as returned by socket.gethostname()
        - we rely on having a working resolver
      These issues are not as big as they may seem, as we only did gethostname() in
      a few places in order to check for the master; we already required a
      working resolver all over the code for the other nodes' names (and thus
      requiring the same for the current node name is normal). The patch
      moves some resolver calls from within execution path to the checking
      path (which can abort without any problems). It is important that after
      this patch is applied, no name resolving is called from the execution
      path (LU.Exec() or other code that is called from within those methods)
      as in this case we get much better code flow.
      This patch also changes the functions for doing name lookups and
      encapsulates all functionality in a single class.
      The final change is that, by requiring a working resolver at all times, we
      can change the 'return None' into an exception and thus we don't have to
      check manually each time; only some special cases will check
      (ganeti-daemon and ganeti-watcher which are not covered by the
      generalized exception handling in cli.py). The code is cleaner this way.
      Reviewed-by: imsnah
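The idea of wrapping name lookups in a single class that raises instead of returning None can be sketched as follows (Python 3). ResolvedHost and ResolverError are hypothetical names chosen for this illustration. The injectable lookup parameter exists only to make the sketch testable without a real resolver.

```python
import socket


class ResolverError(Exception):
    """Hypothetical error raised when a host name cannot be resolved."""


class ResolvedHost:
    """Resolve a (possibly short) host name once and keep the results."""

    def __init__(self, name=None, lookup=socket.gethostbyname_ex):
        if name is None:
            name = socket.gethostname()  # default to the current node
        try:
            self.name, self.aliases, self.ipaddrs = lookup(name)
        except (socket.herror, socket.gaierror) as err:
            # Raise instead of returning None, so callers need no manual check.
            raise ResolverError("cannot resolve %s: %s" % (name, err)) from err
        self.ip = self.ipaddrs[0]
```

Callers on the checking path can let the exception abort the operation early, which keeps name resolution out of the execution path as the commit message asks.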
  26. 14 Aug, 2007 1 commit
    • Style changes for pep-8 and python-3000 compliance. · 3ecf6786
      Iustin Pop authored
      This changes the raising of exceptions from:
        raise Exception, value
      to:
        raise Exception(value)
      as the first form will be removed in python-3000 and the second form is
      preferred now.
      The changes also involve a few cases of changing from raising standard
      exceptions to using our own ones.
      The new version also fixes many pylint-generated warnings, especially in
      ganeti-noded where I changed many methods to @staticmethod.
      There is no functionality changed (barring any bugs).
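For reference, the two raise forms look like this in practice. The sketch is Python 3, where the old comma form is a syntax error and therefore appears only in a comment. ConfigurationError here is a local stand-in for a project-specific exception class, and check_port is a hypothetical example function.

```python
class ConfigurationError(Exception):
    """Local stand-in for a project-specific exception class."""


def check_port(port):
    """Return `port` if it is a valid TCP port, else raise."""
    if not 0 < port < 65536:
        # Old form, valid in Python 2 only:
        #   raise ConfigurationError, "invalid port"
        # Call form, valid in both Python 2 and Python 3:
        raise ConfigurationError("invalid port: %d" % port)
    return port
```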