  1. Jul 14, 2009
    • ganeti-masterd: avoid SimpleConfigReader · b2890442
      Guido Trotter authored
      
      SimpleStore is much more lightweight than SimpleConfigReader, and to
      just get the master name we can use that. This is currently the only
      usage of SimpleConfigReader, but we're not going to delete the class,
      as new usages will come in for ganeti-confd (in 2.1). Using it there,
      though, will make the class even heavier to load, so it makes sense
      to convert this simple usage.
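
      For illustration, a minimal sketch of the lighter code path (assuming
      the SimpleStore API from lib/ssconf.py; treat the method name as
      illustrative):

          from ganeti import ssconf

          # SimpleStore only reads the small ssconf files, so this avoids
          # parsing the whole cluster configuration for a single value
          master_name = ssconf.SimpleStore().GetMasterNode()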
      
      Signed-off-by: Guido Trotter <ultrotter@google.com>
  2. Jul 08, 2009
  3. Jul 07, 2009
  4. Jun 29, 2009
  5. Jun 15, 2009
  6. Jun 09, 2009
    • rpc: Add a simple failure reporting framework · 2cc6781a
      Iustin Pop authored
      
      This patch adds a simple failure reporting tool, similar to bdev's
      _ThrowError. In backend, we move towards the new-style RPC results
      (of type (status, payload)), so functions which use this style can
      very easily log and return the error message using this new function.
      
      The exception is declared here and not in errors.py since it's local to
      the node-daemon/backend combination.
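
      A minimal sketch of the framework (names follow the description
      above, but treat them as illustrative rather than the exact code):

          import logging

          class RPCFail(Exception):
              """Error raised by backend functions to report a failure."""

          def _Fail(msg, *args):
              """Log an error and raise it as RPCFail."""
              if args:
                  msg = msg % args
              logging.error(msg)
              # the node daemon turns this into a (False, msg) RPC result
              raise RPCFail(msg)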
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
  7. May 27, 2009
    • Add a node powercycle command · f5118ade
      Iustin Pop authored
      
      This (somewhat big) patch adds support for remotely rebooting the nodes
      via whatever support the hypervisor has for such a concept.
      
      For KVM/fake (and containers in the future) this just uses sysrq plus
      a ‘reboot’ call if the sysrq method fails. For Xen, it first tries
      the above, and then a Xen-hypervisor reboot (we try sysrq first since
      it only requires opening a file handle, whereas a Xen reboot means
      launching an external utility).
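
      In sketch form, the Linux-level fallback chain looks like this
      (function name and details illustrative; the real code lives in the
      hypervisor abstractions):

          import os

          def _LinuxPowercycle():
              """Try a sysrq reboot, falling back to the reboot utility."""
              try:
                  # sysrq 'b' reboots immediately, needing only a file write
                  sysrq = open("/proc/sysrq-trigger", "w")
                  try:
                      sysrq.write("b")
                  finally:
                      sysrq.close()
              except EnvironmentError:
                  # the heavier path: launch the external reboot utility
                  os.system("reboot -f")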
      
      The user interface is:
      
          # gnt-node powercycle node5
          Are you sure you want to hard powercycle node node5?
          y/[n]/?: y
          Reboot scheduled in 5 seconds
      
      The node hopefully reboots after sending the reply. In case the clock
      is broken, “time.sleep(5)” might take ages (but then I suspect SSL
      negotiation wouldn't work).
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
  8. May 25, 2009
    • watcher: automatically restart noded/rapi · c4f0219c
      Iustin Pop authored
      
      This patch makes the watcher automatically restart the node and rapi
      daemons if they are not running (as per the PID file).

      This is not an exhaustive test; a better one would be a TCP connect
      to the port, and an even better one a simple protocol ping (e.g. GET
      / for rapi and an rpc_call_alive for noded), but since we don't know
      how they've been started, we can't implement that today. rapi would
      need to write the SSL/port to a file, and noded something similar, so
      that we know how to connect.
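
      A sketch of the PID-file check (utils helpers as in lib/utils.py; the
      init script path is an assumption):

          import logging
          from ganeti import utils

          def EnsureDaemon(daemon_name, pid_file):
              """Restart a daemon if its PID file points to a dead process."""
              pid = utils.ReadPidFile(pid_file)
              if utils.IsProcessAlive(pid):
                  return
              logging.info("Daemon %s not running, restarting", daemon_name)
              utils.RunCmd(["/etc/init.d/ganeti", "restart", daemon_name])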
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>
    • watcher: handle full and drained queue cases · 24edc6d4
      Iustin Pop authored
      
      Currently the watcher is broken when the queue is full, thus not
      fulfilling its job as a queue cleaner. It also doesn't handle the
      queue's drained status nicely.
      
      This patch does a few changes:
        - first archive jobs, and only afterwards submit new ones; this
          fixes the case where the queue is already full and there are jobs
          suited for archiving (but not the case where the jobs are all too
          young to be archived)
        - handle the job-queue-full and drained cases gracefully: instead
          of tracebacks, log such cases nicely
        - reverse the initial value and special cases for update_file; we
          now whitelist instead of blacklist cases, since we have many more
          blacklist cases than vice versa, and we set the flag to True only
          after the run is successful
      
      The last change, especially, is a significant one: now errors during the
      watcher run will not update the status file, and thus they won't be lost
      again in the logs.
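
      The resulting control flow, roughly (helper names are illustrative;
      the exception names are the ones I expect from errors.py):

          import logging
          from ganeti import errors

          def RunWatcherCycle(client, notepad):
              """One watcher run: archive first, flag success last."""
              update_file = False
              try:
                  ArchiveJobs(client)    # archive first, freeing queue space
                  CheckInstances(client, notepad)
                  CheckDisks(client, notepad)
                  update_file = True     # whitelist: set only after success
              except errors.JobQueueFull:
                  logging.error("Job queue full, skipping this watcher run")
              except errors.JobQueueDrainError:
                  logging.error("Job queue drained, skipping this watcher run")
              if update_file:
                  notepad.Save()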
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>
  9. May 21, 2009
    • Add a luxi call for multi-job submit · 2971c913
      Iustin Pop authored
      
      As a workaround for the job submit timeouts that we have, this patch
      adds a new luxi call for multi-job submit; the advantage is that all
      the jobs are added to the queue first, and only afterwards can the
      workers start processing them.
      
      This is definitely faster than per-job submit, where the submission of
      new jobs competes with the workers processing jobs.
      
      On a pure no-op OpDelay opcode (not on master, not on nodes), we have:
        - 100 jobs:
          - individual: submit time ~21s, processing time ~21s
          - multiple:   submit time 7-9s, processing time ~22s
        - 250 jobs:
          - individual: submit time ~56s, processing time ~57s
                        run 2:      ~54s                  ~55s
          - multiple:   submit time ~20s, processing time ~51s
                        run 2:      ~17s                  ~52s
      
      which shows that we indeed gain on the client side, and maybe even on
      the total processing time for a high number of jobs. For just 10 or so I
      expect the difference to be just noise.
      
      This will probably require increasing the timeout a little when
      submitting too many jobs - 250 jobs at ~20 seconds is close to the
      current rw timeout of 60s.
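
      From the client side the new call looks roughly like this (the opcode
      and its fields are just an example matching the no-op test above):

          from ganeti import luxi, opcodes

          client = luxi.Client()
          # 250 single-opcode jobs, submitted in one luxi call; they are
          # all queued before any worker starts processing
          jobs = [[opcodes.OpTestDelay(duration=0, on_master=False,
                                       on_nodes=[])]
                  for _ in range(250)]
          job_ids = client.SubmitManyJobs(jobs)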
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
  10. May 20, 2009
  11. May 19, 2009
    • watcher: try to restart the master if down · 7dfb83c2
      Iustin Pop authored
      
      Bugs in either our code or in associated libraries can bring the master daemon
      down, and this (due to the 2.0 architecture) stops all work on the cluster.
      
      Since the watcher already does periodic checks on the cluster, we modify
      it to try to start the master automatically in case of failures to
      connect. This will be tried only once per cycle.
      
      Also, in this case, we modify the code so that the watcher status
      file is not updated - its timestamp will thus reflect the time of the
      last successful connection to the master.
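
      A sketch of the connect-and-maybe-restart logic (function names
      illustrative):

          import logging, sys
          from ganeti import cli, utils

          def OpenMasterClient():
              """Connect to masterd, trying to start it once if it's down."""
              try:
                  return cli.GetClient()
              except Exception:
                  # master seems down; start it, retry on the next cycle
                  logging.warning("Can't connect to master, restarting it")
                  utils.RunCmd(["ganeti-masterd"])
                  sys.exit(1)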
      
      Side note: the except errors.ConfigurationError part could be cleaned
      up, since in 2.0 we don't usually get that directly, and if we do
      it's an error and we shouldn't touch the file anyway; but that is not
      an rc5 change.
      
      Signed-off-by: Iustin Pop <iustin@google.com>
  12. May 06, 2009
  13. May 05, 2009
  14. May 04, 2009
  15. Apr 06, 2009
    • Disable synchronous (locking) queries · 77921a95
      Iustin Pop authored
      This patch raises an error in the master daemon in case the user
      requests a locking query; accordingly, all clients were modified to
      send only lockless queries. This is a short-term fix; for a proper
      fix, the clients should be modified to submit a job when the user
      requests a locking query.
      
      The other approach would be to ignore the flag passed by the client;
      this would be worse, as clients wouldn't even get an error.
      
      This has several possible impacts:
        - some commands might not have been converted, and will thus fail;
          this can be remedied easily
        - the consistency of commands is lost; e.g. node failover will not
          lock the node *while we get the node info*, so we could miss some
          data; this is again in the vein of the atomic operations that are
          missing in the current query-and-act model of the gnt-* scripts
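
      On the master side this amounts to a guard along these lines (a
      sketch; the wrapper function is hypothetical):

          from ganeti import errors

          def CheckQueryFlags(use_locking):
              """Reject synchronous (locking) queries in the master daemon."""
              # raising is better than silently dropping the flag: the
              # client at least learns its locking request wasn't honoured
              if use_locking:
                  raise errors.OpPrereqError("Sync queries are not allowed")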
      
      Reviewed-by: imsnah, ultrotter
    • Fix the output of watcher on non-master nodes · 2c404217
      Iustin Pop authored
      Currently the watcher spews error messages on non-master nodes. This
      patch cleans that up.
      
      Reviewed-by: imsnah
    • Change the watcher to use jobs instead of queries · 6dfcc47b
      Iustin Pop authored
      As per the mailing list discussion, this patch changes the watcher to
      use a single job (two opcodes) for getting the cluster state (node list
      and instance list); it will then compute the needed actions based on
      this data.
      
      The patch also archives this job and the verify-disks job.
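
      In sketch form (opcode names as in 2.0; the field lists and the open
      luxi connection "client" are illustrative):

          from ganeti import cli, opcodes

          # one job, two opcodes: the whole cluster state in one submit
          job = [opcodes.OpQueryInstances(names=[], use_locking=False,
                                          output_fields=["name", "status"]),
                 opcodes.OpQueryNodes(names=[], use_locking=False,
                                      output_fields=["name", "bootid"])]
          job_id = client.SubmitJob(job)
          # one result entry per opcode
          instances, nodes = cli.PollJob(job_id, cl=client)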
      
      Reviewed-by: imsnah
    • Add some more debugging info to masterd · e566ddbd
      Iustin Pop authored
      This patch will log data about queries, which are today completely
      invisible (at the default log level) in the master log file.
      
      Reviewed-by: imsnah
  16. Mar 09, 2009
    • watcher: fix startup sequence locking the master · cc962d58
      Iustin Pop authored
      Currently, the watcher startup sequence does:
        - open a luxi client
        - get the instance list
        - get the node boot ids
        - open and lock the status file, and:
          - archive jobs
          - restart the down instances
          - check disks
      
      This, of course, can lead to problems when a node is (genuinely or
      not) locked for more than (watcher interval * maximum query clients)
      time. At that point, the master is completely unresponsive until the
      node is unlocked and all the watchers exit with an error due to the
      state file being locked by the first instance.
      
      This patch reworks the startup sequence to first open/lock the status
      file, and only then open a luxi client. This should prevent the above
      case.
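
      In outline, the reworked sequence (class and helper names
      approximate):

          from ganeti import cli

          def Main():
              # lock the status file before anything else: if another
              # watcher holds it, we exit without ever consuming one of
              # the master's limited query client slots
              notepad = WatcherState()
              try:
                  client = cli.GetClient()
                  ArchiveJobs(client)
                  RestartDownInstances(client, notepad)
                  VerifyDisks(client)
              finally:
                  notepad.Save()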
      
      Reviewed-by: ultrotter
  17. Feb 27, 2009
    • Create runtime dir in bootstrap · 9dae41ad
      Guido Trotter authored
      Some hypervisors (KVM) need RUN_GANETI_DIR to exist even at cluster
      init time. This patch creates it in InitCluster just before hv
      parameter checking. Since the code to build a list of directories is
      already repeated twice in the code, and this would be the third time,
      we abstract it into a utils.EnsureDirs function and call that from
      ganeti-noded, ganeti-masterd and bootstrap.
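
      The shared helper, sketched (the real function may also enforce the
      mode on pre-existing directories):

          import errno
          import os

          def EnsureDirs(dirs):
              """Make sure each (path, mode) pair in dirs exists."""
              for dir_name, dir_mode in dirs:
                  try:
                      os.mkdir(dir_name, dir_mode)
                  except OSError as err:
                      # an existing directory is fine, anything else is not
                      if err.errno != errno.EEXIST:
                          raise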
      
      Reviewed-by: iustinp
  18. Feb 24, 2009
    • Remove the extra_args parameter in instance start · 07813a9e
      Iustin Pop authored
      This patch removes the extra_args parameter and instead switches the
      instance to the HV_KERNEL_ARGS hypervisor option.

      This is a big change, but it's a needed cleanup: this extra parameter
      on all RPC calls is not generic, and we also need a persistent value
      here.
      
      Reviewed-by: imsnah
  19. Feb 16, 2009
    • watcher: fix checking of boot IDs · 3448aa22
      Iustin Pop authored
      The recent change (commit 2151) that made the watcher handle offline
      nodes also saves the offline attribute to the state file; this is not
      needed and breaks the checking of the boot ID. This patch simply
      removes it, restoring the correct behaviour.
      
      Reviewed-by: imsnah
    • watcher: autoarchive old jobs · f07521e5
      Iustin Pop authored
      This patch adds auto-archiving of jobs older than 6 hours to the
      watcher.
      
      Reviewed-by: imsnah
  20. Feb 13, 2009
    • RAPI: fixes related to write mode · 6e99c5a0
      Iustin Pop authored
      This patch fixes many small issues related to write functions:
        - update documentation w.r.t. how to add users
        - update the instance add function for latest API
        - add instance delete
        - fix addition of tags
        - update some error messages
      
      Reviewed-by: imsnah
    • RAPI: format error messages as JSON · 1f8588f6
      Iustin Pop authored
      This patch changes the format of the HTTP error messages from
      text/html, which is hard for RAPI clients to parse, to JSON, which
      can be parsed automatically.
      
      The error message is an object which always contains three keys:
        - code, an integer with the error code
        - message, a short description
        - explain, holding (if available) a description of the error
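
      For example, a timeout talking to the master might now come back with
      a body like this (values illustrative):

          {
            "code": 504,
            "message": "Gateway Timeout",
            "explain": "Error while talking to the master daemon"
          }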
      
      In order to implement this, there are some changes to the HTTP server
      and executor classes. I've tested that the error handling still works
      (though less optimally, with no error message) in case the error
      formatting itself raises an exception.
      
      Reviewed-by: imsnah
    • Make RAPI return 502/504 errors for luxi errors · 77e1d753
      Iustin Pop authored
      This changes the RAPI error codes for luxi errors; a timeout error is
      now reported properly as 504, while any other luxi error is reported as
      502.
      
      It would be good to convert even more errors into proper return codes in
      the future.
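
      The mapping, sketched (the wrapper is hypothetical; the HTTP
      exception classes are the ones I expect from lib/http):

          from ganeti import http, luxi

          def _TranslateLuxiErrors(fn, *args):
              """Run a luxi call, mapping its errors to HTTP statuses."""
              try:
                  return fn(*args)
              except luxi.TimeoutError:
                  raise http.HttpGatewayTimeout()   # 504
              except luxi.Error:
                  raise http.HttpBadGateway()       # 502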
      
      Reviewed-by: imsnah
    • Fix ganeti-rapi startup with missing certificate · 6bb258a7
      Iustin Pop authored
      This patch displays a nicer error message compared to the default
      stacktrace.
      
      Reviewed-by: imsnah