  1. 15 Mar, 2010 2 commits
  2. 12 Mar, 2010 2 commits
    • utils.CreateBackup: Use human-readable instead of seconds since Epoch · 1d466a4f
      Michael Hanselmann authored
      
      
      Seconds since the Epoch are not easily readable by a human. Using a
      formatted timestamp makes backup names much easier to read (e.g.
      “….backup-2010-03-12_14_02_43.…”). This patch also makes OS log files
      use this formatted timestamp.
      Signed-off-by: Michael Hanselmann <hansmi@google.com>
      Reviewed-by: Iustin Pop <iustin@google.com>
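A minimal Python sketch of the naming idea, assuming a strftime layout inferred from the example name in the message; `backup_prefix` is a hypothetical helper, not the patched utils.CreateBackup, and whatever follows the trailing dot (elided as “…” above) is deliberately left out.

```python
import datetime
import os


def backup_prefix(file_name, now=None):
    """Build a human-readable backup name prefix for file_name.

    Hypothetical helper for illustration; it is not Ganeti's own
    utils.CreateBackup code.
    """
    if now is None:
        now = datetime.datetime.now()
    # Year-month-day_hour_minute_second instead of seconds since the Epoch
    stamp = now.strftime("%Y-%m-%d_%H_%M_%S")
    # The trailing dot leaves room for a unique suffix appended later
    return "%s.backup-%s." % (os.path.basename(file_name), stamp)


print(backup_prefix("/etc/hosts", datetime.datetime(2010, 3, 12, 14, 2, 43)))
# -> hosts.backup-2010-03-12_14_02_43.
```

One option, assumed here rather than taken from the patch, is to pass the result as `prefix=` to `tempfile.mkstemp()` (together with `dir=`) so that a unique suffix is appended and two backups in the same second cannot collide.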
    • Improve cluster verify with hypervisor errors · 0cf5e7f5
      Iustin Pop authored
      
      
      If the hypervisor has problems on a node, backend.VerifyNode currently
      exits via an exception (there are two possible exit paths: a
      HypervisorError from hypervisor.Verify() and an RPCFail from
      GetInstanceList). This is bad because it invalidates all other checks
      for that node.
      
      This patch catches these two errors and lets the rest of the
      VerifyNode function run. This makes a cluster-verify run more
      complete; for example, only LVs that are really missing are now
      reported, instead of all of them.
      
      Cluster verify is still not perfect, as it skips some tests even when
      it has the data for them, but fixing that requires a more complete
      rewrite (see issue 90).
      
      Also, the patch fixes and improves some error messages in cmdlib.
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
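The "catch and continue" pattern described above can be sketched as follows, assuming the node checks can be modeled as plain callables. `HypervisorError` and `RPCFail` are local stand-ins, and `verify_node` with its result keys is hypothetical, not the real backend.VerifyNode.

```python
class HypervisorError(Exception):
    """Stand-in for the hypervisor error type (hypothetical)."""


class RPCFail(Exception):
    """Stand-in for the RPC failure type (hypothetical)."""


def verify_node(verify_hv, list_instances, list_lvs):
    """Run every node check, recording failures instead of aborting."""
    result = {}
    # First former exit path: hypervisor verification
    try:
        result["hypervisor"] = verify_hv()
    except HypervisorError as err:
        result["hypervisor"] = "Error while verifying hypervisor: %s" % err
    # Second former exit path: listing instances
    try:
        result["instancelist"] = list_instances()
    except RPCFail as err:
        result["instancelist"] = "Error while listing instances: %s" % err
    # This check now runs even if one of the checks above failed
    result["lvlist"] = list_lvs()
    return result
```

The design choice is simply to narrow each `try` to the call that can fail, so one broken subsystem only blanks its own entry in the result.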
  3. 09 Mar, 2010 1 commit
  4. 08 Mar, 2010 8 commits
  5. 22 Feb, 2010 2 commits
  6. 03 Feb, 2010 1 commit
  7. 01 Feb, 2010 1 commit
  8. 25 Jan, 2010 1 commit
  9. 04 Jan, 2010 4 commits
  10. 28 Dec, 2009 1 commit
  11. 14 Dec, 2009 2 commits
  12. 30 Nov, 2009 1 commit
  13. 25 Nov, 2009 1 commit
  14. 11 Nov, 2009 1 commit
  15. 06 Nov, 2009 1 commit
    • Fix pylint 'E' (error) codes · 6c881c52
      Iustin Pop authored
      
      
      This patch adds some pylint silences and tweaks the code slightly so
      that “pylint --rcfile pylintrc -e ganeti” reports no errors.
      
      The biggest change is in jqueue.py: _RequireOpenQueue moves out of the
      JobQueue class. Since it is actually a function and not a method (it is
      never used as one), this makes sense and also silences two pylint
      errors.
      
      The other real code change is in utils.py, where FieldSet.Matches now
      returns None instead of False on failure; this still works with the way
      the class and method are used, and makes more sense (it more closely
      resembles the return values of re.match).
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
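Both code changes can be sketched in Python. This is a simplified illustration, not the patched jqueue.py or utils.py: the `_queue_open` attribute, the `Submit` method and the anchored-regex FieldSet constructor are assumptions made for the example.

```python
import functools
import re


def _RequireOpenQueue(fn):
    """Module-level decorator guarding JobQueue methods (sketch)."""
    @functools.wraps(fn)
    def wrapper(self, *args, **kwargs):
        # Hypothetical check; the real queue may track "open" differently
        if not self._queue_open:
            raise RuntimeError("Queue should be open")
        return fn(self, *args, **kwargs)
    return wrapper


class JobQueue:
    def __init__(self):
        self._queue_open = True

    @_RequireOpenQueue
    def Submit(self, job):
        return "submitted %s" % job


class FieldSet:
    """Set of field-name regexes (simplified sketch)."""

    def __init__(self, *items):
        # Anchor each pattern so it must match the whole field name
        self.items = [re.compile("^%s$" % value) for value in items]

    def Matches(self, field):
        """Return the first match object, or None (like re.match)."""
        for item in self.items:
            m = item.match(field)
            if m:
                return m
        return None
```

Because the decorator lives at module level, it receives the undecorated function and never needs a `self` of its own, which is why pylint stops flagging it. Returning the match object also lets callers extract groups, e.g. `FieldSet(r"disk\.size/([0-9]+)").Matches("disk.size/0").group(1)`.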
  16. 05 Nov, 2009 1 commit
    • Add new “daemon-util” script to start/stop Ganeti daemons · f154a7a3
      Michael Hanselmann authored
      
      
      Until now, Ganeti started and stopped its own daemons using custom
      functions: to start a daemon, it simply executed it, and to stop it, it
      sent the appropriate signals. Init scripts had to pay attention to the
      PID file and other details.
      
      With this patch, a new script is added (“daemon-util”, installed in
      $prefix/lib/ganeti/), centralizing the starting and stopping of daemons. The
      provided example init script is adjusted to use this new script. Ganeti's code
      no longer calls its own init script.
      Signed-off-by: Michael Hanselmann <hansmi@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
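To illustrate what centralizing start/stop around a PID file can look like, here is a hypothetical Python helper that builds `start-stop-daemon` command lines. The real daemon-util is a separate script; the PID and binary directories, the use of `start-stop-daemon` and the 30-second retry are all assumptions.

```python
import os


def daemon_command(action, daemon, args=(), pid_dir="/var/run/ganeti",
                   sbin_dir="/usr/sbin"):
    """Return a start-stop-daemon command line for "start" or "stop".

    Hypothetical helper: paths and tool choice are assumptions, not a
    description of the real daemon-util script.
    """
    pidfile = os.path.join(pid_dir, daemon + ".pid")
    if action == "start":
        # Assumes the daemon writes its own PID file once running;
        # everything after "--" is passed to the daemon itself
        return (["start-stop-daemon", "--start", "--quiet",
                 "--pidfile", pidfile,
                 "--startas", os.path.join(sbin_dir, daemon), "--"]
                + list(args))
    if action == "stop":
        # --oknodo: succeed even if the daemon is not running;
        # --retry 30: send TERM, then KILL if it is still alive after 30s
        return ["start-stop-daemon", "--stop", "--quiet", "--oknodo",
                "--retry", "30", "--pidfile", pidfile]
    raise ValueError("unknown action: %r" % action)
```

An init script (or Ganeti itself) would then run e.g. `subprocess.call(daemon_command("stop", "ganeti-noded"))`, so the PID-file handling lives in one place instead of in every caller.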
  17. 04 Nov, 2009 1 commit
  18. 03 Nov, 2009 4 commits
  19. 02 Nov, 2009 2 commits
  20. 22 Oct, 2009 3 commits
    • Adding '--no-ssh-init' option to 'gnt-cluster init'. · b989b9d9
      Ken Wehr authored
      
      
      Allows the initialization of a cluster without the creation or distribution
      of SSH key pairs. Includes changes for LeaveCluster and RPC.
      Signed-off-by: Ken Wehr <ksw@google.com>
      Signed-off-by: Guido Trotter <ultrotter@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
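A negative switch like this is commonly wired up with optparse's `store_false` action. The sketch below is illustrative only: the destination name `ssh_init` and the `init_cluster` step list are hypothetical, not gnt-cluster's actual code.

```python
import optparse


def build_parser():
    """Option parser with a --no-ssh-init switch (illustrative sketch)."""
    parser = optparse.OptionParser(usage="%prog [options] <cluster_name>")
    # SSH setup stays on by default; the flag stores False to turn it off.
    # The destination name "ssh_init" is a hypothetical choice.
    parser.add_option("--no-ssh-init", dest="ssh_init", default=True,
                      action="store_false",
                      help="Don't create or distribute SSH key pairs")
    return parser


def init_cluster(name, ssh_init=True):
    """Return the (hypothetical) setup steps that would run for a cluster."""
    steps = ["write cluster configuration for %s" % name]
    if ssh_init:
        steps.append("generate and distribute SSH key pairs")
    return steps


opts, args = build_parser().parse_args(["--no-ssh-init", "cluster.example.com"])
print(init_cluster(args[0], ssh_init=opts.ssh_init))
# -> ['write cluster configuration for cluster.example.com']
```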
    • Try to reduce wrong errors in InstanceShutdown · 3782acd7
      Iustin Pop authored
      
      
      In backend.InstanceShutdown(), there is a race condition between
      checking that the instance exists and trying to shut it down, which
      sometimes results in error messages like:
      
      Tue Oct 20 20:08:30 2009 - WARNING: Could not shutdown instance: Failed
      to force stop instance instance9: Failed to stop instance instance9:
      exited with exit code 1, Error: Domain 'instance9' does not exist.
      
      To fix this, we ignore any hypervisor StopInstance() errors if the
      instance no longer exists, since our purpose (making the instance go
      away) has already been accomplished.
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>
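The fix amounts to re-checking existence before treating a stop failure as real. A minimal sketch, assuming a hypervisor object with `ListInstances()` and `StopInstance(name)` methods; the name-based signatures and the local `HypervisorError` are simplifications, not Ganeti's actual interfaces.

```python
class HypervisorError(Exception):
    """Stand-in for the hypervisor's error type (hypothetical)."""


def shutdown_instance(hyper, name):
    """Stop instance `name`, ignoring errors if it is already gone."""
    if name not in hyper.ListInstances():
        return "not running"
    try:
        hyper.StopInstance(name)
    except HypervisorError:
        # The instance may have disappeared between the check above and
        # the stop call; that is the result we wanted, so only propagate
        # the error if the instance is still listed.
        if name in hyper.ListInstances():
            raise
    return "stopped"
```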
    • Revert breakage introduced in e4e9b806 · 7734de0a
      Iustin Pop authored
      Commit e4e9b806 introduced two problems in backend.InstanceShutdown():
      
      - first, it significantly reduced the check interval (especially for
        the first few checks); very few production VMs shut down within one
        second, and while this does not break anything, it creates
        unnecessary load on the hypervisor
      - second, a wrong test added to the while condition (“not tried_once”)
        means that we sleep only once for an instance and then immediately
        kill it forcefully
      
      Together, these two mean that any instance that is not lucky enough to
      finish within roughly 1-1.5 seconds (the time it takes to sleep and
      check the instance list again) will have this happen:
      
      2009-10-21 23:33:46,034:  pid=16634 INFO Called for inst9 w. False/False
      2009-10-21 23:33:47,440:  pid=16634 ERROR Shutdown of 'inst9' unsuccessful, forcing
      2009-10-21 23:33:47,440:  pid=16634 INFO Called for inst9 w. True/False
      
      The “Called…” lines are logged by the hypervisor shutdown function.
      This of course means that at restart time:
      
      [12775866.644682] EXT3-fs: INFO: recovery required on readonly filesystem.
      [12775866.644689] EXT3-fs: write access will be enabled during recovery.
      [12775868.533674] kjournald starting.  Commit interval 5 seconds
      [12775868.533697] EXT3-fs: sda1: orphan cleanup on readonly fs
      [12775868.551797] EXT3-fs: sda1: 12 orphan inodes deleted
      [12775868.551803] EXT3-fs: recovery complete.
      [12775868.586275] EXT3-fs: mounted filesystem with ordered data mode.
      
      This patch reverts the broken test and changes the sleep to a fixed
      duration of five seconds, since it makes no sense to check for
      shutdown that often (and after ~20 seconds we reach a stable value of
      five seconds anyway).
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>
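The corrected behaviour (a soft stop, polling at a fixed five-second interval until a timeout, and only then a forced stop) can be sketched as below. This is a hypothetical illustration, not the reverted code: the 120-second timeout, the `force` keyword and the injectable `sleep`/`clock` parameters (there so the loop can be tested without waiting) are assumptions. Note that the loop condition depends only on elapsed time, so there is no “tried once” shortcut to force the kill early.

```python
import time


def shutdown_with_timeout(hyper, name, timeout=120, interval=5,
                          sleep=time.sleep, clock=time.monotonic):
    """Soft-stop `name`, poll every `interval` s, force-stop on timeout."""
    hyper.StopInstance(name, force=False)
    deadline = clock() + timeout
    # Keep waiting for as long as time remains; nothing else ends the loop
    while clock() < deadline:
        sleep(interval)  # fixed interval: no burst of fast early checks
        if name not in hyper.ListInstances():
            return "clean"
    # Only reached once the whole timeout has elapsed
    hyper.StopInstance(name, force=True)
    return "forced"
```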