1. 30 Jul, 2008 2 commits
    • Rework master startup/shutdown/failover · b1b6ea87
      Iustin Pop authored
      This (big) patch reworks the master startup/shutdown and fixes the
      master failover.
      
      What does the patch do?
      
      For master start/stop:
        - removes the old ganeti-master script and its associated man page
        - moves the ip start/stop directly into the backend.(Start|Stop)Master
        - adds start/stop of the master/rapi daemon into these functions,
          selectively based on the start/stop arguments
        - makes the master call via rpc StartMaster(start_daemons=False) to
          the local node so that the master IP is started
        - and finally changes the example init.d script to directly start and
          stop all three daemons, since they do the right thing (depending on
          master/not master role)
      
      For master failover:
        - moves the code from LUMasterFailover into bootstrap.MasterFailover,
          since we need to start/stop the master during this operation and
          thus it can't be executed from the master
        - removes the LUMasterFailover and its associated opcode
      
      Note: Ubuntu's /etc/lsb-base-logging.sh is dumb, so the messages 'not
      master' are not seen during startup on non-master nodes.
      
      Reviewed-by: ultrotter
    • Add a new parameter to backend.(Start|Stop)Master · 1c65840b
      Iustin Pop authored
      This patch adds a new, unused for now, parameter to the start and stop
      master operations in backend. The idea behind it is that we need to be
      able to control whether the IP (de)activation is coupled with daemon
      startup/shutdown.
      
      The callers are also modified to pass this parameter (even if unused for
      now).
      
      Reviewed-by: ultrotter
  2. 23 Jul, 2008 1 commit
    • Distribute the queue serial file after each update · c3f0a12f
      Iustin Pop authored
      This patch adds distribution of the queue serial file after each write
      to it (but before a new job is created and written with that ID, and
      before a response is returned, so we should be safe from crashes in
      between).
      
      Currently it only logs when a node cannot be contacted; it should abort
      if more than 50% of the nodes fail.
      
      Reviewed-by: imsnah
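
      The abort rule mentioned above is not part of this patch. A minimal
      sketch of such a check, assuming the distribution step collects one
      success flag per node (the helper name and the result mapping are
      hypothetical, not Ganeti code):

```python
def too_many_failures(results, threshold=0.5):
    """Return True if more than `threshold` of the nodes failed.

    `results` maps node name -> bool (True means the file was delivered).
    Both the mapping shape and this helper are hypothetical.
    """
    if not results:
        return False
    failed = sum(1 for ok in results.values() if not ok)
    return failed / len(results) > threshold
```

      With this rule, exactly half of the nodes failing is still tolerated;
      only a strict majority of failures would abort.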
  3. 11 Jul, 2008 3 commits
    • Convert backend.py to the logging module · 18682bca
      Iustin Pop authored
      The patch also switches some of the exception logs to use
      logging.exception (and therefore the log message will have a different
      format).
      
      (Note that this might not be a good choice in all cases, though)
      
      Reviewed-by: imsnah
    • Fix backend.NodeVolumes handling of LVM output · a17a7623
      Iustin Pop authored
      This is the same fix as for GetVolumeList.
      
      I've checked manually and all other places that call lvm commands are
      already checking the output validity in terms of correct number of
      fields.
      
      Reviewed-by: ultrotter
    • Fix backend.GetVolumeList handling of LVM output · df4c2628
      Iustin Pop authored
      Sometimes ‘lvs’ can spit error messages on stdout, even when one wants
      to parse the output:
      ...
      Inconsistent metadata copies found - updating to use version 2776
      ...
      
      So we need to validate the output to guard against such cases.
      
      The patch replaces the split on the separator with a regex match and
      extracts the fields via groups. The original separator choice is a
      bad one now :(
      
      Reviewed-by: imsnah
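
      A minimal sketch of the validate-by-regex idea, assuming a
      three-field "name|size|attr" line layout with "|" as separator and a
      six-character attribute string. The pattern, the field layout and
      the function name are illustrative assumptions, not the code from
      the patch:

```python
import re

# Hypothetical line layout: <name>|<size>|<6-char attr>, optional trailing "|".
_LVS_LINE = re.compile(r"^ *([^|]+)\|([0-9.]+)\|([^|]{6})\|?$")

def parse_lvs(output):
    """Return {name: (size, attr)} for well-formed lines only."""
    volumes = {}
    for line in output.splitlines():
        match = _LVS_LINE.match(line)
        if match is None:
            # Stray message or malformed line: skip instead of mis-parsing it.
            continue
        name, size, attr = match.groups()
        volumes[name] = (float(size), attr)
    return volumes
```

      A stray message line such as the one quoted above does not match the
      pattern and is skipped, whereas a plain split would have produced a
      field list of the wrong length.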
  4. 27 Jun, 2008 2 commits
  5. 20 Jun, 2008 1 commit
    • Add a rpc call for BlockDev.Close() · d61cbe76
      Iustin Pop authored
      This patch adds rpc layer calls (in rpc.py and the equivalent in
      ganeti-noded) to close a list of block devices, and the wrapper in
      backend.py that takes a list of Disk objects, identifies them and
      returns correctly formatted results.
      
      The reason why this very basic call was missing until now from the rpc
      layer is that we usually don't care about device closes (though we
      should, and will do so in the future) as only drbd has a meaningful
      Close() operation; right now we directly do Shutdown().
      
      The patch is clean enough that it's actually independent of the live
      migration implementation.
      
      Reviewed-by: imsnah
  6. 16 Jun, 2008 2 commits
    • Expose block device grow in backend.py · 594609c0
      Iustin Pop authored
      This patch adds a wrapper over the block device grow operation that
      converts the input and output parameters as needed for the rpc layer.
      
      Reviewed-by: imsnah
    • Add migration support at the rpc layer · 2a10865c
      Iustin Pop authored
      This patch adds the migration rpc call and its implementation in the
      backend. The patch does not deal with the correct activation of disks.
      
      Because of the new RPC, the protocol version is increased.
      
      Reviewed-by: imsnah
  7. 13 May, 2008 2 commits
    • Implement node daemon connectivity tests · 9d4bfc96
      Iustin Pop authored
      This patch adds to gnt-cluster verify checks of inter-node tcp
      communication on the node daemon port, for both the primary and
      (if defined) secondary networks.
      
      The output looks like (4-node cluster, one with the secondary interface
      down):
      * Verifying node node1.example.com
        - ERROR: tcp communication with node 'node3.example.com': failure using the secondary interface(s)
      * Verifying node node2.example.com
        - ERROR: tcp communication with node 'node3.example.com': failure using the secondary interface(s)
      * Verifying node node3.example.com
        - ERROR: tcp communication with node 'node1.example.com': failure using the secondary interface(s)
        - ERROR: tcp communication with node 'node2.example.com': failure using the secondary interface(s)
        - ERROR: tcp communication with node 'node4.example.com': failure using the secondary interface(s)
      * Verifying node node4.example.com
        - ERROR: tcp communication with node 'node3.example.com': failure using the secondary interface(s)
      
      Reviewed-by: imsnah
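
      A minimal sketch of a per-address tcp reachability test of the kind
      described above. The function name and the timeout are assumptions;
      the patch's own helper may differ:

```python
import socket

def tcp_ping(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False
    conn.close()
    return True
```

      Each node would run such a test against the primary and, where
      defined, the secondary address of every other node, and report the
      failures back for display.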
    • Reduce chance of ssh failures in verify cluster · b544cfe0
      Iustin Pop authored
      The cluster verify builds a sorted list of nodes and passes that to all
      the nodes (in parallel) for ssh checks. This means that for a cluster
      with N nodes, there will be approximately N simultaneous connections to
      the first node, then to the second node, etc. This, coupled with the
      ssh daemon's “MaxStartups” parameter, can create false alarms about ssh
      connectivity.
      
      This patch randomizes the node list in the backend (therefore, each node
      should have its own order of ssh-ing to the other nodes), so the chance
      of these alarms should be reduced.
      
      Reviewed-by: ultrotter
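
      The randomization can be sketched as follows; the helper name and
      the injectable random source are assumptions made for illustration:

```python
import random

def shuffled_nodes(node_list, rng=random):
    """Return a shuffled copy of node_list, leaving the input untouched."""
    nodes = list(node_list)
    rng.shuffle(nodes)
    return nodes
```

      Because every node shuffles independently, the N simultaneous
      connections no longer all target the same node first.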
  8. 30 Apr, 2008 1 commit
  9. 28 Apr, 2008 1 commit
    • Move iallocator script execution to ganeti-noded · 8d528b7c
      Iustin Pop authored
      Currently the iallocator execution takes place in the master, which is a
      violation of the current architecture, and will create problems with a
      threaded master daemon.
      
      This patch moves the execution to the backend, similar to the hooks
      runner, by:
        - introducing a new class that handles the execution in the backend
          (and could be used also for listing the allocators, etc.)
        - introducing a new rpc call
        - replacing the actual execution in IAllocator.Run() with a rpc call
      
      This passes burnin with the dumb allocator
      
      Reviewed-by: imsnah
  10. 24 Apr, 2008 1 commit
  11. 10 Apr, 2008 2 commits
    • Move the OS search code into an abstract function · 57c177af
      Iustin Pop authored
      Based on the previous OS search code changes, we can now move the OS
      search code into a generic look-for-file function in utils.py. This
      means that the allocator code can use the same function.
      
      Reviewed-by: ultrotter
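
      A minimal sketch of a generic look-for-file function of the kind
      described above. The name, signature and default test are
      assumptions and may not match the function added to utils.py:

```python
import os

def find_file(name, search_path, test=os.path.exists):
    """Return the full path of the first match for name, or None.

    search_path is a list of base directories, tried in order; test is
    the predicate applied to each candidate path.
    """
    for base_dir in search_path:
        candidate = os.path.join(base_dir, name)
        if test(candidate):
            return candidate
    return None
```

      Passing os.path.isdir as the test would suit OS directories, while
      os.path.isfile would suit allocator scripts.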
    • Change backend._OSSearch return values · c34c0cfd
      Iustin Pop authored
      Currently, the function backend._OSSearch() returns the (first) base dir
      in which this OS can be found. Thereafter the full actual path to the OS
      dir is built in the backend.OSFromDisk() function.
      
      This patch changes this so that _OSSearch() always returns the full path
      to the OS directory, and OSFromDisk uses that as returned (it will only
      build it if it gets a base dir in the first place).
      
      This patch is needed before we can abstract the _OSSearch into a generic
      'look for file object' functionality that can be used for allocator
      plugins search too.
      
      Reviewed-by: ultrotter
  12. 05 Apr, 2008 1 commit
    • Backend directory functions for file backend · 778b75bb
      Manuel Franceschini authored
      Add _[Create,Remove,Rename]FileStorageDir functions, which are needed for
      file-based instance management. These functions check whether the given
      directory to operate on is under the cluster-wide defined default file
      storage dir. If this is not the case, they won't do anything and return
      False. This is to prevent cluster manipulation or damage.
      
      Reviewed-by: ultrotter
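
      The containment check can be sketched as follows. The helper name is
      hypothetical, POSIX-style absolute paths are assumed, and the real
      functions may implement the test differently:

```python
import os.path

def is_under_base(path, base_dir):
    """Return True if path is base_dir itself or lies below it."""
    if not (os.path.isabs(path) and os.path.isabs(base_dir)):
        return False
    # normpath collapses ".." components, so they cannot escape the base.
    path = os.path.normpath(path)
    base_dir = os.path.normpath(base_dir)
    return os.path.commonpath([path, base_dir]) == base_dir
```

      Comparing whole path components rather than string prefixes avoids
      accepting a sibling directory whose name merely starts with the base
      directory's name.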
  13. 18 Mar, 2008 1 commit
  14. 05 Mar, 2008 1 commit
  15. 29 Feb, 2008 1 commit
    • Fix master role stop on cluster destroy · c9064964
      Iustin Pop authored
      Currently the cluster destroy doesn't remove the master role, which
      means that the IP address of the cluster remains assigned to the master
      node.
      
      This patch fixes this and also a docstring in backend.StopMaster().
      
      Reviewed-by: imsnah
  16. 22 Feb, 2008 2 commits
  17. 14 Feb, 2008 1 commit
    • Alter the device activation code · 40a03283
      Iustin Pop authored
      This tiny patch fixes the breakage that the previous patch about
      activation did by removing the Close() call after activation.
      
      The initial reason for that call was that if the device is already
      active and open, but we need it closed, we close it automatically.
      
      This however conflicts with the 2-step open in the case the instance is
      already open.
      
      It makes sense to remove the call since in the current Ganeti setup,
      just doing Close() is not enough to change the device from (e.g.)
      primary to secondary, as some devices (e.g. md) might need Shutdown not
      Close.
      
      It also gets rid of a Close() in the CreateBlockDevice function, due to
      the same reasoning (although in Create the child should not have a
      different status anyway).
      
      Reviewed-by: imsnah
  18. 30 Jan, 2008 1 commit
    • Export bridge information too · 1cafd236
      Guido Trotter authored
      gnt-backup export used to export the ip and mac of each nic, but not which
      bridge it was connected to. This patch adds that information.
      
      Reviewed-by: iustinp
      
  19. 21 Jan, 2008 1 commit
    • Fix VG listing broken by r510 · d87ae7d2
      Iustin Pop authored
      LVM code sometimes adds an extra separator at the end of the field list.
      Make the code strip it if it exists.
      
      Reviewed-by: imsnah
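
      A minimal sketch of stripping one optional trailing separator before
      splitting. The ":" separator and the helper name are assumptions:

```python
def split_fields(line, sep=":"):
    """Split a line into fields, ignoring one trailing separator."""
    line = line.strip()
    if line.endswith(sep):
        line = line[:-len(sep)]
    return line.split(sep)
```

      Both forms of the output then yield the same number of fields, so a
      later field-count check does not reject valid lines.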
  20. 20 Jan, 2008 2 commits
    • Make backend._GetVGInfo check the validity of 'vgs' · f4d377e7
      Iustin Pop authored
      Currently, the function backend._GetVGInfo only checks for errors via
      the exit code of the 'vgs' command. However, there are other ways of
      failure so we need to also check for valid output before parsing.
      
      Furthermore, the checks on the exit code were reported via a 'raise
      LVMError', however this exception is not handled anywhere and so the
      remote caller will not get reasonable data.
      
      This patch does two main things:
        - change the calling protocol for this function to not raise an error,
          and instead return the same type of argument always (dict) with the
          requested keys but values changed into None; this allows in the
          parent rpc call node_info to have valid memory information but
          "error" value for disk space, if there's an error with disks
        - check the validity of the output so that in case we fail to parse
          it, we don't abort with a backtrace in the node daemon but instead
          return the default result value (containing errors), and log these
          cases in the node daemon log file
      
      We also bump the protocol version to 11.
      
      Reviewed-by: ultrotter
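
      The "always return a dict, with None on error" protocol can be
      sketched as follows. The key names, the ":" separator and the
      three-field layout are assumptions for illustration, not taken from
      the patch:

```python
import logging

def parse_vg_info(output, sep=":"):
    """Parse one line of vgs-style output; never raises on bad input."""
    defaults = {"vg_size": None, "vg_free": None, "pv_count": None}
    fields = output.strip().rstrip(sep).split(sep)
    if len(fields) != 3:
        logging.error("Unexpected vgs output: %r", output)
        return defaults
    try:
        return {
            "vg_size": int(round(float(fields[0]))),
            "vg_free": int(round(float(fields[1]))),
            "pv_count": int(fields[2]),
        }
    except ValueError:
        # Log with traceback, but still hand back a well-formed result.
        logging.exception("Cannot parse vgs output: %r", output)
        return defaults
```

      The caller can then merge this dict into its own result without a
      try/except, and the remote side sees None for the disk values while
      still getting valid memory information.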
    • Change a hardcoded path into its proper constant · 97628462
      Iustin Pop authored
      The function backend.UploadFile still uses "/etc/hosts" directly instead
      of the existing constant; this patch fixes this.
      
      Reviewed-by: ultrotter
  21. 16 Jan, 2008 1 commit
  22. 07 Jan, 2008 1 commit
    • Improve verify-disks: broken/missing LV detection · b63ed789
      Iustin Pop authored
      This patch improves the ‘gnt-cluster verify-disks’ command by adding
      support for detecting broken volume groups and missing logical volume
      names.
      
      As such, we no longer try to activate disks for instances where
      activation is not likely to succeed anyway, and instead report them.
      
      Reviewed-by: schreiberal
  23. 11 Dec, 2007 1 commit
    • Return more data in rpc.call_volume_list · cb2037a2
      Iustin Pop authored
      Currently, the volume_list call returns only the volume size. However,
      it is useful to also have two other things: the 'inactive' state of the
      volume (which might trigger a ‘vgchange -a y’ on the volume group) and
      the online state (which shows if the volume is in use or not).
      
      Since this modifies an RPC call, we also bump the protocol version,
      although the single user of the call didn't care about the dictionary
      values, only about the keys.
      
      Reviewed-by: imsnah
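
      A minimal sketch of deriving the two extra values from an LVM
      attribute string, assuming the six-character lv_attr layout in which
      the fifth character is the activation state and the sixth marks an
      open device. The helper name is hypothetical:

```python
def lv_flags(attr):
    """Return (inactive, online) for a six-character lv_attr string."""
    if len(attr) < 6:
        raise ValueError("unexpected lv_attr value: %r" % (attr,))
    inactive = attr[4] == "-"  # state character is not set
    online = attr[5] == "o"    # device is open
    return inactive, online
```

      For example, "-wi-ao" describes an active volume that is in use,
      while "-wi---" describes an inactive one.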
  24. 04 Dec, 2007 3 commits
  25. 03 Dec, 2007 1 commit
  26. 14 Nov, 2007 1 commit
    • When an assembly error occurs log it too · 20a0c9ef
      Guido Trotter authored
      Right now an assembly error produces an exception but not a log message. This
      is bad because the exception suggests looking at the log, but the log itself
      has a lot of errors which are not really a problem and only some which really
      are. In order to make it clear where in the log the problem occurred we log a
      message too, before raising the exception.
      
      Reviewed-by: iustinp
  27. 12 Nov, 2007 1 commit
    • Fix a wrong comparison in _RecursiveAssembleBD · 7803d4d3
      Iustin Pop authored
      We want to prevent sending too many 'None' children to a device.
      However, the test as it is today is wrong, as we want to test the
      situation after adding a new child, and not before. This patch fixes
      this by testing greater-or-equal instead of just greater.
      
      Reviewed-by: imsnah
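
      The off-by-one can be illustrated with a simplified sketch. Here
      child assembly is modelled as callables that either return a device
      or raise RuntimeError; the function and parameter names are
      hypothetical and this is not the actual _RecursiveAssembleBD code:

```python
def collect_children(assemblers, max_missing):
    """Assemble children, tolerating at most max_missing failures."""
    children = []
    for assemble in assemblers:
        try:
            dev = assemble()
        except RuntimeError:
            # We are about to append one more None, so compare the count
            # *before* appending with >=, not with >.
            if children.count(None) >= max_missing:
                raise
            dev = None
        children.append(dev)
    return children
```

      With a plain greater-than test, a second failure would still be
      accepted when max_missing is 1, and two None children would be
      passed to the device.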
  28. 09 Nov, 2007 1 commit
  29. 07 Nov, 2007 1 commit
    • Enhance secondary node replace for drbd8 · 0834c866
      Iustin Pop authored
      This (big) patch does two things:
        - add "local disk status" to the block device checks
          (BlockDevice.GetSyncStatus and the rpc calls that call this
          function, and therefore cmdlib._CheckDiskConsistency)
        - improve the drbd8 secondary replace operation using the above
          functionality
      
      The "local disk status" adds a new variable to the result of
      GetSyncStatus that shows the degradation of the local storage of the
      device. Of course, not all devices support this - for now, we only modify
      LogicalVolumes and DRBD8 to return degraded in some cases; other devices
      always return non-degraded. This variable should be a subset of
      is_degraded - whenever this variable is true, is_degraded should
      also be true.
      
      The drbd8 secondary replace uses this variable as we don't care if the
      primary drbd device is network-degraded, only if it has good local disk
      data (ldisk is False).
      
      The patch also increases the protocol version (due to rpc changes).
      
      Reviewed-by: imsnah
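
      The relation between the two flags can be sketched as follows. The
      tuple layout and the helper names are assumptions for illustration;
      only is_degraded and ldisk are names taken from the message above:

```python
from collections import namedtuple

# Hypothetical result layout; the field order is assumed.
SyncStatus = namedtuple(
    "SyncStatus", ["sync_percent", "estimated_time", "is_degraded", "ldisk"])

def invariant_holds(status):
    """ldisk implies is_degraded (ldisk is a subset of is_degraded)."""
    return status.is_degraded or not status.ldisk

def local_data_ok(status):
    """True if local storage is good, even if the network is degraded."""
    return not status.ldisk
```

      A primary that is only network-degraded (is_degraded true, ldisk
      false) therefore still passes the check used by the secondary
      replace.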