1. 30 Jun, 2009 1 commit
  2. 29 Jun, 2009 2 commits
  3. 17 Jun, 2009 3 commits
    • Iustin Pop's avatar
      Fix handling of 'vcpus' in instance list · c1ce76bb
      Iustin Pop authored
      
      
      Currently running “gnt-instance list -o+vcpus” fails with a cryptic message:
        Unhandled Ganeti error: vcpus
      
      This is due to multiple issues:
        - in some corner cases cmdlib.py raises an errors.ParameterError but
          this is not handled by cli.py
        - LUQueryInstances declares ‘vcpu’ as a supported field, but doesn't handle
          it, so instead of failing with unknown parameter, e.g.:
            Failure: prerequisites not met for this operation:
            Unknown output fields selected: vcpuscd
          it raises the ParameteError message
      
      This patch:
        - adds handling of 'vcpus' to LUQueryInstances
        - adds handling of the ParameterError exception to cli.py
        - changes the 'else: raise errors.ParameterError' in the field handling of
          LUQueryInstance to an assert, since it's a programmer error if we reached
          this step
      
      With this, a future unhandled parameter will show:
        gnt-instance list -o+vcpus
        Unhandled protocol error while talking to the master daemon:
        Caught exception: Declared but unhandled parameter 'vcpus'
      Signed-off-by: default avatarIustin Pop <iustin@google.com>
      Reviewed-by: default avatarGuido Trotter <ultrotter@google.com>
      c1ce76bb
    • Iustin Pop's avatar
      Fix checking for valid OS in instance create · 6dfad215
      Iustin Pop authored
      
      
      The current check in LUCreateInstance.CheckPrereq() is wrong - it only checks
      if we got an OS, but not if we got a valid OS. This patch fixes it.
      Signed-off-by: default avatarIustin Pop <iustin@google.com>
      Reviewed-by: default avatarGuido Trotter <ultrotter@google.com>
      6dfad215
    • Iustin Pop's avatar
      Show disk size in instance info · c98162a7
      Iustin Pop authored
      
      
      The size of the instance's disk was not shown in “gnt-instance info”.
      This patch adds it and formats it nicely if possible.
      Signed-off-by: default avatarIustin Pop <iustin@google.com>
      Reviewed-by: default avatarGuido Trotter <ultrotter@google.com>
      c98162a7
  4. 16 Jun, 2009 1 commit
  5. 04 Jun, 2009 1 commit
    • Iustin Pop's avatar
      Wait for a while in failed resyncs · fbafd7a8
      Iustin Pop authored
      
      
      This patch is an attempt at fixing some very rare occurrences of messages like:
        - "There are some degraded disks for this instance", or:
        - "Cannot resync disks on node node3.example.com: [True, 100]"
      
      What I believe happens is that drbd has finished syncing, but not all
      fields are updated in 'Connected' state; maybe it's in WFBitmap[ST], or
      in some other transient state we don't handle well.
      
      The patch will change the _WaitForSync method to recheck up to a
      hardcoded number of times if we're finished syncing but we're degraded
      (using the same condition as the 'break' clause of the loop).
      
      The cons of this changes is that a normal, really-degraded due to
      network or disk failure will cause an extra delay before it aborts. For
      this, I'm happy to choose other values.
      
      A better, long term fix is to handle more DRBD state correctly (see the
      bdev.DRBD8Status class).
      Signed-off-by: default avatarIustin Pop <iustin@google.com>
      Reviewed-by: default avatarGuido Trotter <ultrotter@google.com>
      fbafd7a8
  6. 03 Jun, 2009 1 commit
  7. 28 May, 2009 1 commit
  8. 21 May, 2009 2 commits
    • Iustin Pop's avatar
      Change failover instance when instance is stopped · d27776f0
      Iustin Pop authored
      
      
      Currently, if the instance is stopped, we still check for enough memory
      on the target node. This is a little bit too strict, since in case too
      many nodes have failed and one is out of the memory, this prevents
      fixing the cluster (with the instances down).
      
      We change it to do the memory checks only when the instance will be
      started.
      Signed-off-by: default avatarIustin Pop <iustin@google.com>
      Reviewed-by: default avatarGuido Trotter <ultrotter@google.com>
      d27776f0
    • Iustin Pop's avatar
      Export more instance information in hooks · 67fc3042
      Iustin Pop authored
      
      
      Currently we miss in hooks the instance's hypervisor, hypervisor
      parameters and backend parameters. This forces hooks to query back into
      ganeti, which is dangerous due to possible luxi sockets exhaustion.
      
      This patch adds these three as INSTANCE_HYPERVISOR, INSTANCE_HV_*,
      INSTANCE_BE_*. The hook environment prefixes all keys with “GANETI”, so
      a default settings for a xen-pvm instance would be:
      
        GANETI_INSTANCE_HV_initrd_path=
        GANETI_INSTANCE_HV_kernel_args=ro
        GANETI_INSTANCE_HV_kernel_path=/boot/vmlinuz-2.6-xenU
        GANETI_INSTANCE_HV_root_path=/dev/sda1
      
      Any dashes in parameter names are changed to underscores, since
      variables with dashes are not easy to access from the shell
      (alternatively we could deny those via an unittest for constants.py).
      Signed-off-by: default avatarIustin Pop <iustin@google.com>
      Reviewed-by: default avatarGuido Trotter <ultrotter@google.com>
      67fc3042
  9. 19 May, 2009 3 commits
  10. 15 May, 2009 2 commits
  11. 13 May, 2009 2 commits
  12. 12 May, 2009 1 commit
  13. 07 May, 2009 1 commit
  14. 05 May, 2009 1 commit
    • Iustin Pop's avatar
      Fix argument checking in LUSetClusterParams · 3994f455
      Iustin Pop authored
      
      
      This patch fixes two issues with LUSetClusterParams and argument
      checking.
      
      First, this LU used the wrong function name (CheckParameters instead of
      CheckArguments), which means that no parameter checking was done at all;
      this impacted the candidate_pool_size checks (the only one done at this
      stage).
      
      Second, int() can raise both ValueError and TypeError, and we should
      correctly handle both.
      Signed-off-by: default avatarIustin Pop <iustin@google.com>
      Reviewed-by: default avatarGuido Trotter <ultrotter@google.com>
      3994f455
  15. 04 May, 2009 1 commit
  16. 24 Apr, 2009 3 commits
    • Iustin Pop's avatar
      LUDiagnoseOS: change locking and error handling · a6ab004b
      Iustin Pop authored
      Since the “list OSes” call is exported via RAPI, this can be used pretty
      easily to DOS the master daemon during long jobs.
      
      The implementation of LUDiagnoseOS makes an RPC call to all nodes; we
      lock nodes here in order to prevent node removal.
      
      However, after closer examination, the worst case is:
        - we get the list of nodes from the config
        - another thread removes a node
        - our RPC queries reach the removed node
      
      As this point, if ganeti-noded is stopped or doesn't accept our queries,
      the RPC call will return failed, and in the current implementation all
      OSes will become invalid.
      
      If we change the ‘failed RPC’ handling to ignore such nodes, this allows
      us to both remove locking, and to handle transient RPC failures better
      (not invalidating all OSes).
      
      This patch does both these things, with a single drawback: in gnt-os
      diagnose, the down nodes do not appear at all. I think this is a small
      drawback, and the alternative is to add them with status failed; this
      works (3-line patch), but then the output of “list” and “diagnose” will
      no longer be consistent. As such, my proposal is to not list the nodes.
      
      Reviewed-by: ultrotter
      a6ab004b
    • Iustin Pop's avatar
      Fix verify-disks with broken volume groups · ea9ddc07
      Iustin Pop authored
      When a remote node returns invalid LVM data, we check it, but we don't
      stop and continue with the rest of the checks (which require a valid
      volume group). This raises an internal error and breaks verify disks.
      
      This seems unchanged for a long while, I don't know why it surfaced just
      recently.
      
      Reviewed-by: ultrotter
      ea9ddc07
    • Iustin Pop's avatar
      Prevent errors when xenvg is broken cluster verify · 9a198532
      Iustin Pop authored
      When vg_name is not returned at all, we currently abort with an internal
      error. This is because we don't catch KeyError.
      
      This patch adds a custom message for this case, and also adds KeyError
      to the list of catched exceptions, just for safety.
      
      On the other hand, we could also just remove this piece of code since
      it's not used at all the ["dfree"] value.
      
      Reviewed-by: ultrotter
      9a198532
  17. 15 Apr, 2009 1 commit
    • Iustin Pop's avatar
      A bunch of doc and other small fixes · 949bdabe
      Iustin Pop authored
      This patch adds a couple of both externally and internally reported
      issues:
        - missing SGML tags (Issue 54), report and patch by superdupont
        - wrong variable used in the init.d script, report and patch by
          Karsten Keil <karsten-keil@t-online.de>
        - man page for gnt-instance reinstall needs clarification (Issue 56)
        - gnt-instance man page missing --disks documentation for
          replace-disks
        - gnt-node modify help output is unclear about the -C/-D/-O input
          format, and the man page doesn't document this command at all
        - “gnt-node modify -C yes” for offline or drained nodes had wrong
          error message
        - “gnt-instance reinstall --select-os” has wrong prompt, we only
          accept a number for the OS and not the template name
      
      Reviewed-by: ultrotter
      949bdabe
  18. 09 Mar, 2009 2 commits
    • Iustin Pop's avatar
      Handle ghost instances in temp DRBD map · c614e5fb
      Iustin Pop authored
      Currently cluster-verify doesn't handle the (admitedly invalid) case where we
      have reservation for instances that were removed in the meantime.
      
      This patch adds a check for this and prevents code errors in cluster-verify in
      this case:
       * Verifying node node4.example.com (master candidate)
         - ERROR: ghost instance \'instance3.example.com\' in temporary DRBD map
      
      Reviewed-by: imsnah
      c614e5fb
    • Iustin Pop's avatar
      Fix error handling in replace-disks with new node · 82759cb1
      Iustin Pop authored
      Currently the _CreateSingleBlockDev function only raises OpExecError and not
      BlockDeviceError. This means that we don't release the instance's temporary
      minors properly, and this creates problems later if the instance is removed
      without master restart.
      
      We could just use OpExecError, but adding it and leaving
      BlockDeviceError in seems safer.
      
      Reviewed-by: imsnah
      82759cb1
  19. 02 Mar, 2009 2 commits
    • Iustin Pop's avatar
      Export tags to cluster verify hooks · 35e994e9
      Iustin Pop authored
      This patch export the cluster and node tags to the cluster verify hook
      scripts. The tags are exported as a space-separated list, which allows
      easy parsing from the shell (e.g. “for tag in $GANETI_CLUSTER_TAGS; do
      ...”) and therefore requires the previous “Don't allow spaces in tag
      names” patch.
      
      The patch also fixes a minor line length style problem.
      
      Reviewed-by: ultrotter
      35e994e9
    • Iustin Pop's avatar
      Update the iallocator documentation · 77031881
      Iustin Pop authored
      This updates the iallocator documentation to 2.0, bumps up the
      iallocator version (and moves a constants to lib/constants.py), and
      fixes a style on install.rst.
      
      Reviewed-by: ultrotter
      77031881
  20. 27 Feb, 2009 2 commits
    • Guido Trotter's avatar
      LUVerifyCluster: Handle the "no volume group" case · cc9e1230
      Guido Trotter authored
      If we're only file based and out volume group is set to "None" there's
      no point in asking nodes for their volume groups, logical volumes, and
      drbd devices, and checking those.
      
      Reviewed-by: iustinp
      cc9e1230
    • Iustin Pop's avatar
      Fix some epydoc style issues · 5fcc718f
      Iustin Pop authored
      99% of the epydoc return tags are "@return:", but each of the modified files
      had one "@returns:" line. We fix this for consistency.
      
      Reviewed-by: imsnah
      5fcc718f
  21. 25 Feb, 2009 1 commit
    • Iustin Pop's avatar
      Update some hooks settings · 2c2690c9
      Iustin Pop authored
      While reviewing the hooks document, I realised we are not correctly
      exporting the instance properties.
      
      This patch fixes:
        - export the disk and disk template in all LUs, not only (hardcoded)
          in the instance create
        - removes the instance create INSTANCE_ prefix on some non-instance
          variables (those are LU-related, not instance-related)
        - adds a couple of more variables to other LUs
      
      The hook document will be updated in a separate patch.
      
      Reviewed-by: ultrotter
      2c2690c9
  22. 24 Feb, 2009 2 commits
    • Iustin Pop's avatar
      Remove the extra_args parameter in instance start · 07813a9e
      Iustin Pop authored
      This patch removes the extra_args parameter and instead switches the
      instance to the HV_KERNEL_ARGS hypervisor option.
      
      This is a big change, but it's a needed cleanup, this extra parameter on
      all RPC calls is not generic and we also need to have a persistent value
      here.
      
      Reviewed-by: imsnah
      07813a9e
    • Iustin Pop's avatar
      Make gnt-instance info work with offline nodes · 9854f5d0
      Iustin Pop authored
      This simply makes LUQueryInstanceData return the same information as for
      a static query when one or both of the nodes are down.
      
      Reviewed-by: imsnah
      9854f5d0
  23. 16 Feb, 2009 2 commits
    • Iustin Pop's avatar
      Fix some bugs in reboot · ae48ac32
      Iustin Pop authored
      There are two issues fixed in this patch:
        - first, the recent RPC changes caused loss of data in hard reboot
          type; we weren't reporting any results from the stop/start instance
          calls;
        - second, in soft or hard reboots, we didn't initialized the disk
          physical ID; based on the last state of the instance's disks, this
          can create a failure in identifying the disks
      
      After this patch, burnin works again with reboot, and reports errors
      correctly.
      
      Reviewed-by: imsnah
      ae48ac32
    • Iustin Pop's avatar
      Convert IOErrors for /proc/drbd into our errors · f6eaed12
      Iustin Pop authored
      If /proc/drbd can't be opened, this raises an IOError, but all the
      error-handling behaviour in backend treats only BlockDeviceErrors. This
      creates a plain failure in cluster verify and in other RPC calls.
      
      This patch simply converts EnvironmentErrors into BlockDeviceErrors, and
      also changes the RPC result for NV_DRBDLIST and its handling to be able
      to show the error. The other RPC calls work by default now, due the
      existing error handling.
      
      Reviewed-by: ultrotter
      f6eaed12
  24. 13 Feb, 2009 1 commit
    • Guido Trotter's avatar
      SetInstanceParams: export nic changes to hooks · d8dcf3c9
      Guido Trotter authored
      Currently we export the old instance "as is" and any nic changes get
      lost, so hooks won't know of a different ip, bridge, or mac address.
      This patch fixes it by putting the nics in the override dict, if any
      changes are done.
      
      Reviewed-by: iustinp
      d8dcf3c9
  25. 12 Feb, 2009 1 commit
    • Guido Trotter's avatar
      LUSetInstanceParams: Fix nic handling · 5c44da6a
      Guido Trotter authored
      CheckArguments:
        Use constants.VALUE_NONE rather than hardcoding the string "none"
        If we're adding a nic fill the nic_dict with default values
        Check if the mac is syntactically valid, if we have one
        Don't allow the mac to be 'auto' when modifying a nic
      
      CheckPrereq:
        Check that bridge and mac if present in the dict are not None
          (before this wasn't handled at all)
        Generate the nic mac address here if demanded
      
      Exec:
        Do not generate nics and macs
      
      Reviewed-by: iustin
      5c44da6a