1. 30 Jun, 2009 1 commit
  2. 29 Jun, 2009 2 commits
  3. 23 Jun, 2009 2 commits
  4. 17 Jun, 2009 3 commits
    • Fix handling of 'vcpus' in instance list · c1ce76bb
      Iustin Pop authored
      
      
      Currently running “gnt-instance list -o+vcpus” fails with a cryptic message:
        Unhandled Ganeti error: vcpus
      
      This is due to multiple issues:
        - in some corner cases cmdlib.py raises an errors.ParameterError but
          this is not handled by cli.py
        - LUQueryInstances declares ‘vcpus’ as a supported field, but doesn't handle
          it, so instead of failing with an unknown-parameter error, e.g.:
            Failure: prerequisites not met for this operation:
            Unknown output fields selected: vcpuscd
          it raises the ParameterError message
      
      This patch:
        - adds handling of 'vcpus' to LUQueryInstances
        - adds handling of the ParameterError exception to cli.py
        - changes the 'else: raise errors.ParameterError' in the field handling of
          LUQueryInstances to an assert, since it's a programmer error if we reach
          this step
      
      With this, a future unhandled parameter will show:
        gnt-instance list -o+vcpus
        Unhandled protocol error while talking to the master daemon:
        Caught exception: Declared but unhandled parameter 'vcpus'
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
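The split between "unknown field" (a user error) and "declared but unhandled field" (a programmer error) can be sketched as below. This is a minimal illustration, not the actual LUQueryInstances code: DECLARED_FIELDS, check_fields and field_value are hypothetical names, and the instance is modelled as a plain dict.

```python
# Hypothetical field table; the real list of supported fields lives in cmdlib.py.
DECLARED_FIELDS = frozenset(["name", "vcpus", "memory"])


def check_fields(selected):
    """Reject fields that were never declared (a user error)."""
    unknown = [f for f in selected if f not in DECLARED_FIELDS]
    if unknown:
        raise ValueError("Unknown output fields selected: %s" % ",".join(unknown))


def field_value(instance, field):
    """Return the value of one declared field for an instance dict."""
    if field == "name":
        return instance["name"]
    elif field == "vcpus":
        return instance["beparams"]["vcpus"]
    elif field == "memory":
        return instance["beparams"]["memory"]
    else:
        # Reaching this branch means the field was declared but nobody
        # wrote a handler for it: a programmer error, not a user error.
        raise AssertionError("Declared but unhandled parameter '%s'" % field)
```

Raising AssertionError explicitly (rather than using a bare assert statement) keeps the check active even when Python runs with -O.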
    • Fix checking for valid OS in instance create · 6dfad215
      Iustin Pop authored
      
      
      The current check in LUCreateInstance.CheckPrereq() is wrong - it only checks
      if we got an OS, but not if we got a valid OS. This patch fixes it.
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
    • Show disk size in instance info · c98162a7
      Iustin Pop authored
      
      
      The size of the instance's disk was not shown in “gnt-instance info”.
      This patch adds it and formats it nicely if possible.
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
  5. 16 Jun, 2009 6 commits
  6. 15 Jun, 2009 1 commit
  7. 11 Jun, 2009 1 commit
  8. 08 Jun, 2009 4 commits
    • Enable striped LVs · fecbe9d5
      Iustin Pop authored
      
      
      This patch enables striped LVs, falling back to non-striped if the
      striped creation fails. If the configure-time lvm-stripecount is 1,
      this patch becomes a no-op (with an insignificant Python-level overhead,
      but no extra LVM calls).
      
      The effect of this patch is that new instances will get striped LVs
      from the start, whereas old instances will have their LVs striped as
      soon as replace-disks is run for them.
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
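The fallback described above can be sketched roughly as follows. This is an assumption-laden illustration rather than the code in bdev.py: create_lv is a hypothetical helper, and run is a hypothetical callable that executes an argument list and returns True on success. The only LVM options used are lvcreate's -L (size), -n (name) and -i (number of stripes).

```python
def create_lv(run, vg_name, lv_name, size_mb, stripes):
    """Create an LV, trying a striped layout first when stripes > 1.

    run is a hypothetical callable: run(argv) -> True on success.
    """
    base = ["lvcreate", "-L%dm" % size_mb, "-n", lv_name]
    if stripes > 1:
        # Try the striped layout first.
        if run(base + ["-i", str(stripes), vg_name]):
            return True
    # Either striping is disabled (stripe count of 1, so no extra LVM
    # call was made) or the striped attempt failed: create a plain LV.
    return run(base + [vg_name])
```

With a stripe count of 1 only a single lvcreate call is issued, which matches the "no extra LVM calls" remark in the commit message.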
    • Add a lvm stripecount configure parameter · 3736cb6b
      Iustin Pop authored
      
      
      This patch adds a configure-time customizable parameter that will be
      used to enable striped LVs. The default value of the parameter is 3.
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
    • Add more constants for DRBD and change sync tests · 3c003d9d
      Iustin Pop authored
      
      
      This patch adds constants for the connection status, peer roles and disk
      status, and it changes the rules for when the disk is considered to be
      “resyncing”: previously this was only for syncsource/synctarget, but
      there are many other transient statuses which could be misinterpreted as
      ‘degraded’ (because they were not considered as resyncing, but the disk
      is not consistent in these statuses).
      
      Furthermore, cmdlib.py:WaitForSync determines if a device is syncing or
      not based on sync_percent being not None. Not all DRBD resync statuses
      offer a percent done, so if we are syncing but don't have a sync
      percent, we'll report a zero sync percent (and no time estimate).
      
      The patch also removes a few unused variables (is_sync_target,
      peer_sync_target, is_resync) whose value doesn't make sense anymore with
      the new sync rules.
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>
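A rough sketch of the new rule follows. The set of states treated as resyncing is an assumption made for illustration (the names are DRBD 8 connection states, but the commit message does not list which ones the patch covers), and sync_status is a hypothetical helper, not part of bdev.DRBD8Status.

```python
# Assumed set of DRBD 8 connection states that count as "resyncing";
# the exact membership used by the patch is not stated in the message.
RESYNC_STATES = frozenset([
    "SyncSource", "SyncTarget",
    "WFBitMapS", "WFBitMapT",
    "WFSyncUUID",
    "StartingSyncS", "StartingSyncT",
    "PausedSyncS", "PausedSyncT",
])


def sync_status(cstate, sync_percent):
    """Return (is_resyncing, sync_percent) for a connection state.

    When the device is resyncing but DRBD reports no percentage,
    report zero so that callers testing 'sync_percent is not None'
    still see the device as syncing.
    """
    resyncing = cstate in RESYNC_STATES
    if resyncing and sync_percent is None:
        sync_percent = 0.0
    return resyncing, sync_percent
```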
    • Merge branch 'master' into next · 5ce92cd3
      Iustin Pop authored
      * master:
        Wait for a while in failed resyncs
        Fix two issues with exports and snapshot errors
  9. 04 Jun, 2009 1 commit
    • Wait for a while in failed resyncs · fbafd7a8
      Iustin Pop authored
      
      
      This patch is an attempt at fixing some very rare occurrences of messages like:
        - "There are some degraded disks for this instance", or:
        - "Cannot resync disks on node node3.example.com: [True, 100]"
      
      What I believe happens is that drbd has finished syncing, but not all
      fields are updated in 'Connected' state; maybe it's in WFBitmap[ST], or
      in some other transient state we don't handle well.
      
      The patch will change the _WaitForSync method to recheck up to a
      hardcoded number of times if we're finished syncing but we're degraded
      (using the same condition as the 'break' clause of the loop).
      
      The downside of this change is that a normal, really-degraded state due to
      network or disk failure will cause an extra delay before it aborts. For
      this, I'm happy to choose other values.
      
      A better, long-term fix is to handle more DRBD states correctly (see the
      bdev.DRBD8Status class).
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
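The recheck logic could look roughly like this. It is a sketch under stated assumptions, not the actual _WaitForSync: poll is a hypothetical callable returning a (done, degraded) pair, and the retry count and interval are placeholder values, since the message only says the number is hardcoded.

```python
import time


def wait_for_sync(poll, degraded_retries=10, interval=1.0, sleep=time.sleep):
    """Wait until the disks are synced and not degraded.

    poll is a hypothetical callable returning (done, degraded).
    Returns True on a clean sync, False if the disks remain degraded
    after the allowed number of rechecks.
    """
    retries_left = degraded_retries
    while True:
        done, degraded = poll()
        if done and not degraded:
            return True
        if done and degraded:
            # Sync reports finished but the device still looks degraded;
            # this may be a transient DRBD state, so recheck a few times.
            if retries_left == 0:
                return False
            retries_left -= 1
        sleep(interval)
```

The cost noted in the commit message is visible here: a truly degraded device is only reported after degraded_retries extra polling intervals.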
  10. 03 Jun, 2009 2 commits
    • Assemble DRBD using the known size · f069addf
      Iustin Pop authored
      
      
      This patch changes DRBD disk attachment to force the wanted size, as opposed to
      letting the device auto-discover its size.
      
      This should make the disks more resilient with regard to small differences in
      size (e.g. due to LVM rounding). This still works with regard to disk
      growth, but the instance needs to be fully restarted (including disks)
      in that case.
      
      This passes a full burnin without problems, but it's still a tricky
      change - if the config.data is not synced with reality, we might
      tell DRBD a wrong size. At least this will fail outright (and not
      introduce silent errors), as DRBD (per a quick check at the sources)
      tracks the size in the meta-dev and also does not allow shrinking
      consistent devices.
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
    • Fix two issues with exports and snapshot errors · a97da6b7
      Iustin Pop authored
      
      
      This patch fixes two issues related to failed snapshots during exports:
        - first, the error messages used disk.logical_id[1], which is a node
          name for DRBD, and it resulted in strange error messages like
          "cannot snapshot block device node1 on node2"
        - second, if snapshotting fails for any disk, rpc.call_finalize_export
          fails as it didn't handle booleans (backend.FinalizeExport does)
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
  11. 28 May, 2009 3 commits
    • Set the size on new DRBDs in replace secondary · 8a6c7011
      Iustin Pop authored
      
      
      Currently the code in cmdlib doesn't set the device size on new DRBD
      devices in replace secondary, but we need to do so, otherwise it gets
      initialized to None.
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>
    • Change the bdev init signatures · 464f8daf
      Iustin Pop authored
      
      
      This patch changes all the bdev.BlockDev constructors to take an
      additional ‘size’ parameter, changes all the backend functions that call
      those constructors to pass it, and also changes backend.BlockdevCreate() to
      use disk.size directly instead of the size passed via the rpc call (this is
      the only way it's called).
      
      Note that this patch doesn't do anything with this parameter, just
      stores it on the blockdev objects.
      
      With the patch, we actually have a more uniform init sequence (previously
      create had the parameter, but the other functions did not).
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
    • Merge branch 'next' · 2cd855dd
      Iustin Pop authored
      
      
      * next: (34 commits)
        watcher: automatically restart noded/rapi
        watcher: handle full and drained queue cases
        rapi: rework error handling
        Fix backend.OSEnvironment be/hv parameters
        rapi: make tags query not use jobs
        Change failover instance when instance is stopped
        Export more instance information in hooks
        watcher: write the instance status to a file
        Fix the SafeEncoding behaviour
        Move more hypervisor strings into constants
        Add -H/-B startup parameters to gnt-instance
        call_instance_start: add optional hv/be parameters
        Fix gnt-job list argument handling
        Instance reinstall: don't mix up errors
        Don't check memory at startup if instance is up
        gnt-cluster modify: fix --no-lvm-storage
        LUSetClusterParams: improve volume group removal
        gnt-cluster info: show more cluster parameters
        LUQueryClusterInfo: return a few more fields
        Add the new DRBD test files to the Makefile
        ...
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
  12. 27 May, 2009 1 commit
  13. 25 May, 2009 5 commits
    • watcher: automatically restart noded/rapi · c4f0219c
      Iustin Pop authored
      
      
      This patch makes the watcher automatically restart the node and rapi
      daemons, if they are not running (as per the PID file).
      
      This is not an exhaustive test; a better one would be TCP connect to the
      port, and an even better one a simple protocol ping (e.g. get / for rapi
      and a rpc_call_alive for noded), but since we don't know how they've
      been started we can't implement it today. rapi would need to write the
      SSL/port to a file, and noded something similar, so that we know how to
      connect.
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>
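A PID-file liveness check of the kind described above might be sketched as follows. The helper names (read_pid, is_running, ensure_running) and the start callable are hypothetical and not the watcher's actual functions; the check relies on the POSIX convention that sending signal 0 tests for process existence without delivering anything.

```python
import os


def read_pid(pidfile):
    """Return the PID stored in pidfile, or None if unreadable."""
    try:
        with open(pidfile) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return None


def is_running(pidfile):
    """Check whether the process named in the PID file exists."""
    pid = read_pid(pidfile)
    if pid is None or pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0: existence check only
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def ensure_running(pidfile, start):
    """Call start() if the daemon is not running; return True if started."""
    if is_running(pidfile):
        return False
    start()
    return True
```

As the commit message notes, this only shows that a process with that PID exists; it does not prove the daemon is actually answering requests.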
    • watcher: handle full and drained queue cases · 24edc6d4
      Iustin Pop authored
      
      
      Currently the watcher is broken when the queue is full, thus not
      fulfilling its job as a queue cleaner. It also doesn't handle the
      queue drained status nicely.
      
      This patch makes a few changes:
        - first archive jobs, and only afterwards submit jobs; this fixes the case
          where the queue is already full and there are jobs suited for
          archiving (but not the case where the jobs are all too young to be
          archived)
        - handle the job queue full and drained cases nicely: instead of
          tracebacks, log such cases cleanly
        - reverse the initial value and special cases for update_file; we now
          whitelist instead of blacklist cases, since we have many more
          blacklist cases than whitelist ones, and we set the flag to True only
          after the run is successful
      
      The last change, especially, is a significant one: now errors during the
      watcher run will not update the status file, and thus they won't be lost
      again in the logs.
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>
    • rapi: rework error handling · 59b4eeef
      Iustin Pop authored
      
      
      Currently the rapi code doesn't have any custom error handling; any
      exceptions raised are simply converted into an HTTP 500 error, without
      much explanation.
      
      This patch adds a couple of generic SubmitJob/GetClient functions that
      handle some errors specially so that they are transformed into HTTP
      errors, with more detailed information.
      
      With this patch, the errors returned by rapi when the queue is full or
      drained, or when the master is down, are more readable.
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
    • Fix backend.OSEnvironment be/hv parameters · 030b218a
      Iustin Pop authored
      Commit 67fc3042 added some more
      variables to be exported to OSEnvironment, but it has two bugs:
        - wrong variable name (env vs. result)
        - in OSEnvironment we don't have the automatic conversion to strings
          that we do in hooks, so we must manually enforce this
      
      With this patch instance creations work again.
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
    • rapi: make tags query not use jobs · 25e39bfa
      Iustin Pop authored
      
      
      Currently the rapi tags query implementation is similar to the command
      line one: it submits OpGetTags jobs. This is not good, since as an
      API it can be used a lot and can pollute the job queue with many such
      trivial jobs.
      
      This patch converts it to use either queries (for nodes/instances) or
      direct read from ssconf (for the cluster case). For ssconf, we added a
      function to the ssconf.SimpleStore class for reading the tags.
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>
  14. 21 May, 2009 2 commits
    • Change failover instance when instance is stopped · d27776f0
      Iustin Pop authored
      
      
      Currently, if the instance is stopped, we still check for enough memory
      on the target node. This is a little bit too strict, since in case too
      many nodes have failed and one is out of memory, this prevents
      fixing the cluster (with the instances down).
      
      We change it to do the memory checks only when the instance will be
      started.
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
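In sketch form, the relaxed prerequisite amounts to the following. check_failover_memory and its parameters are hypothetical names chosen for illustration, not the LU's real interface.

```python
def check_failover_memory(will_start, free_mb, needed_mb):
    """Raise if the target node lacks memory for an instance to be started.

    will_start is False for a stopped instance, in which case the
    memory check is skipped entirely.
    """
    if not will_start:
        return
    if free_mb < needed_mb:
        raise RuntimeError(
            "Not enough memory on target node: %d MiB free, %d MiB needed"
            % (free_mb, needed_mb))
```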
    • Export more instance information in hooks · 67fc3042
      Iustin Pop authored
      
      
      Currently hooks lack the instance's hypervisor, hypervisor
      parameters and backend parameters. This forces hooks to query back into
      ganeti, which is dangerous due to possible luxi sockets exhaustion.
      
      This patch adds these three as INSTANCE_HYPERVISOR, INSTANCE_HV_*,
      INSTANCE_BE_*. The hook environment prefixes all keys with “GANETI”, so
      the default settings for a xen-pvm instance would be:
      
        GANETI_INSTANCE_HV_initrd_path=
        GANETI_INSTANCE_HV_kernel_args=ro
        GANETI_INSTANCE_HV_kernel_path=/boot/vmlinuz-2.6-xenU
        GANETI_INSTANCE_HV_root_path=/dev/sda1
      
      Any dashes in parameter names are changed to underscores, since
      variables with dashes are not easy to access from the shell
      (alternatively we could deny those via a unittest for constants.py).
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
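The key construction can be illustrated with a short sketch. build_hv_env is a hypothetical helper, and the "boot-order" parameter in the usage note is made up to show the dash handling; only the GANETI_INSTANCE_HV_ naming and the dash-to-underscore rule come from the message above.

```python
def build_hv_env(hvparams, prefix="GANETI_"):
    """Turn hypervisor parameters into hook environment variables.

    Dashes in parameter names become underscores so the variables
    are easy to reference from a shell; values are forced to strings.
    """
    env = {}
    for key, value in hvparams.items():
        name = "%sINSTANCE_HV_%s" % (prefix, key.replace("-", "_"))
        env[name] = str(value)
    return env
```

For example, build_hv_env({"root_path": "/dev/sda1", "boot-order": "cd"}) yields the keys GANETI_INSTANCE_HV_root_path and GANETI_INSTANCE_HV_boot_order.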
  15. 20 May, 2009 3 commits
  16. 19 May, 2009 3 commits
    • Fix the SafeEncoding behaviour · d392fa34
      Iustin Pop authored
      
      
      Currently we have bad behaviour in SafeEncode:
        - binary strings are actually not handled correctly (ahem)
        - the encoding is not stable, due to use of string_escape
      
      For this reason, we replace the use of string_escape with part of the
      code of string escape (PyString_Repr in Objects/stringobject.c); we
      don't escape backslashes or single quotes, since that is what makes it
      non-stable. Furthermore, we only use the encode('ascii', ...) for unicode
      inputs.
      
      The patch also adds unittests for the function that test basic
      behaviour.
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
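The stability property can be shown with a Python 3 sketch. This is not the patched SafeEncode (which targets Python 2 byte strings); safe_encode here is a hypothetical re-creation of the idea, and the choice of the backslashreplace error handler is an assumption.

```python
def safe_encode(text):
    """Return a printable-ASCII rendering of text.

    Control characters are escaped, but backslashes and single quotes
    are left untouched, so encoding an already-encoded string is a no-op.
    """
    # Non-ASCII characters become escapes such as \xe9 or \u20ac.
    ascii_text = text.encode("ascii", "backslashreplace").decode("ascii")
    out = []
    for char in ascii_text:
        if char == "\t":
            out.append("\\t")
        elif char == "\n":
            out.append("\\n")
        elif char == "\r":
            out.append("\\r")
        elif char < " " or char == "\x7f":
            out.append("\\x%02x" % ord(char))
        else:
            out.append(char)  # includes backslash and single quote
    return "".join(out)
```

Because the output contains only printable ASCII and backslashes are never doubled, safe_encode(safe_encode(s)) equals safe_encode(s), which is the stability the commit message refers to.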
    • Move more hypervisor strings into constants · 835528af
      Iustin Pop authored
      
      
      This patch adds constants for the mouse and boot order strings; while
      there are still some issues remaining, we're trying to clean up hardcoded
      strings from the hypervisors.
      
      Since the formatting of frozensets is currently wrong, we also add a
      utility function for this and change all the error messages to use it.
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
    • watcher: try to restart the master if down · 7dfb83c2
      Iustin Pop authored
      
      
      Bugs in either our code or in associated libraries can bring the master daemon
      down, and this (due to the 2.0 architecture) stops all work on the cluster.
      
      Since the watcher already does periodic checks on the cluster, we modify
      it to try to start the master automatically in case of failures to
      connect. This will be tried only once per cycle.
      
      Also, in this case, we modify the code so that the watcher status file
      is not updated - its timestamp will thus reflect the time of the last
      successful connection to the master.
      
      Side note: the except errors.ConfigurationError part could be cleaned
      up, since in 2.0 we don't usually get that directly, and if we do it's
      an error and we shouldn't touch the file anyway; but that is not an rc5
      change.
      Signed-off-by: Iustin Pop <iustin@google.com>