1. 18 Jun, 2009 1 commit
  2. 15 Jun, 2009 1 commit
  3. 27 May, 2009 1 commit
    • Add a node powercycle command · f5118ade
      Iustin Pop authored
      
      
      This (somewhat big) patch adds support for remotely rebooting the nodes
      via whatever support the hypervisor has for such a concept.
      
      For KVM/fake (and containers in the future) this just uses sysrq plus a
      ‘reboot’ call if the sysrq method fails. For Xen, it first tries the
      above and then a Xen-hypervisor reboot (we try sysrq first since that
      just requires opening a file handle, whereas a Xen reboot means
      launching an external utility).
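
      A rough sketch of that fallback chain in Python (illustrative only: the
      function name, the delay handling and the fallback command are the
      generic Linux mechanics described above, not the exact Ganeti code):

          import os
          import time

          def PowercycleNode(delay=5):
            """Hard-reboot the local node: sysrq first, 'reboot' as fallback."""
            time.sleep(delay)              # let the RPC reply go out first
            try:
              # the sysrq method only needs a file handle on /proc
              fd = open("/proc/sysrq-trigger", "w")
              fd.write("b")                # 'b' requests an immediate reboot
              fd.close()
            except EnvironmentError:
              # sysrq unavailable or denied: fall back to the external utility
              os.system("reboot -f")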
      
      The user interface is:
      
          # gnt-node powercycle node5
          Are you sure you want to hard powercycle node node5?
          y/[n]/?: y
          Reboot scheduled in 5 seconds
      
      The node hopefully reboots after sending the reply. In case the clock is
      broken, “time.sleep(5)” might take ages (but then I suspect SSL
      negotiation wouldn't work).
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
  4. 03 Feb, 2009 1 commit
    • An attempt at fixing some encoding issues · 26f15862
      Iustin Pop authored
      This patch unifies the hardcoded re-encoding attempts into a single
      function in utils.py. This function takes either a unicode or str
      object and converts it to an ASCII-only str object which can be safely
      displayed and transmitted.
      
      We then replace the current manual re-encodings with this function. In
      mcpu we stop re-encoding the hooks output and instead do it right at
      hook generation in backend.py.
      
      This passes on my 'custom' lvs output with non-ASCII chars. But there
      are probably other places we will need to fix.
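
      A minimal sketch of what such a helper could look like (illustrative,
      Python 2-era code; the exact name and escaping policy of the real
      function in utils.py may differ):

          def SafeEncode(text):
            """Return an ASCII-only str safe to display and transmit."""
            if isinstance(text, unicode):
              return text.encode("ascii", "backslashreplace")
            # a plain str: latin-1 decodes any byte, then escape non-ASCII ones
            return text.decode("latin-1").encode("ascii", "backslashreplace")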
      
      Reviewed-by: ultrotter
  5. 13 Jan, 2009 1 commit
    • Forward port the live migration from 1.2 branch · 53c776b5
      Iustin Pop authored
      This is a forward port via copy (not a cherry-pick of individual
      patches) of the latest code on the 1.2 branch related to the migration.
      
      The changes compared to 1.2 are the fact that we don't need the
      IdentifyDisks step anymore (the drbd rpc calls are independent now), and
      the rpc module improvements.
      
      Reviewed-by: ultrotter
  6. 12 Jan, 2009 1 commit
  7. 05 Dec, 2008 1 commit
    • Make cluster verify understand offline nodes · 0a66c968
      Iustin Pop authored
      This patch changes cluster verify to not alert on offline nodes, but
      instead just show a note at the end with the number of such nodes.
      
      It also removes warnings in verify-disks and hooks about failures to
      make rpc calls to such nodes.
      
      Reviewed-by: ultrotter
  8. 02 Dec, 2008 2 commits
    • Convert rpc results to a custom type · 781de953
      Iustin Pop authored
      For a long time we had the problem that both RPC-layer errors and
      results from the remote node shared the same "value space". This is
      because we shouldn't raise an exception when only one node failed (and
      lose the results from the other nodes).
      
      This patch attempts to address this problem by returning a special
      object from RPC calls, which separates the rpc-layer status and the
      remote results into different attributes.
      
      All the users of rpc (mainly cmdlib, but also bootstrap and the
      HooksMaster in mcpu) have been converted to this new model. The code has
      changed from, e.g. for boolean return types:
      
        if not self.rpc.call_...
      
      to
      
        result = self.rpc.call_...
        if result.failed or not result.data:
            # result.failed -> rpc-layer error
            # result.data   -> result payload
      
      While this is slightly more complicated, it will allow cleaner checks in
      the future; right now the code is just a plain port, without
      optimizations.
      
      There's also a "result.Raise()" which raises an OpExecError if the
      rpc-layer had errors.
      
      One side-effect of the patch is that now all return types from the
      rpc.call_ functions are of either RpcResult (single-node) or dicts of
      (node name, RpcResult); previously, some functions were returning
      different object types based on error status.
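
      A rough sketch of the wrapper described above (illustrative: the
      constructor arguments and the exact error raised are guesses based on
      this message, not the real class):

          class RpcResult(object):
            """Wraps the outcome of one RPC call to one node."""
            def __init__(self, data=None, failed=False, node=None, call=None):
              self.failed = failed   # True only for rpc-layer (transport) errors
              self.data = data       # payload returned by the remote node, if any
              self.node = node
              self.call = call

            def Raise(self):
              """Abort if the rpc layer itself failed."""
              if self.failed:
                # the real code raises errors.OpExecError
                raise RuntimeError("Call %r to node %s failed" %
                                   (self.call, self.node))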
      
      The code passes burnin (after many retries :).
      
      Reviewed-by: imsnah
    • Add a gnt-node modify operation · b31c8676
      Iustin Pop authored
      This patch adds the OpCode, LogicalUnit and gnt-node command for
      modifying node parameters, more specifically the master candidate flag
      for a node.
      
      Reviewed-by: imsnah
  9. 24 Nov, 2008 1 commit
  10. 21 Oct, 2008 1 commit
    • Improve the mcpu.Processor logging routines · c0088fb9
      Iustin Pop authored
      As discussed previously, many of the routines in cmdlib.py are using
      logging functions as a carry-over from 1.2 (when these also showed the
      message on stderr/to the user), instead of actually warning the user.
      
      This patch extends the syntax of Processor.LogInfo/LogWarning to make
      them easier to use.
      
      Reviewed-by: imsnah
  11. 20 Oct, 2008 1 commit
  12. 10 Oct, 2008 1 commit
    • Convert rpc module to RpcRunner · 72737a7f
      Iustin Pop authored
      This big patch changes the call model used in inter-node RPC from
      standalone function calls in the rpc module to calls via an RpcRunner
      class that holds all the methods. This can be used in the future to
      enable smarter processing in the RPC layer itself (a quick example:
      not setting the DiskID from cmdlib code, but only once in each rpc
      call, etc.).
      
      There are a few RPC calls that are made outside of the LU code, and
      these calls are left as staticmethods, so they can be used without a
      class instance (which requires a ConfigWriter instance).
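
      An illustrative outline of the new shape (the method names and bodies
      here are placeholders, not the real rpc module API):

          class RpcRunner(object):
            """Holds the inter-node RPC calls, built around a ConfigWriter."""
            def __init__(self, cfg):
              self._cfg = cfg          # e.g. to set the DiskID once per rpc call

            def call_instance_info(self, node, instance):
              """Instance method: may consult self._cfg before calling out."""
              pass                     # issue the actual RPC here

            @staticmethod
            def call_node_start_master(node, start_daemons):
              """Staticmethod: usable outside LU code, without a ConfigWriter."""
              pass                     # issue the actual RPC here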
      
      Reviewed-by: imsnah
  13. 07 Oct, 2008 1 commit
    • Implement job 'waiting' status · e92376d7
      Iustin Pop authored
      Background: when we have multiple jobs in the queue (more than just a
      few), many of the jobs (up to the number of threads) will be in state
      'running', although many of them could be actually blocked, waiting for
      some locks. This is not good, as one cannot easily see what is
      happening.
      
      The patch extends the opcode/job possible statuses with another one,
      waiting, which shows that the LU is in the acquire locks phase. The
      mechanism for doing so is simple, we initialize (in the job queue) the
      opcode with OP_STATUS_WAITLOCK, and when the processor is ready to give
      control to the LU's Exec, it will call a notifier back into the
      _JobQueueWorker that sets the opcode status to OP_STATUS_RUNNING (with
      the proper queue locking). Because this mechanism does not save the job,
      all opcodes on disk will be in status WAITLOCK and not RUNNING anymore,
      so we also change the load sequence to consider WAITLOCK as RUNNING.
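
      Roughly, the notification could look like this (an illustrative sketch;
      the real hook between jqueue and mcpu may be shaped differently, and the
      queue is assumed to expose a lock-style acquire/release):

          from ganeti import constants

          class _OpStatusNotifier(object):
            """Flips one opcode from 'waiting' to 'running' (illustrative)."""
            def __init__(self, queue, op):
              self.queue = queue
              self.op = op

            def __call__(self):
              # invoked by the Processor right before handing control to Exec
              self.queue.acquire()
              try:
                self.op.status = constants.OP_STATUS_RUNNING  # was OP_STATUS_WAITLOCK
              finally:
                self.queue.release()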
      
      With the patch applied, creating in parallel (via burnin) five instances
      on a five node cluster shows that only two are executing, while three
      are waiting for locks.
      
      Reviewed-by: imsnah
  14. 01 Oct, 2008 3 commits
  15. 11 Sep, 2008 2 commits
    • Implement adding/removal of locks by declaration · ca2a79e1
      Guido Trotter authored
      With this patch LUs can declare locks to be added when they start
      and/or removed after they finish. For now, locks can only be added in
      the acquired state and removed if owned; added locks default to being
      removed again unless some action is taken.
      
      Reviewed-by: imsnah
    • Use is_owned to determine whether to unlock · 80ee04a4
      Guido Trotter authored
      Now that is_owned is public we don't need to play games at the end of
      an LU. If we still own anything, we just release it.
      
      Reviewed-by: imsnah
  16. 09 Sep, 2008 1 commit
    • Processor: remove ChainOpCode · b2751b57
      Guido Trotter authored
      This function was incompatible with the new locking system, and its
      usage has been removed from the code. For now LUs share code by calling
      common module-private functions in cmdlib.py; in the future they will
      use tasklets (once those are implemented).
      
      Reviewed-by: iustinp
  17. 28 Aug, 2008 1 commit
    • Fix issue when acquiring empty lock sets · 6683bba2
      Guido Trotter authored
      By design if an empty list of locks is acquired from a set, no locks are
      acquired, and thus release() cannot be called on the set. On the other
      hand if None is passed instead of the list, the whole set is acquired,
      and must later be released. When acquiring whole empty sets, a release
      must happen too, because the set-lock is acquired.
      
      Since we used to overwrite the required locks (needed_locks) with the
      acquired ones, we weren't able to distinguish the two cases (empty list
      of locks required, and all locks required, but an empty list returned
      because the set is empty). Valid solutions include:
        (1) forbidding the acquire of empty lists of locks
        (2) skipping the acquire/release on empty lists of locks
        (3) separating the to-acquire and the acquired list
      
      This patch implements the third approach, and thus LUs will find
      acquired locks in the acquired_locks dict, rather than in needed_locks.
      The LUs which used this feature before have been updated. This makes
      things easier because it doesn't force LUs to do more checks on corner
      cases, which are easy to forget (as option (1) would), and it allows
      more flexibility if we want LUs to release (part of) their locks (which
      is still a possibly scary operation, but anyway). This easily combines
      with (2), should we choose to implement it.
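
      In code, the distinction could look roughly like this (an illustrative
      stand-in LU; only needed_locks/acquired_locks come from this message):

          from ganeti import locking

          class LUExample(object):            # stand-in for a LogicalUnit
            def ExpandNames(self):
              # None means "acquire the whole set" (and hence its set-lock)
              self.needed_locks = {locking.LEVEL_NODE: None}

            def CheckPrereq(self):
              # what was actually obtained; [] here can still mean the
              # set-lock is held (empty set), so a release must follow
              return self.acquired_locks[locking.LEVEL_NODE]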
      
      Reviewed-by: imsnah
  18. 18 Aug, 2008 1 commit
    • Processor: lock all levels even if one is missing · 8a2941c4
      Guido Trotter authored
      If a locking level wasn't specified, locking used to stop there. This
      meant that if one, for example, didn't specify anything at the
      LEVEL_INSTANCE level, no locks at the LEVEL_NODE level were acquired
      either. With this patch we force _LockAndExecLU to be called for all
      existing levels, and break the recursion if the level doesn't exist in
      locking.LEVELS.
      
      Reviewed-by: imsnah
  19. 30 Jul, 2008 5 commits
    • ChainOpCode is still BGL-only · 64381ad7
      Guido Trotter authored
      Prevent mistakes with an assert.
      
      Reviewed-by: iustinp
    • Fix pylint-detected issues · 38206f3c
      Iustin Pop authored
      This is mostly:
        - whitespace fixes (space at EOL in some files, not all, broken
          indentation, etc.)
        - variable names shadowing others (one of these is a real bug)
        - too-long lines
        - cleanup of most unused imports (not all)
      
      Reviewed-by: ultrotter
    • Make sharing locks possible · 3977a4c1
      Guido Trotter authored
      LUs can declare which locks they need by populating the
      self.needed_locks dictionary, but those locks are always acquired as
      exclusive. Make it possible to acquire shared locks as well, by
      declaring a particular level as shared in the self.share_locks
      dictionary. By default this dictionary is populated so that all locks
      are acquired exclusively.
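
      A short illustrative example of the declaration (the 0/1 convention and
      the pre-population shown in __init__ are assumptions based on this
      description):

          from ganeti import locking

          class LUExampleQuery(object):       # stand-in for a LogicalUnit
            def __init__(self):
              # the framework defaults every level to exclusive (0)
              self.share_locks = dict.fromkeys(locking.LEVELS, 0)
              self.needed_locks = {}

            def ExpandNames(self):
              self.needed_locks[locking.LEVEL_NODE] = None  # all node locks...
              self.share_locks[locking.LEVEL_NODE] = 1      # ...acquired shared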
      
      Reviewed-by: iustinp
    • Add LogicalUnit.DeclareLocks · fb8dcb62
      Guido Trotter authored
      This additional LogicalUnit function is optional to implement; it lets
      you change your locking needs for one level just before locking it,
      after the previous levels have already been locked. It is useful, for
      example, to calculate which nodes to lock after locking an instance.
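
      For example (an illustrative sketch; the helper at the end is
      hypothetical):

          from ganeti import locking

          class LUExampleInstanceOp(object):  # stand-in for a LogicalUnit
            def DeclareLocks(self, level):
              """Called just before `level` is locked; earlier levels are held."""
              if level == locking.LEVEL_NODE:
                # the instance lock is already held, so its node list is stable
                self.needed_locks[locking.LEVEL_NODE] = self._InstanceNodes()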
      
      Reviewed-by: iustinp
    • Rework master startup/shutdown/failover · b1b6ea87
      Iustin Pop authored
      This (big) patch reworks the master startup/shutdown and fixes the
      master failover.
      
      What does the patch do?
      
      For master start/stop:
        - remove the old ganeti-master script and its associated man page
        - moves the ip start/stop directly into the backend.(Start|Stop)Master
        - adds start/stop of the master/rapi daemon into these functions,
          selectively based on the start/stop arguments
        - makes the master call via rpc StartMaster(start_daemons=False) to
          the local node so that the master IP is started
        - and finally changes the example init.d script to directly start and
          stop all three daemons, since they do the right thing (depending on
          master/not master role)
      
      For master failover:
        - moves the code from LUMasterFailover into bootstrap.MasterFailover,
          since we need to start/stop the master during this operation and
          thus it can't be executed from the master
        - removes the LUMasterFailover and its associated opcode
      
      Note: Ubuntu's /etc/lsb-base-logging.sh is dumb, so the 'not master'
      messages are not seen during startup on non-master nodes.
      
      Reviewed-by: ultrotter
  20. 23 Jul, 2008 1 commit
    • Invert nodes/instances locking order · 04e1bfaf
      Guido Trotter authored
      An implementation mistake from the original design caused nodes to be
      locked before instances, rather than after. This patch inverts the level
      numbering, changing also the relevant unittests and the recursive
      locking function starting point.
      
      Reviewed-by: iustinp
  21. 14 Jul, 2008 1 commit
    • First version of user feedback fixes · f1048938
      Iustin Pop authored
      This patch contains a raw version for fixing feedback_fn.
      
      The new mechanism works as follows:
        - instead of a per-Processor feedback_fn, there's one for each
          ExecOpCode, so that feedback for different opcodes goes via possibly
          different functions
        - each _QueuedOpCode gets a message buffer, a method for adding
          feedback and a method for retrieving (parts of) the feedback
        - the _QueuedJob object gets a new attribute that is equal to the
          index of the currently executing opcode
        - job queries get an extra parameter called 'ticker' that will return
          the latest message on the current executing opcode
        - the cli.py job completion poll will show the new status if different
          from the old one
      
      Of course, quick messages will be lost, as currently only the latest one
      is available. Also changes between opcodes are not represented at all.
      
      Reviewed-by: imsnah
  22. 08 Jul, 2008 4 commits
    • Processor: Acquire locks before executing an LU · 68adfdb2
      Guido Trotter authored
      If we're running in a "new style" LU we may need some locks, as required
      by the ExpandNames function, to be able to run. We'll walk up the lock
      levels present in the needed_locks dictionary and acquire them, then run
      the actual LU. LUs can release some or all the acquired locks, if they
      want, before terminating, provided they update their needed_locks
      dictionary appropriately, so that we know not to release a level if they
      have already done so.
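
      Roughly, the walk over lock levels could look like this (an
      illustrative sketch of the idea, written as a free function; the real
      Processor code is a method and may differ in detail):

          from ganeti import locking

          def _LockAndExecLU(proc, lu, level):
            """Acquire the locks declared for `level`, recurse, run the LU."""
            if level not in locking.LEVELS:
              return proc._ExecLU(lu)            # no deeper level: run Exec
            if level in lu.needed_locks:
              proc.context.glm.acquire(level, lu.needed_locks[level])
              try:
                return _LockAndExecLU(proc, lu, level + 1)
              finally:
                # skip the release if the LU already released this level
                # itself and updated needed_locks to say so
                if level in lu.needed_locks:
                  proc.context.glm.release(level)
            else:
              return _LockAndExecLU(proc, lu, level + 1)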
      
      Reviewed-by: iustinp
    • LogicalUnit: add ExpandNames function · d465bdc8
      Guido Trotter authored
      New concurrent LUs will need to call ExpandNames so that any names
      passed in by the user are canonicalized, and can be used by hooks,
      locking and other parts of the code. This was done in CheckPrereq
      before, but it's now split out, as it's needed for locking, which
      CheckPrereq in turn needs. Old LUs can be converted gradually.
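
      For instance (an illustrative sketch; the config lookup and the error
      class shown stand for whatever canonicalization the real LUs perform):

          from ganeti import locking, errors

          class LUExampleStartup(object):     # stand-in for a LogicalUnit
            def ExpandNames(self):
              # canonicalize user input once, so hooks and locking both see
              # the full name
              full_name = self.cfg.ExpandInstanceName(self.op.instance_name)
              if full_name is None:
                raise errors.OpPrereqError("Instance '%s' not known" %
                                           self.op.instance_name)
              self.op.instance_name = full_name
              self.needed_locks = {locking.LEVEL_INSTANCE: [full_name]}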
      
      Reviewed-by: iustinp
    • Processor: Move LU execution to its own method · 36c381d7
      Guido Trotter authored
      This makes the try...finally code simpler, and helps adding a more
      complex locking structure before the actual execution. It also fixes a
      concurrency bug caused by the fact that write_count was read before
      acquiring the BGL, and thus spurious config-update hook runs could have
      been triggered. This doesn't solve the issue of running config update
      hooks for concurrent LUs.
      
      Reviewed-by: iustinp
    • Pass context to LUs · 77b657a3
      Guido Trotter authored
      Rather than passing a ConfigWriter to the LUs we'll pass the whole
      context, from which a ConfigWriter can be extracted, but we can also
      access the GanetiLockManager. This also fixes the places where a FakeLU
      is created.
      
      Reviewed-by: iustinp
  23. 01 Jul, 2008 3 commits
    • Context: s/GLM/glm/ · 984f7c32
      Guido Trotter authored
      Make the GanetiLockManager instance of GanetiContext lowercase
      
      Reviewed-by: imsnah
    • Processor: acquire the BGL for LUs requiring it · 04864530
      Guido Trotter authored
      If an LU requires the BGL (all LUs do right now, by default) we'll
      acquire it in the Processor before starting it. For LUs that don't,
      we'll still acquire it, but in a shared fashion, so that they cannot
      run together with LUs that do.
      
      We'll also note down whether we own the BGL exclusively, and if we don't
      and we try to chain a LU that does, we'll fail.
      
      More work will need to be done, of course, to convert LUs not to
      require the BGL, but this basic infrastructure should guarantee the
      coexistence of the old and new world for the time being.
      
      Reviewed-by: iustinp
    • Processor: pass context in and use it. · 1c901d13
      Guido Trotter authored
      The processor used to create a new ConfigWriter when it was
      initialized. We now have one in the context, so we'll just recycle it.
      First of all we'll pass the context in when creating a new Processor
      object; then we'll just use context.cfg, which is guaranteed to be
      initialized, wherever we used self.cfg, and stop checking whether the
      config is already initialized or not.
      
      In the future the Processor will be able to use the context also to
      acquire the BGL for LUs that require it, and to push the context down to
      LUs that don't in order for them to manage their own locking.
      
      Reviewed-by: iustinp
  24. 30 Jun, 2008 1 commit
    • Fix sstore handling in Processor · c6868e1d
      Guido Trotter authored
      - no need to keep the sstore as an object member, remove it
      - don't reinitialize sstore only if self.cfg is None
          This is not an issue, as the Processor is recycled for every opcode,
          but in general we know that (a) we might need a different type of
          sstore for different opcodes and (b) initializing them is cheap
      - recreate sstore when chaining opcodes
          Without this fix chaining an opcode which requires a writable sstore
          to one which doesn't would fail. This doesn't happen today, but it's
          better to fix it anyway
      
      These changes are possible because nowadays all opcodes already require
      a working cluster/configuration.
      
      Reviewed-by: iustinp
  25. 23 Jun, 2008 1 commit
    • Fix gnt-cluster “command” and “copyfile” · b3989551
      Iustin Pop authored
      Since the disabling of forking in the master daemon, the two ssh-based
      subcommands were not working anymore. However, there is no need at all
      for the commands to be run from the master daemon (permissions to read
      the cluster private ssh key notwithstanding); they can be run directly
      from the command line utilities.
      
      The patch removes the two opcodes OpRunClusterCommand and
      OpClusterCopyFile (and their associated LUs) and changes the code in
      ‘gnt-cluster’ to query the list of nodes and run the SshRunner directly
      over the list. As such, all forking is done from the gnt-cluster
      script, and the commands are working again.
      
      Reviewed-by: imsnah
  26. 17 Jun, 2008 1 commit
    • Implement disk grow at LU level · 8729e0d7
      Iustin Pop authored
      This patch adds a new opcode and LU for growing an instance's disk.
      
      The opcode allows growing only one disk at a time, and will throw an
      error if the operation fails midway (e.g. on the primary node after it
      has been increased on the secondary node). As such, it might actually
      leave different-sized LVs on different nodes, but this will not create
      problems.
      
      Reviewed-by: imsnah
  27. 16 Jun, 2008 1 commit
    • Move SetKey to WritableSimpleStore and use it · 05f86716
      Guido Trotter authored
      Before, we were able to update SimpleStore by just calling SetKey; this
      feature is now moved to a separate class, which inherits from it. In
      this patch the new WritableSimpleStore class is also put to use, in the
      LUs that need it. Rather than making each LU instantiate it, we have a
      new LogicalUnit flag REQ_WSSTORE which defaults to False, but when
      declared to be True asks the LogicalUnit to be initialized with a
      writable version of the SimpleStore. LUMasterFailover and
      LURenameCluster are then changed to use it.
      
      InitCluster is also changed to instantiate a WritableSimpleStore, rather
      than a normal one.
      
      Reviewed-by: imsnah