Skip to content
Snippets Groups Projects
  1. Oct 16, 2008
    • Iustin Pop's avatar
      Fix job queue behaviour when loading jobs · 94ed59a5
      Iustin Pop authored
      Currently, if loading a job fails, the job queue code raises an
      exception and prevents the proper processing of the jobs in the queue.
      We change this so that unparseable jobs are instead archived (if not
      already).
      
      Reviewed-by: imsnah
      94ed59a5
    • Iustin Pop's avatar
      Add an interface for the drain flag changes/query · 3ccafd0e
      Iustin Pop authored
      This adds the set/reset in the jqueue and luxi modules, and a way to
      query it in OpQueryConfigValues, and also the comand line interface for
      it:
      $ gnt-cluster queue info
      The drain flag is unset
      $ gnt-cluster queue drain
      $ gnt-cluster queue info
      The drain flag is set
      $ gnt-cluster queue undrain
      $ gnt-cluster queue info
      The drain flag is unset
      
      The choice of making the setting via luxi and not an opcode is that
      opcodes can't be executed when drained, but we don't query via luxi
      since in the future it might become a cluster property as opposed to a
      node one.
      
      Reviewed-by: imsnah
      3ccafd0e
  2. Oct 15, 2008
    • Iustin Pop's avatar
      Implement the job queue drain flag · 686d7433
      Iustin Pop authored
      We add a (per-node) queue drain flag that blocks new job submission.
      There is not yet an interface to add/remove the flag (will come in next
      patches).
      
      Reviewed-by: imsnah
      686d7433
  3. Oct 10, 2008
    • Iustin Pop's avatar
      Convert rpc module to RpcRunner · 72737a7f
      Iustin Pop authored
      This big patch changes the call model used in internode-rpc from
      standalong function calls in the rpc module to via a RpcRunner class,
      that holds all the methods. This can be used in the future to enable
      smarter processing in the RPC layer itself (some quick examples are not
      setting the DiskID from cmdlib code, but only once in each rpc call,
      etc.).
      
      There are a few RPC calls that are made outside of the LU code, and
      these calls are left as staticmethods, so they can be used without a
      class instance (which requires a ConfigWriter instance).
      
      Reviewed-by: imsnah
      72737a7f
  4. Oct 07, 2008
    • Iustin Pop's avatar
      Implement job 'waiting' status · e92376d7
      Iustin Pop authored
      Background: when we have multiple jobs in the queue (more than just a
      few), many of the jobs (up to the number of threads) will be in state
      'running', although many of them could be actually blocked, waiting for
      some locks. This is not good, as one cannot easily see what is
      happening.
      
      The patch extends the opcode/job possible statuses with another one,
      waiting, which shows that the LU is in the acquire locks phase. The
      mechanism for doing so is simple, we initialize (in the job queue) the
      opcode with OP_STATUS_WAITLOCK, and when the processor is ready to give
      control to the LU's Exec, it will call a notifier back into the
      _JobQueueWorker that sets the opcode status to OP_STATUS_RUNNING (with
      the proper queue locking). Because this mechanism does not save the job,
      all opcodes on disk will be in status WAITLOCK and not RUNNING anymore,
      so we also change the load sequence to consider WAITLOCK as RUNNING.
      
      With the patch applied, creating in parallel (via burnin) five instances
      on a five node cluster shows that only two are executing, while three
      are waiting for locks.
      
      Reviewed-by: imsnah
      e92376d7
  5. Oct 06, 2008
    • Iustin Pop's avatar
      Implement job auto-archiving · 07cd723a
      Iustin Pop authored
      This patch adds a new luxi call that implements auto-archiving of jobs
      older than a certain age (or -1 for all completed jobs), and the gnt-job
      command that makes use of this (with 'all' for -1).
      
      Reviewed-by: imsnah
      07cd723a
    • Iustin Pop's avatar
      Increase the number of threads to 25 · 1daae384
      Iustin Pop authored
      Since our locks are not gathered nicely, we can have jobs that are
      actually blocking on locks (parallel burnin shows this), so at least we
      need to increase the number of threads above the usual number of jobs we
      could have in a such a case.
      
      Reviewed-by: imsnah
      1daae384
  6. Sep 30, 2008
    • Iustin Pop's avatar
      Enhance the job-related timestamps · c56ec146
      Iustin Pop authored
      This patch adds start, stop, and received timestamp for jobs (and allows
      querying of them), and allows querying of the opcode timestamps.
      
      Reviewed-by: imsnah
      c56ec146
  7. Sep 29, 2008
    • Iustin Pop's avatar
      Add opcode execution log in job info · 5b23c34c
      Iustin Pop authored
      This patch adds the job execution log in “gnt-job info” and also allows
      its selection in “gnt-job list” (however here it's not very useful as
      it's not easy to parse). It does this by adding a new field in the query
      job call, named ‘oplog’.
      
      With this, one can get a very clear examination of the job. What remains
      to be added would be timestamps for start/stop of the processing for the
      job itself and its opcodes.
      
      Reviewed-by: imsnah
      5b23c34c
    • Iustin Pop's avatar
      Implement job summary in gnt-job list · 60dd1473
      Iustin Pop authored
      It is not currently possibly to show a summary of the job in the output
      of “gnt-job list”. The closes is listing the whole opcode(s), but that
      is too verbose. Also, the default output (id, status) is not very
      useful, unless one looks for (and knows about) an exact job ID.
      
      The patch adds a “summary” description of a job composed of the list of
      OP_ID of the individual opcodes. Moreover, if an opcode has a ‘logical’
      target in a certain opcode field (e.g. start instance has the instance
      name as the target), then it is included in the formatting also. It's
      easier to explain via a sample output:
      
      gnt-job list
      ID Status  Summary
      1  error   NODE_QUERY
      2  success NODE_ADD(gnta2)
      3  success CLUSTER_QUERY
      4  success NODE_REMOVE(gnta2.example.com)
      5  error   NODE_QUERY
      6  success NODE_ADD(gnta2)
      7  success NODE_QUERY
      8  success OS_DIAGNOSE
      9  success INSTANCE_CREATE(instance1.example.com)
      10 success INSTANCE_REMOVE(instance1.example.com)
      11 error   INSTANCE_CREATE(instance1.example.com)
      12 success INSTANCE_CREATE(instance1.example.com)
      13 success INSTANCE_SHUTDOWN(instance1.example.com)
      14 success INSTANCE_ACTIVATE_DISKS(instance1.example.com)
      15 error   INSTANCE_CREATE(instance2.example.com)
      16 error   INSTANCE_CREATE(instance2.example.com)
      17 success INSTANCE_CREATE(instance2.example.com)
      18 success INSTANCE_ACTIVATE_DISKS(instance1.example.com)
      19 success INSTANCE_ACTIVATE_DISKS(instance2.example.com)
      20 success INSTANCE_SHUTDOWN(instance1.example.com)
      21 success INSTANCE_SHUTDOWN(instance2.example.com)
      
      This is done by a simple change to the opcode classes, which allows an
      opcode to format itself. The additional function is small enough that it
      can go in opcodes.py, where it could also be used by a client if needed.
      
      Reviewed-by: imsnah
      60dd1473
    • Iustin Pop's avatar
      Nicely sort the job list · 3b87986e
      Iustin Pop authored
      Unless we decide to change the job identifiers to integer, we should at
      least sort the list returned by _GetJobIDsUnlocked.
      
      Reviewed-by: imsnah
      3b87986e
  8. Sep 10, 2008
  9. Aug 29, 2008
    • Iustin Pop's avatar
      Make WaitForJobChanges deal with long jobs · 5c735209
      Iustin Pop authored
      This patch alters the WaitForJobChanges luxi-RPC call to have a
      configurable timeout, so that the call behaves nicely with long jobs
      that have no update.
      
      We do this by adding a timeout parameter in the RPC call, and returning
      a special constant when the timeout is reached without an update. The
      luxi client will repeatedly call the WaitForJobChanges until it gets a
      real change. The timeout is hardcoded as half the RWTO value.
      
      The patch also removes an unused variable (new_state) from the
      WaitForJobChanges method.
      
      Reviewed-by: imsnah,ultrotter
      5c735209
  10. Aug 27, 2008
    • Michael Hanselmann's avatar
      jqueue: Replace normal cache dict with weakref dict · 5685c1a5
      Michael Hanselmann authored
      A job should only exist once in memory. After the cache is cleaned,
      there can still be references to a job somewhere else. If there
      are multiple instances, one can get updated while a function is
      waiting for changes on another instance. By using
      weakref.WeakValueDictionary, which automatically removes instances as
      soon as there are no strong references to it anymore, we can solve
      this problem.
      
      Reviewed-by: iustinp
      5685c1a5
    • Michael Hanselmann's avatar
      jqueue: Keep timestamp of opcode start and end · 70552c46
      Michael Hanselmann authored
      Reviewed-by: ultrotter
      70552c46
    • Michael Hanselmann's avatar
      jqueue: Reset run_op_idx after job is done · 65548ed5
      Michael Hanselmann authored
      It can be confusing otherwise.
      
      Reviewed-by: ultrotter
      65548ed5
    • Michael Hanselmann's avatar
      Make sure that client programs get all messages · 6c5a7090
      Michael Hanselmann authored
      This is a large patch, but I can't figure out how to split it without
      breaking stuff. The old way of getting messages by always getting the
      last one didn't bring all messages to the client if they were added
      too fast, thereby making commands like “gnt-cluster verify” less than
      useful. These changes now introduce some sort a serial number per
      log entry to keep track what message a client already received. They
      also remove the log lock per opcode to make reading log entries thread
      safe.
      
      Reviewed-by: ultrotter
      6c5a7090
  11. Aug 11, 2008
  12. Aug 08, 2008
  13. Aug 06, 2008
  14. Aug 05, 2008
  15. Aug 04, 2008
  16. Jul 31, 2008
  17. Jul 30, 2008
    • Iustin Pop's avatar
      Fix pylint-detected issues · 38206f3c
      Iustin Pop authored
      This is mostly:
        - whitespace fix (space at EOL in some files, not all, broken
          indentation, etc)
        - variable names overriding others (one is a real bug in there)
        - too-long-lines
        - cleanup of most unused imports (not all)
      
      Reviewed-by: ultrotter
      38206f3c
    • Michael Hanselmann's avatar
      Rewrite job queue · 85f03e0d
      Michael Hanselmann authored
      We found several issues in the old job queue implementation. It had race
      conditions, deadlocks and other deficiencies.
      
      Short summary:
      - _QueuedOpCode and _QueuedJob are now more or less data structures with a few
        utility functions. __Setup is gone.
      - DiskJobStorage and JobQueue classes merged into one to reduce code complexity.
      - One lock in JobQueue for almost everything. There's also a lock per opcode
        for log messages.
      
      Reviewed-by: iustinp
      85f03e0d
  18. Jul 29, 2008
  19. Jul 28, 2008
  20. Jul 25, 2008
  21. Jul 24, 2008
  22. Jul 23, 2008
Loading