  1. 07 Oct, 2008 1 commit
    • Implement job 'waiting' status · e92376d7
      Iustin Pop authored
      Background: when we have multiple jobs in the queue (more than just a
      few), many of the jobs (up to the number of threads) will be in state
      'running', although many of them could be actually blocked, waiting for
      some locks. This is not good, as one cannot easily see what is
      happening.
      
      The patch extends the opcode/job possible statuses with another one,
      waiting, which shows that the LU is in the acquire locks phase. The
      mechanism for doing so is simple: we initialize (in the job queue) the
      opcode with OP_STATUS_WAITLOCK, and when the processor is ready to give
      control to the LU's Exec, it will call a notifier back into the
      _JobQueueWorker that sets the opcode status to OP_STATUS_RUNNING (with
      the proper queue locking). Because this mechanism does not save the job,
      all opcodes on disk will be in status WAITLOCK and not RUNNING anymore,
      so we also change the load sequence to consider WAITLOCK as RUNNING.
      
      With the patch applied, creating in parallel (via burnin) five instances
      on a five node cluster shows that only two are executing, while three
      are waiting for locks.
      
      Reviewed-by: imsnah
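
The status hand-off described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the actual Ganeti code: only OP_STATUS_WAITLOCK and OP_STATUS_RUNNING are named in the message, while the string values, QueuedOpCode, process_opcode, run_in_worker and status_on_load are hypothetical names chosen for the example, and queue locking is left out.

```python
# Status names follow the commit message; the string values are assumed.
OP_STATUS_QUEUED = "queued"
OP_STATUS_WAITLOCK = "waiting"
OP_STATUS_RUNNING = "running"
OP_STATUS_SUCCESS = "success"


class QueuedOpCode:
    """Hypothetical stand-in for an opcode stored in the job queue."""

    def __init__(self, name):
        self.name = name
        self.status = OP_STATUS_QUEUED


def process_opcode(op, acquire_locks, exec_fn, run_notifier):
    """Hypothetical processor: lock phase, notification, then execution."""
    acquire_locks(op)   # may block for a long time
    run_notifier()      # called just before control goes to Exec
    return exec_fn(op)


def run_in_worker(op, acquire_locks, exec_fn):
    """Hypothetical worker body; real code would take the queue lock."""
    op.status = OP_STATUS_WAITLOCK

    def notify_start():
        # Not saved to disk, so on-disk state stays at WAITLOCK.
        op.status = OP_STATUS_RUNNING

    result = process_opcode(op, acquire_locks, exec_fn, notify_start)
    op.status = OP_STATUS_SUCCESS
    return result


def status_on_load(saved_status):
    """When loading from disk, treat WAITLOCK the same as RUNNING."""
    if saved_status == OP_STATUS_WAITLOCK:
        return OP_STATUS_RUNNING
    return saved_status
```

Passing the notifier as a callback keeps the processor unaware of the queue: it only signals that the lock phase is over.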
  2. 06 Oct, 2008 2 commits
    • Implement job auto-archiving · 07cd723a
      Iustin Pop authored
      This patch adds a new luxi call that implements auto-archiving of jobs
      older than a certain age (or -1 for all completed jobs), and the gnt-job
      command that makes use of this (with 'all' for -1).
      
      Reviewed-by: imsnah
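
As an illustration of the age rule described in this commit message, the selection logic might look like the sketch below. The function names, the job dictionary layout, the list of final statuses and the use of seconds as the age unit are all assumptions made for the example; only the -1 and 'all' conventions come from the message.

```python
import time


def select_jobs_to_archive(jobs, age, now=None):
    """Return IDs of finished jobs older than `age` seconds.

    `jobs` maps job ID -> (status, end_timestamp); this layout is
    hypothetical. An age of -1 selects every finished job.
    """
    if now is None:
        now = time.time()
    finished = ("success", "error", "canceled")  # assumed final statuses
    selected = []
    for job_id, (status, end_ts) in sorted(jobs.items()):
        if status not in finished:
            continue  # never archive jobs that are still pending
        if age == -1 or now - end_ts > age:
            selected.append(job_id)
    return selected


def parse_age(text):
    """Map the command-line value 'all' to -1, otherwise parse a number."""
    return -1 if text == "all" else int(text)
```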
    • Increase the number of threads to 25 · 1daae384
      Iustin Pop authored
      Since our locks are not gathered nicely, we can have jobs that are
      actually blocking on locks (parallel burnin shows this), so at least we
      need to increase the number of threads above the usual number of jobs we
      could have in such a case.
      
      Reviewed-by: imsnah
  3. 30 Sep, 2008 1 commit
    • Enhance the job-related timestamps · c56ec146
      Iustin Pop authored
      This patch adds start, stop, and received timestamps for jobs (and allows
      querying of them), and allows querying of the opcode timestamps.
      
      Reviewed-by: imsnah
  4. 29 Sep, 2008 3 commits
    • Add opcode execution log in job info · 5b23c34c
      Iustin Pop authored
      This patch adds the job execution log in “gnt-job info” and also allows
      its selection in “gnt-job list” (however here it's not very useful as
      it's not easy to parse). It does this by adding a new field in the query
      job call, named ‘oplog’.
      
      With this, one can get a very clear examination of the job. What remains
      to be added would be timestamps for start/stop of the processing for the
      job itself and its opcodes.
      
      Reviewed-by: imsnah
    • Implement job summary in gnt-job list · 60dd1473
      Iustin Pop authored
      It is not currently possible to show a summary of the job in the output
      of “gnt-job list”. The closest is listing the whole opcode(s), but that
      is too verbose. Also, the default output (id, status) is not very
      useful, unless one looks for (and knows about) an exact job ID.
      
      The patch adds a “summary” description of a job composed of the list of
      OP_ID of the individual opcodes. Moreover, if an opcode has a ‘logical’
      target in a certain opcode field (e.g. start instance has the instance
      name as the target), then it is included in the formatting also. It's
      easier to explain via a sample output:
      
      gnt-job list
      ID Status  Summary
      1  error   NODE_QUERY
      2  success NODE_ADD(gnta2)
      3  success CLUSTER_QUERY
      4  success NODE_REMOVE(gnta2.example.com)
      5  error   NODE_QUERY
      6  success NODE_ADD(gnta2)
      7  success NODE_QUERY
      8  success OS_DIAGNOSE
      9  success INSTANCE_CREATE(instance1.example.com)
      10 success INSTANCE_REMOVE(instance1.example.com)
      11 error   INSTANCE_CREATE(instance1.example.com)
      12 success INSTANCE_CREATE(instance1.example.com)
      13 success INSTANCE_SHUTDOWN(instance1.example.com)
      14 success INSTANCE_ACTIVATE_DISKS(instance1.example.com)
      15 error   INSTANCE_CREATE(instance2.example.com)
      16 error   INSTANCE_CREATE(instance2.example.com)
      17 success INSTANCE_CREATE(instance2.example.com)
      18 success INSTANCE_ACTIVATE_DISKS(instance1.example.com)
      19 success INSTANCE_ACTIVATE_DISKS(instance2.example.com)
      20 success INSTANCE_SHUTDOWN(instance1.example.com)
      21 success INSTANCE_SHUTDOWN(instance2.example.com)
      
      This is done by a simple change to the opcode classes, which allows an
      opcode to format itself. The additional function is small enough that it
      can go in opcodes.py, where it could also be used by a client if needed.
      
      Reviewed-by: imsnah
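
The idea of an opcode formatting itself can be sketched as below. This is an illustrative reconstruction based only on the sample output above: the class names, the OP_DSC_FIELD attribute, the summary method, the "OP_" prefix on OP_ID and the comma used to join several opcodes are assumptions, not a copy of opcodes.py.

```python
class OpCode:
    """Hypothetical base class; OP_DSC_FIELD names the 'logical' target."""
    OP_ID = "OP_ABSTRACT"
    OP_DSC_FIELD = None

    def __init__(self, **fields):
        for key, value in fields.items():
            setattr(self, key, value)

    def summary(self):
        """Return the OP_ID without its prefix, plus the target if any."""
        text = self.OP_ID
        if text.startswith("OP_"):
            text = text[len("OP_"):]
        if self.OP_DSC_FIELD is not None:
            target = getattr(self, self.OP_DSC_FIELD, None)
            if target is not None:
                text = "%s(%s)" % (text, target)
        return text


class OpNodeQuery(OpCode):
    OP_ID = "OP_NODE_QUERY"


class OpInstanceCreate(OpCode):
    OP_ID = "OP_INSTANCE_CREATE"
    OP_DSC_FIELD = "instance_name"


def job_summary(opcodes):
    """Join the per-opcode summaries into one job summary string."""
    return ",".join(op.summary() for op in opcodes)
```

Keeping the formatting on the opcode class means a client holding only opcode objects can produce the same summary without asking the master.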
    • Nicely sort the job list · 3b87986e
      Iustin Pop authored
      Unless we decide to change the job identifiers to integers, we should at
      least sort the list returned by _GetJobIDsUnlocked.
      
      Reviewed-by: imsnah
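
The commit message does not say how the list is sorted. One plausible approach for string identifiers that contain numbers is a natural sort, sketched here with hypothetical helper names (nice_sort_key and sort_job_ids are not taken from the source).

```python
import re


def nice_sort_key(name):
    """Split a string into text and integer chunks for natural ordering."""
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", name)]


def sort_job_ids(job_ids):
    """Return job IDs so that "9" sorts before "10"."""
    return sorted(job_ids, key=nice_sort_key)
```

A plain sorted() on strings would place "10" before "9", which is why string job IDs need a numeric-aware key.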
  5. 10 Sep, 2008 1 commit
  6. 29 Aug, 2008 1 commit
    • Make WaitForJobChanges deal with long jobs · 5c735209
      Iustin Pop authored
      This patch alters the WaitForJobChanges luxi-RPC call to have a
      configurable timeout, so that the call behaves nicely with long jobs
      that have no update.
      
      We do this by adding a timeout parameter in the RPC call, and returning
      a special constant when the timeout is reached without an update. The
      luxi client will repeatedly call the WaitForJobChanges until it gets a
      real change. The timeout is hardcoded as half the RWTO value.
      
      The patch also removes an unused variable (new_state) from the
      WaitForJobChanges method.
      
      Reviewed-by: imsnah,ultrotter
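
The timeout-and-retry pattern described in this commit message can be sketched in-process as follows. The names JobState, JOB_NOTCHANGED, wait_for_changes and client_wait are hypothetical, and the sentinel is a plain Python object here; over a real RPC it would have to be a serializable constant.

```python
import threading

JOB_NOTCHANGED = object()  # hypothetical "timeout, no update" marker


class JobState:
    """Hypothetical job state guarded by a condition variable."""

    def __init__(self):
        self._cond = threading.Condition()
        self._serial = 0
        self._status = "queued"

    def update(self, status):
        with self._cond:
            self._status = status
            self._serial += 1
            self._cond.notify_all()

    def wait_for_changes(self, prev_serial, timeout):
        """Server side: block until the serial changes or the timeout expires."""
        with self._cond:
            changed = self._cond.wait_for(
                lambda: self._serial != prev_serial, timeout)
            if not changed:
                return JOB_NOTCHANGED
            return (self._serial, self._status)


def client_wait(job, prev_serial, timeout):
    """Client side: call again until a real change is reported."""
    while True:
        result = job.wait_for_changes(prev_serial, timeout)
        if result is not JOB_NOTCHANGED:
            return result
```

Because each individual call returns within the timeout, a transport-level read/write timeout longer than that value is never hit, even for jobs that stay silent for a long time.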
  7. 27 Aug, 2008 4 commits
    • jqueue: Replace normal cache dict with weakref dict · 5685c1a5
      Michael Hanselmann authored
      A job should only exist once in memory. After the cache is cleaned,
      there can still be references to a job somewhere else. If there
      are multiple instances, one can get updated while a function is
      waiting for changes on another instance. By using
      weakref.WeakValueDictionary, which automatically removes instances as
      soon as there are no strong references to them anymore, we can solve
      this problem.
      
      Reviewed-by: iustinp
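
A small sketch of the caching behaviour this commit message relies on, using the standard library weakref.WeakValueDictionary. The QueuedJob and JobCache classes are hypothetical simplifications written for this example.

```python
import gc
import weakref


class QueuedJob:
    """Hypothetical job object; plain instances support weak references."""

    def __init__(self, job_id):
        self.id = job_id
        self.status = "queued"


class JobCache:
    """Hypothetical cache that never holds a job alive by itself."""

    def __init__(self, loader):
        self._loader = loader
        self._memcache = weakref.WeakValueDictionary()

    def get(self, job_id):
        job = self._memcache.get(job_id)
        if job is None:
            job = self._loader(job_id)
            self._memcache[job_id] = job
        return job

    def __contains__(self, job_id):
        return job_id in self._memcache
```

As long as any caller still holds a job, every lookup returns that same object; the entry disappears only once the last strong reference is gone, so two diverging copies of one job cannot coexist.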
    • jqueue: Keep timestamp of opcode start and end · 70552c46
      Michael Hanselmann authored
      Reviewed-by: ultrotter
    • jqueue: Reset run_op_idx after job is done · 65548ed5
      Michael Hanselmann authored
      It can be confusing otherwise.
      
      Reviewed-by: ultrotter
    • Make sure that client programs get all messages · 6c5a7090
      Michael Hanselmann authored
      This is a large patch, but I can't figure out how to split it without
      breaking stuff. The old way of getting messages by always getting the
      last one didn't bring all messages to the client if they were added
      too fast, thereby making commands like “gnt-cluster verify” less than
      useful. These changes now introduce some sort of serial number per
      log entry to keep track of which messages a client already received. They
      also remove the log lock per opcode to make reading log entries thread
      safe.
      
      Reviewed-by: ultrotter
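
The serial-number approach described in this commit message can be illustrated with the sketch below. OpLog and LogReader are hypothetical names, and locking is omitted entirely, so this shows only how a per-entry serial lets a client catch up on every message it has not yet seen.

```python
class OpLog:
    """Hypothetical per-opcode log that numbers each entry."""

    def __init__(self):
        self._entries = []
        self._serial = 0

    def append(self, message):
        self._serial += 1
        self._entries.append((self._serial, message))

    def entries_after(self, last_serial):
        """Return all (serial, message) pairs newer than last_serial."""
        return [entry for entry in self._entries if entry[0] > last_serial]


class LogReader:
    """Hypothetical client that remembers the last serial it received."""

    def __init__(self, log):
        self._log = log
        self._last_serial = 0

    def poll(self):
        new = self._log.entries_after(self._last_serial)
        if new:
            self._last_serial = new[-1][0]
        return [message for _, message in new]
```

Compared with always fetching the last message, a burst of entries added between two polls is delivered in full rather than collapsed to the newest one.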
  8. 11 Aug, 2008 2 commits
  9. 08 Aug, 2008 3 commits
  10. 06 Aug, 2008 3 commits
  11. 05 Aug, 2008 1 commit
  12. 04 Aug, 2008 1 commit
  13. 31 Jul, 2008 2 commits
  14. 30 Jul, 2008 2 commits
    • Fix pylint-detected issues · 38206f3c
      Iustin Pop authored
      This is mostly:
        - whitespace fix (space at EOL in some files, not all, broken
          indentation, etc)
        - variable names overriding others (one is a real bug in there)
        - too-long-lines
        - cleanup of most unused imports (not all)
      
      Reviewed-by: ultrotter
    • Rewrite job queue · 85f03e0d
      Michael Hanselmann authored
      We found several issues in the old job queue implementation. It had race
      conditions, deadlocks and other deficiencies.
      
      Short summary:
      - _QueuedOpCode and _QueuedJob are now more or less data structures with a few
        utility functions. __Setup is gone.
      - DiskJobStorage and JobQueue classes merged into one to reduce code complexity.
      - One lock in JobQueue for almost everything. There's also a lock per opcode
        for log messages.
      
      Reviewed-by: iustinp
  15. 29 Jul, 2008 1 commit
  16. 28 Jul, 2008 2 commits
  17. 25 Jul, 2008 1 commit
  18. 24 Jul, 2008 2 commits
  19. 23 Jul, 2008 6 commits
    • Move code formatting job ID into a base class · ce594241
      Michael Hanselmann authored
      A later patch will add a memory based job storage class, hence this
      code is going into a separate class. It also changes the number format
      to always use at least 10 digits, allowing up to 9'999'999'999 jobs to
      be sorted without using a custom function.
      
      Reviewed-by: iustinp
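
A minimal sketch of the fixed-width formatting this commit message describes, assuming the extra digits are leading zeros (the message only says "at least 10 digits"). The function name format_job_id is hypothetical.

```python
def format_job_id(serial):
    """Format a job serial as a string with at least 10 digits.

    Zero-padding is an assumption; it makes plain string sorting
    agree with numeric order for serials up to 9999999999.
    """
    if not isinstance(serial, int) or serial < 0:
        raise ValueError("job serial must be a non-negative integer")
    return "%010d" % serial
```

With a fixed width, sorted() on the formatted strings needs no custom key function.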
    • Rename JobStorage to DiskJobStorage · 21cc1fbd
      Michael Hanselmann authored
      Reviewed-by: iustinp
    • Fix logging with string job IDs · 205d71fd
      Michael Hanselmann authored
      The job ID is now a string, hence logging must use %s instead of %d.
      
      Reviewed-by: iustinp
    • Make job ID a string · 3be9a705
      Michael Hanselmann authored
      The docstring says that _NewSerialUnlocked returns “a string
      representing the job identifier”. Until now it returned an
      integer and this patch changes it.
      
      Reviewed-by: iustinp
    • Distribute the queue serial file after each update · c3f0a12f
      Iustin Pop authored
      This patch adds distribution of the queue serial file after each write
      to it (but before a new job is created and written with that ID, and
      before a response is returned, so we should be safe from crashes in
      between).
      
      Currently it only logs if a node cannot be contacted; it should abort if
      more than 50% errors are seen.
      
      Reviewed-by: imsnah
    • Make the job storage init reuse a serial file · c4beba1c
      Iustin Pop authored
      This will be needed for master failover. If we don't have a valid queue
      directory, we need to reinitialize it, but we should keep the existing
      serial number.
      
      As such, we abstract the reading of the serial and if we find a valid
      serial, we do not reset it.
      
      Reviewed-by: imsnah
  20. 22 Jul, 2008 1 commit
    • Make argument to CleanCacheUnlocked mandatory · 57f8615f
      Michael Hanselmann authored
      Not passing the argument means it has the value None. Iterating None
      doesn't work:
        >>> "123" in None
        Traceback (most recent call last):
          File "<stdin>", line 1, in ?
        TypeError: iterable argument required
      
      Hence I rename it to "exclude" instead of "exceptions", which may be
      confusing, and make it mandatory. If one wants to clean all cache
      entries, an empty list can be passed.
      
      Reviewed-by: iustinp
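
The effect of the mandatory argument can be sketched as follows. The function clean_cache and the dictionary-based cache are hypothetical simplifications; only the parameter name "exclude" and the empty-list convention come from the commit message.

```python
def clean_cache(cache, exclude):
    """Drop every cached job whose ID is not listed in `exclude`.

    `exclude` has no default: callers wanting to clear everything
    pass an empty list explicitly.
    """
    for job_id in list(cache):  # copy keys so we can delete while looping
        if job_id not in exclude:
            del cache[job_id]
```

Passing None instead of a list would fail on the membership test as soon as the cache holds an entry, which is the error shown in the traceback above.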