  1. 13 Jan, 2010 1 commit
  2. 04 Jan, 2010 4 commits
  3. 28 Dec, 2009 1 commit
  4. 25 Nov, 2009 1 commit
  5. 06 Nov, 2009 2 commits
    • Processor: support a unique execution id · adfa97e3
      Guido Trotter authored
      
      
      When the processor is executing a job, it can export the execution id to
      its callers. This is not supported for Queries, as they're not executed
      in a job.
      Signed-off-by: Guido Trotter <ultrotter@google.com>
      Reviewed-by: Iustin Pop <iustin@google.com>
      adfa97e3
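      A minimal sketch of the idea above in Python, with illustrative names only
      (the real Processor lives in Ganeti's mcpu.py and may differ): the
      processor is constructed with the execution id of the job it is running
      and exposes it to callers; queries run outside a job, so no id is
      available for them.

          class Processor(object):
              def __init__(self, context, ec_id=None):
                  self.context = context
                  # Unique execution id of the job being run; None for queries.
                  self._ec_id = ec_id

              def GetECId(self):
                  """Return the execution id, failing if none was set."""
                  if self._ec_id is None:
                      raise AssertionError("Tried to use execution id when not set")
                  return self._ec_id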
    • Fix pylint 'E' (error) codes · 6c881c52
      Iustin Pop authored
      
      
      This patch adds some silences and tweaks the code slightly so that
      “pylint --rcfile pylintrc -e ganeti” doesn't give any errors.
      
      The biggest change is in jqueue.py: the move of _RequireOpenQueue out of
      the JobQueue class. Since it is actually a function and not a method (it
      is never used as one), this makes sense and also silences two pylint
      errors.
      
      Another real code change is in utils.py, where FieldSet.Matches will now
      return None instead of False on failure; this still works with the way
      the class/method is used, and makes more sense (it more closely resembles
      the re.match return value).
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
      6c881c52
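      A hedged sketch of the Matches change described above (simplified, not
      the exact utils.FieldSet code): returning the regular-expression match
      object, and hence None on failure, mirrors the re.match convention while
      remaining truthy for callers that only test the result.

          import re

          class FieldSet(object):
              """Simplified illustration of a set of field-name patterns."""

              def __init__(self, *items):
                  self.items = [re.compile("^%s$" % value) for value in items]

              def Matches(self, field):
                  # Return the match object for a known field, None otherwise
                  # (None rather than False, as re.match itself does).
                  for pattern in self.items:
                      match = pattern.match(field)
                      if match:
                          return match
                  return None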
  6. 03 Nov, 2009 1 commit
  7. 12 Oct, 2009 1 commit
  8. 25 Sep, 2009 1 commit
  9. 17 Sep, 2009 1 commit
  10. 15 Sep, 2009 4 commits
  11. 07 Sep, 2009 1 commit
    • Optimise multi-job submit · 009e73d0
      Iustin Pop authored
      
      
      Currently, on multi-job submits we simply iterate over the
      single-job-submit function. This means we grab a new serial, write and
      replicate (and wait for the remote nodes to ack) the serial file, and
      only then create the job file; this is repeated N times, once for each
      job.
      
      Since job identifiers are ‘cheap’, it's simpler to grab a block of new
      IDs at the start, write and replicate the serial count file a single
      time, and then proceed with the jobs as before. This is a cheap change
      that reduces I/O and slightly reduces the CPU consumption of the master
      daemon: submit time seems to be cut in half for big batches of jobs, and
      the masterd CPU time drops by somewhere between 15% and 50% (I can't get
      consistent numbers).
      
      Note that this doesn't change anything for single-job submits, and most
      probably not for submits of fewer than five jobs either.
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>
      009e73d0
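      A rough sketch of the optimisation above, under assumed names (the queue
      attributes and the replication helper are illustrative, not the real
      jqueue API): reserve a whole block of job ids with a single serial-file
      write and replication, then create the job files one by one.

          def reserve_job_ids(queue, count):
              """Reserve `count` consecutive job ids with one serial-file update."""
              assert count > 0
              serial = queue.last_serial + count
              # One write and one replication to the other master candidates,
              # no matter how many jobs are being submitted.
              queue.write_and_replicate(queue.serial_file, "%s\n" % serial)
              queue.last_serial = serial
              return list(range(serial - count + 1, serial + 1))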
  12. 03 Sep, 2009 1 commit
  13. 27 Aug, 2009 1 commit
  14. 03 Aug, 2009 2 commits
  15. 19 Jul, 2009 4 commits
    • job queue: fix loss of finalized opcode result · 34327f51
      Iustin Pop authored
      
      
      Currently, unclean master daemon shutdown overwrites all of a job's
      opcode status and result with error/None. This is incorrect, since any
      any already finished opcode(s) should have their status and result
      preserved, and only not-yet-processed opcodes should be marked as
      ‘error’. Cancelling jobs between opcodes does the same (but this is not
      allowed currently by the code, so it's not as important as unclean
      shutdown).
      
      This patch adds a new _QueuedJob function that only overwrites the
      status and result of finalized opcodes, which is then used in job queue
      init and in the cancel job functions. The patch also adds some comments
      and a new set of constants in constants.py highlighting the finalized vs.
      non-finalized opcode statuses.
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
      34327f51
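      A minimal sketch of the new helper described above, assuming illustrative
      constant and attribute names (the real ones live in constants.py and
      jqueue.py): only opcodes that have not reached a final status are
      overwritten; finished ones keep their status and result.

          OP_STATUS_QUEUED = "queued"
          OP_STATUS_SUCCESS = "success"
          OP_STATUS_ERROR = "error"
          OP_STATUS_CANCELED = "canceled"

          # The finalized statuses: once an opcode is in one of these, its
          # status and result must never be overwritten.
          OPS_FINALIZED = frozenset([OP_STATUS_SUCCESS,
                                     OP_STATUS_ERROR,
                                     OP_STATUS_CANCELED])

          def mark_unfinished_ops(job, status, result):
              """Overwrite status/result only for not-yet-finalized opcodes."""
              for op in job.ops:
                  if op.status not in OPS_FINALIZED:
                      op.status = status
                      op.result = result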
    • Add a luxi call for multi-job submit · 56d8ff91
      Iustin Pop authored
      
      
      As a workaround for the job submit timeouts that we have, this patch
      adds a new luxi call for multi-job submit; the advantage is that all the
      jobs are added to the queue first, and only afterwards can the workers
      start processing them.
      
      This is definitely faster than per-job submit, where the submission of
      new jobs competes with the workers processing jobs.
      
      On a pure no-op OpDelay opcode (not on master, not on nodes), we have:
        - 100 jobs:
          - individual: submit time ~21s, processing time ~21s
          - multiple:   submit time 7-9s, processing time ~22s
        - 250 jobs:
          - individual: submit time ~56s, processing time ~57s
                        run 2:      ~54s                  ~55s
          - multiple:   submit time ~20s, processing time ~51s
                        run 2:      ~17s                  ~52s
      
      which shows that we indeed gain on the client side, and maybe even on
      the total processing time for a high number of jobs. For just 10 or so
      jobs I expect the difference to be just noise.
      
      This will probably require increasing the timeout a little when
      submitting too many jobs - 250 jobs at ~20 seconds is close to the
      current rw timeout of 60s.
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
      (cherry picked from commit 2971c913)
      56d8ff91
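      An illustrative client-side sketch of the batching described above (the
      method names are assumptions about the luxi client, not a definitive
      API): a single batched call lets the queue assign ids and write all job
      files before the workers start competing for the queue lock.

          def submit_jobs(client, jobs):
              """Submit a batch of jobs, each given as a list of opcodes."""
              if hasattr(client, "SubmitManyJobs"):
                  # One luxi request for the whole batch.
                  return client.SubmitManyJobs(jobs)
              # Fallback: the old behaviour, one request (and one serial-file
              # write) per job, which competes with the running workers.
              return [client.SubmitJob(ops) for ops in jobs]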
    • job queue: fix interrupted job processing · f6424741
      Iustin Pop authored
      
      
      If a job with more than one opcode is being processed and the master
      daemon crashes between two opcodes, we have the first N opcodes marked
      successful and the rest marked as queued. This means that the overall
      job status is queued, and thus on master daemon restart it will be
      picked up again for completion.
      
      However, the RunTask() function in jqueue.py doesn't deal with
      partially-completed jobs. This patch makes it simply skip the opcodes
      that have already completed.
      
      An alternative option would be to not mark partially-completed jobs as
      QUEUED but instead RUNNING, which would result in aborting of the job at
      restart time.
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
      f6424741
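      A simplified sketch of the fix above (not the real RunTask loop; names
      are illustrative): when a partially-completed job is picked up again
      after a restart, opcodes that already finished are skipped rather than
      executed a second time.

          OP_STATUS_SUCCESS = "success"

          def run_job(job, execute_opcode):
              """Run a job's opcodes, skipping the ones already completed."""
              for op in job.ops:
                  if op.status == OP_STATUS_SUCCESS:
                      # Completed before the interruption; keep its result.
                      continue
                  op.result = execute_opcode(op)
                  op.status = OP_STATUS_SUCCESS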
    • Fix an error path in job queue worker's RunTask · ed21712b
      Iustin Pop authored
      
      
      In case the job fails, we try to set the job's run_op_idx to -1.
      However, this is the wrong attribute name, which wasn't detected until
      the __slots__ addition. The correct attribute is run_op_index.
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
      ed21712b
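      A small self-contained illustration of why __slots__ exposed the typo
      mentioned above (a toy class, not the real _QueuedJob): with __slots__,
      assigning to a misspelled attribute raises AttributeError instead of
      silently creating a new instance attribute.

          class QueuedJob(object):
              __slots__ = ["run_op_index"]

              def __init__(self):
                  self.run_op_index = 0

          job = QueuedJob()
          job.run_op_index = -1      # correct attribute, works
          try:
              job.run_op_idx = -1    # the old, misspelled name
          except AttributeError as err:
              print("caught: %s" % err)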
  16. 17 Jul, 2009 1 commit
  17. 07 Jul, 2009 1 commit
  18. 15 Jun, 2009 1 commit
  19. 21 May, 2009 1 commit
    • Add a luxi call for multi-job submit · 2971c913
      Iustin Pop authored
      
      
      As a workaround for the job submit timeouts that we have, this patch
      adds a new luxi call for multi-job submit; the advantage is that all the
      jobs are added to the queue first, and only afterwards can the workers
      start processing them.
      
      This is definitely faster than per-job submit, where the submission of
      new jobs competes with the workers processing jobs.
      
      On a pure no-op OpDelay opcode (not on master, not on nodes), we have:
        - 100 jobs:
          - individual: submit time ~21s, processing time ~21s
          - multiple:   submit time 7-9s, processing time ~22s
        - 250 jobs:
          - individual: submit time ~56s, processing time ~57s
                        run 2:      ~54s                  ~55s
          - multiple:   submit time ~20s, processing time ~51s
                        run 2:      ~17s                  ~52s
      
      which shows that we indeed gain on the client side, and maybe even on
      the total processing time for a high number of jobs. For just 10 or so
      jobs I expect the difference to be just noise.
      
      This will probably require increasing the timeout a little when
      submitting too many jobs - 250 jobs at ~20 seconds is close to the
      current rw timeout of 60s.
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
      2971c913
  20. 12 Feb, 2009 1 commit
    • job queue: log the opcode error too · 0f6be82a
      Iustin Pop authored
      Currently we only log "Error in opcode ...", but we don't log the error itself.
      This is not good for debugging.
      
      Reviewed-by: ultrotter
      0f6be82a
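      A tiny hedged sketch of the change above (the real call sits in the job
      queue's error path; the opcode name below is made up): log the exception
      itself alongside the opcode, not just the fact that an error occurred.

          import logging

          try:
              raise ValueError("simulated opcode failure")
          except Exception as err:   # illustrative only
              # Before, only "Error in opcode ..." was logged; now include err.
              logging.error("Error in opcode OP_EXAMPLE: %s", err)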
  21. 28 Jan, 2009 1 commit
    • Fix some issues related to job cancelling · df0fb067
      Iustin Pop authored
      This patch fixes two issues with the cancel mechanism:
        - cancelled jobs show as such, and not in error state (we mark them as
          OP_STATUS_CANCELED and not OP_STATUS_ERROR)
        - queued jobs which are cancelled no longer raise errors in the master
          (we now handle OP_STATUS_CANCELED)
      
      Reviewed-by: imsnah
      df0fb067
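      A hedged sketch of the cancel path described above, with illustrative
      names (heavily simplified from the real jqueue code): a job that is
      still queued is marked CANCELED rather than ERROR, and callers treat
      CANCELED as a normal final state instead of raising.

          OP_STATUS_QUEUED = "queued"
          OP_STATUS_CANCELED = "canceled"

          def cancel_job(job):
              """Cancel a job that has not started running yet."""
              if all(op.status == OP_STATUS_QUEUED for op in job.ops):
                  for op in job.ops:
                      op.status = OP_STATUS_CANCELED
                      op.result = "Job canceled by request"
                  return True
              # Already running; cannot be cancelled cleanly at this point.
              return False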
  22. 27 Jan, 2009 1 commit
  23. 20 Jan, 2009 1 commit
    • Update the logging output of job processing · d21d09d6
      Iustin Pop authored
      (this is related to the master daemon log)
      
      Currently it's not possible to follow (in the non-debug runs) the
      logical execution thread of jobs. This is because we don't log the
      thread name (so we lose the association of log messages with jobs) and
      we don't log the start/stop of job and opcode execution.
      
      This patch adds a new parameter to utils.SetupLogging that enables
      thread name logging, and promotes some log entries from debug to info.
      With this applied, it's easier to understand which log messages relate
      to which jobs/opcodes.
      
      The patch also moves the "INFO client closed connection" entry to debug
      level, since it's not a very informative log entry.
      
      Reviewed-by: ultrotter
      d21d09d6
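      A minimal sketch of the thread-name logging described above, assuming a
      SetupLogging-style helper (the parameter name below is illustrative):
      adding %(threadName)s to the format string ties each message to the
      worker thread, and hence to the job/opcode it is processing.

          import logging

          def setup_logging(program, multithreaded=False):
              fmt = "%(asctime)s " + program + " "
              if multithreaded:
                  fmt += "%(threadName)s "
              fmt += "%(levelname)s %(message)s"
              logging.basicConfig(level=logging.INFO, format=fmt)

          setup_logging("ganeti-masterd", multithreaded=True)
          logging.info("job 1234: starting opcode 1/2")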
  24. 15 Jan, 2009 1 commit
    • Some docstring updates · 25e7b43f
      Iustin Pop authored
      This patch rewraps some comments to shorter lengths and changes
      double quotes to single quotes inside triple-quoted docstrings for
      better editor handling.
      
      It also fixes some epydoc errors, namely invalid cross-references
      (after a method rename), documentation for nonexistent (removed)
      parameters, etc.
      
      Reviewed-by: ultrotter
      25e7b43f
  25. 18 Dec, 2008 5 commits