1. 25 Mar, 2011 1 commit
  2. 23 Mar, 2011 1 commit
  3. 28 Feb, 2011 1 commit
  4. 29 Dec, 2010 1 commit
  5. 15 Dec, 2010 1 commit
    • jqueue: Keep jobs in “waitlock” while returning to queue · 5fd6b694
      Michael Hanselmann authored

      Iustin Pop reported that a job's file is updated many times while it
      waits for locks held by other thread(s). After an investigation it was
      concluded that the cause was a design decision made for job priorities:
      jobs are returned to the “queued” status if they cannot acquire all
      locks. Changing a job's status or priority requires an update to
      permanent storage.
      
      At a high level, this is what happens:
      1. Mark as waitlock
      2. Write to disk as permanent storage (jobs left in this state by a
         crashing master daemon are resumed on restart)
      3. Wait for lock (assume lock is held by another thread)
      4. Mark as queued
      5. Write to disk again
      6. Return to workerpool
      
      Another option originally discussed was to leave the job in the
      “waitlock” status. Ignoring priority changes, this is what would happen:
      1. If not in waitlock
      1.1. Assert state == queued
      1.2. Mark as waitlock
      1.3. Set start_timestamp
      1.4. Write to disk as permanent storage
      3. Wait for locks (assume lock is held by another thread)
      4. Leave in waitlock
      5. Return to workerpool
      
      Now let's assume the lock is released by the other thread:
      […]
      3. Wait for locks and get them
      4. Assert state == waitlock
      5. Set state to running
      6. Set exec_timestamp
      7. Write to disk
      
      This patch implements that approach. It reduces the number of writes
      from two per lock acquisition attempt to two per opcode, plus one per
      priority increase (a priority increase happens after 24 acquisition
      attempts, see mcpu._CalculateLockAttemptTimeouts, until the highest
      priority is reached). Unittests are updated.
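
      To make the difference concrete, here is a small illustrative sketch
      (not the actual jqueue code; all names are made up) that counts the
      writes to permanent storage caused by each behaviour:

        # Illustrative sketch only, not the real jqueue implementation.
        class FakeJob(object):
            def __init__(self):
                self.status = "queued"
                self.writes = 0            # each status change hits the disk

            def SetStatus(self, status):
                self.status = status
                self.writes += 1

        def OldAcquireLoop(job, attempts):
            # Old behaviour: fall back to "queued" after every failed attempt.
            for _ in range(attempts):
                job.SetStatus("waitlock")
                job.SetStatus("queued")    # lock still held elsewhere, re-queue
            job.SetStatus("waitlock")
            job.SetStatus("running")       # finally got the locks

        def NewAcquireLoop(job, attempts):
            # New behaviour: stay in "waitlock" between attempts.
            job.SetStatus("waitlock")
            for _ in range(attempts):
                pass                       # retry without touching the job file
            job.SetStatus("running")

        old, new = FakeJob(), FakeJob()
        OldAcquireLoop(old, 24)
        NewAcquireLoop(new, 24)
        print("%d vs. %d writes" % (old.writes, new.writes))   # 50 vs. 2
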
      Signed-off-by: Michael Hanselmann <hansmi@google.com>
      Reviewed-by: Iustin Pop <iustin@google.com>
  6. 12 Oct, 2010 3 commits
  7. 07 Oct, 2010 1 commit
  8. 24 Sep, 2010 3 commits
  9. 23 Sep, 2010 3 commits
  10. 20 Sep, 2010 3 commits
  11. 16 Sep, 2010 1 commit
  12. 13 Sep, 2010 4 commits
  13. 10 Sep, 2010 3 commits
  14. 07 Sep, 2010 2 commits
  15. 24 Aug, 2010 1 commit
  16. 19 Aug, 2010 1 commit
    • jqueue: Remove lock status field · 9bdab621
      Michael Hanselmann authored

      With the job queue changes for Ganeti 2.2, watched and queried jobs are
      loaded directly from disk, rendering the in-memory “lock_status” field
      useless. Writing it to disk would be possible, but would have a huge
      cost at runtime (in testing, processing 1'000 opcodes involved 4'000
      additional writes to job files, even with replication turned off).
      
      Using an additional in-memory dictionary just to manage this field
      turned out to be complicated because of the necessary locking.

      The plan is to introduce a more generic lock debugging mechanism in
      the near future. Hence the decision is to remove this field now rather
      than spend a lot of time making it work again.
      Signed-off-by: Michael Hanselmann <hansmi@google.com>
      Reviewed-by: Iustin Pop <iustin@google.com>
  17. 18 Aug, 2010 3 commits
  18. 17 Aug, 2010 2 commits
  19. 30 Jul, 2010 1 commit
    • Fix a few job archival issues · aa9f8167
      Iustin Pop authored

      This patch fixes two issues with job archival. First, LoadJobFromDisk
      can return 'None' when no such job exists, and we shouldn't add None to
      the job list; we can't anyway, as doing so raises an exception:
      
        node1# gnt-job archive foo
        Unhandled protocol error while talking to the master daemon:
        Caught exception: cannot create weak reference to 'NoneType' object
      
      With that fixed, archiving a missing job would simply continue
      silently, so gnt-job archive is also changed to log any jobs that were
      not archived and to return exit code 1 if any job was missing.
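
      A rough sketch of that behaviour (the helper names below are
      hypothetical stand-ins, not the actual gnt-job/jqueue code):

        # Hypothetical sketch, not the actual Ganeti code.
        import sys

        def LoadJobFromDisk(job_id):
            # Stand-in: the real function returns a job object, or None when
            # the job file does not exist.
            return None

        def ArchiveJobs(job_ids):
            missing = []
            for job_id in job_ids:
                job = LoadJobFromDisk(job_id)
                if job is None:
                    missing.append(job_id)   # never add None to the job list
                    continue
                # ... archive the job ...
            for job_id in missing:
                print("Job %s not found, not archived" % job_id)
            return 1 if missing else 0       # non-zero exit if anything was missing

        sys.exit(ArchiveJobs(["foo"]))
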
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>
  20. 29 Jul, 2010 2 commits
    • Change handling of non-Ganeti errors in jqueue · 599ee321
      Iustin Pop authored

      Currently, if a job execution raises a Ganeti-specific error (i.e. a
      subclass of GenericError), we encode it as (error class, [error args]).
      This matches the RAPI documentation.

      However, if we get a non-Ganeti error, we encode it simply as str(err),
      a single string. This means the opresult field does not conform to the
      RAPI docs, making it hard to parse job results reliably.

      This patch changes the encoding of a failed job so that the reported
      error is always an OpExecError, and therefore always encoded properly.
      For the command line interface the behaviour is unchanged, as any
      non-Ganeti errors get re-encoded as OpExecError anyway. For RAPI
      clients it only means that results always have the same type. The
      actual error value is the same, since err.args is in either case
      str(original_error); compare the original (which doesn't contain the
      ValueError):
      
        "opresult": [
          "invalid literal for int(): aa"
        ],
      
      with:
      
        "opresult": [
          [
            "OpExecError",
            [
              "invalid literal for int(): aa"
            ]
          ]
        ],
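
      A minimal sketch of the encoding rule, using simplified stand-ins for
      Ganeti's error classes (illustration only, not the real code):

        # Simplified stand-ins for Ganeti's error classes.
        class GenericError(Exception):
            pass

        class OpExecError(GenericError):
            pass

        def EncodeFailure(err):
            # Non-Ganeti errors are re-wrapped so the result always has the
            # documented (error class, [error args]) shape.
            if not isinstance(err, GenericError):
                err = OpExecError(str(err))
            return (err.__class__.__name__, list(err.args))

        print(EncodeFailure(ValueError("invalid literal for int(): aa")))
        # -> ('OpExecError', ['invalid literal for int(): aa'])
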
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>
    • workerpool: Change signature of AddTask function to not use *args · b2e8a4d9
      Michael Hanselmann authored

      By changing it to a normal parameter, which must be a sequence, we can
      start using keyword parameters.

      Before this patch, all arguments to “AddTask(self, *args)” were passed
      on as arguments to the worker's “RunTask” method. Priorities, which
      should be optional and will be implemented in a future patch, must be
      passed as a keyword parameter. This means “*args” can no longer be
      used, as *args and keyword parameters cannot be combined cleanly:
      
      >>> def f(name=None, *args):
      ...   print "%r, %r" % (args, name)
      ...
      >>> f("p1", "p2", "p3", name="thename")
      Traceback (most recent call last):
       File "<stdin>", line 1, in <module>
       TypeError: f() got multiple values for keyword argument 'name'
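
      A simplified sketch of what the new signature allows (illustration
      only; the real WorkerPool internals are omitted, and the priority
      parameter shown here belongs to the later patch mentioned above):

        # Illustration only, not the real workerpool module.
        class WorkerPool(object):
            # Before: AddTask(self, *args) passed everything straight to RunTask.
            # After: the task arguments are a single sequence, so optional
            # keyword parameters such as a priority become possible.
            def AddTask(self, args, priority=0):
                assert isinstance(args, (tuple, list))
                print("queued %r at priority %d" % (tuple(args), priority))

        pool = WorkerPool()
        pool.AddTask(("p1", "p2", "p3"), priority=10)
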
      Signed-off-by: Michael Hanselmann <hansmi@google.com>
      Reviewed-by: Iustin Pop <iustin@google.com>
  21. 16 Jul, 2010 1 commit
    • Implement lock names for debugging purposes · 7f93570a
      Iustin Pop authored

      This patch adds lock names to SharedLocks and LockSets, which can later
      be used to display the actual locks being held/used in places where we
      only have the lock object and not the entire context of the locking
      operation.

      Since I realized that the production code doesn't call LockSet with
      the proper members= syntax, but passes them directly as positional
      parameters, I've converted this (and the arguments to
      GlobalLockManager) into positional arguments.
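
      A rough sketch of the idea (not Ganeti's actual locking classes): the
      lock carries a name that identifies it in debugging output even when
      no other context is available:

        # Rough sketch only, not Ganeti's SharedLock/LockSet implementation.
        import threading

        class SharedLock(object):
            def __init__(self, name):
                self.name = name                  # kept purely for debugging
                self._lock = threading.Lock()

            def acquire(self):
                self._lock.acquire()

            def release(self):
                self._lock.release()

            def __repr__(self):
                return "<SharedLock name=%r>" % self.name

        lock = SharedLock("instance-web1")
        print(repr(lock))   # names the lock even without the calling context
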
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>
  22. 15 Jul, 2010 1 commit
    • jqueue: Factorize code waiting for job changes · 989a8bee
      Michael Hanselmann authored

      By splitting the _WaitForJobChangesHelper class into multiple smaller
      classes, we gain in several ways:

      - Simpler code, less interaction between functions and variables
      - Easy to unittest (close to 100% coverage)
      - Waiting for job changes no longer has direct knowledge of the queue
        (it no longer references queue functions, especially not private ones)
      - Inotify is activated only if there was no change at the beginning,
        with another check right afterwards to avoid race conditions (see the
        sketch below)
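
      The last point follows roughly this pattern (a conceptual sketch with
      a dummy watcher standing in for the real inotify-based helper):

        # Conceptual sketch of the waiting logic; the real code uses inotify.
        import time

        def WaitForJobChange(check_fn, create_watcher_fn, timeout):
            result = check_fn()
            if result is not None:
                return result                # change already happened, no watch
            watcher = create_watcher_fn()    # only now activate the watch
            result = check_fn()              # re-check to close the race window
            if result is not None:
                return result
            return watcher.Wait(timeout)     # block until notified or timeout

        class DummyWatcher(object):
            def Wait(self, timeout):
                time.sleep(min(timeout, 0.01))
                return None                  # no change within the timeout

        print(WaitForJobChange(lambda: None, DummyWatcher, 1.0))
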
      Signed-off-by: Michael Hanselmann <hansmi@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>