  1. Dec 29, 2010
  2. Dec 15, 2010
    • jqueue: Keep jobs in “waitlock” while returning to queue · 5fd6b694
      Michael Hanselmann authored
      
      Iustin Pop reported that a job's file is updated many times while it
      waits for locks held by other thread(s). After an investigation it was
      concluded that the reason was a design decision for job priorities to
      return jobs to the “queued” status if they couldn't acquire all locks.
      Changing a job's status or priority requires an update to permanent
      storage.
      
      In a high-level view this is what happens:
      1. Mark as waitlock
      2. Write to disk as permanent storage (jobs left in this state by a
         crashing master daemon are resumed on restart)
      3. Wait for lock (assume lock is held by another thread)
      4. Mark as queued
      5. Write to disk again
      6. Return to workerpool
      
      Another option originally discussed was to leave the job in the
      “waitlock” status. Ignoring priority changes, this is what would happen:
      1. If not in waitlock:
         1.1. Assert state == queued
         1.2. Mark as waitlock
         1.3. Set start_timestamp
         1.4. Write to disk as permanent storage
      2. Wait for locks (assume lock is held by another thread)
      3. Leave in waitlock
      4. Return to workerpool
      
      Now let's assume the lock is released by the other thread:
      […]
      3. Wait for locks and get them
      4. Assert state == waitlock
      5. Set state to running
      6. Set exec_timestamp
      7. Write to disk
      
      This change reduces the number of writes from two per lock-acquire
      attempt to two per opcode, plus one per priority increase (priority
      increases happen after 24 acquire attempts, see
      mcpu._CalculateLockAttemptTimeouts, until the highest priority is
      reached); here's the patch to implement it. Unittests are updated.
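The reduced-write flow described above can be sketched roughly as follows. This is a minimal illustration only; the Job class, status strings, and write counter are hypothetical stand-ins, not Ganeti's actual jqueue code:

```python
# Minimal sketch of the reduced-write flow described above.
# All names (Job, _write_to_disk, ...) are hypothetical stand-ins.

WAITLOCK, QUEUED, RUNNING = "waitlock", "queued", "running"

class Job:
    def __init__(self):
        self.status = QUEUED
        self.writes = 0  # counts permanent-storage updates

    def _write_to_disk(self):
        self.writes += 1  # stand-in for serializing the job file

    def try_locks(self, acquired):
        # Mark as waitlock (and write) only on the first attempt
        if self.status != WAITLOCK:
            assert self.status == QUEUED
            self.status = WAITLOCK
            self._write_to_disk()
        if not acquired:
            # Lock held elsewhere: stay in waitlock, return to workerpool
            # without touching the disk again
            return False
        # Locks acquired: transition to running and persist it
        assert self.status == WAITLOCK
        self.status = RUNNING
        self._write_to_disk()
        return True

job = Job()
job.try_locks(False)  # first attempt fails: one write (queued -> waitlock)
job.try_locks(False)  # retry fails: no additional write
job.try_locks(True)   # locks acquired: one write (waitlock -> running)
print(job.writes)     # 2
```

Under this scheme a job that fails many lock-acquire attempts still writes only twice per opcode, which is the saving the commit message describes.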
      
      Signed-off-by: Michael Hanselmann <hansmi@google.com>
      Reviewed-by: Iustin Pop <iustin@google.com>
  3. Oct 12, 2010
  4. Oct 07, 2010
  5. Sep 24, 2010
  6. Sep 23, 2010
  7. Sep 20, 2010
  8. Sep 16, 2010
  9. Sep 13, 2010
  10. Sep 10, 2010
  11. Sep 07, 2010
  12. Aug 24, 2010
  13. Aug 19, 2010
    • jqueue: Remove lock status field · 9bdab621
      Michael Hanselmann authored
      
      With the job queue changes for Ganeti 2.2, watched and queried jobs are
      loaded directly from disk, rendering the in-memory “lock_status” field
      useless. Writing it to disk would be possible, but has a huge cost at
      runtime (when tested, processing 1,000 opcodes involved 4,000 additional
      writes to job files, even with replication turned off).
      
      Using an additional in-memory dictionary to just manage this field turned
      out to be a complicated task due to the necessary locking.
      
      The plan is to introduce a more generic lock debugging mechanism in the
      near future. Hence the decision is to remove this field now instead of
      spending a lot of time making it work again.
      
      Signed-off-by: Michael Hanselmann <hansmi@google.com>
      Reviewed-by: Iustin Pop <iustin@google.com>
  14. Aug 18, 2010
  15. Aug 17, 2010
  16. Jul 30, 2010
    • Fix a few job archival issues · aa9f8167
      Iustin Pop authored
      
      This patch fixes two issues with job archival. First, LoadJobFromDisk
      can return None for a missing job, and we shouldn't add None to the
      job list; we can't anyway, as doing so raises an exception:
      
        node1# gnt-job archive foo
        Unhandled protocol error while talking to the master daemon:
        Caught exception: cannot create weak reference to 'NoneType' object
      
      After fixing this, job archival of missing jobs will just continue
      silently, so we modify gnt-job archive to log jobs which were not
      archived and to return exit code 1 for any missing jobs.
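The fix can be sketched like this; the function and variable names below are illustrative only, not Ganeti's actual API:

```python
# Sketch of the fix: skip missing jobs instead of appending None, and
# report them so the CLI can exit non-zero. Names are illustrative.

def archive_jobs(job_ids, load_job_from_disk):
    jobs = []
    missing = []
    for job_id in job_ids:
        job = load_job_from_disk(job_id)
        if job is None:            # no-such-job: don't add None to the list
            missing.append(job_id)
            continue
        jobs.append(job)
    # gnt-job archive would log the missing IDs and exit with status 1
    exit_code = 1 if missing else 0
    return jobs, missing, exit_code

store = {"1": "job-one"}
jobs, missing, rc = archive_jobs(["1", "2"], store.get)
print(jobs, missing, rc)  # ['job-one'] ['2'] 1
```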
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>
  17. Jul 29, 2010
    • Change handling of non-Ganeti errors in jqueue · 599ee321
      Iustin Pop authored
      
      Currently, if a job execution raises a Ganeti-specific error (i.e.
      subclass of GenericError), then we encode it as (error class, [error
      args]). This matches the RAPI documentation.
      
      However, if we get a non-Ganeti error, then we encode it as simply
      str(err), a single string. This means that the opresult field is not
      according to the RAPI docs, and thus it's hard to reliably parse the
      job results.
      
      This patch changes the encoding of a failed job (via failure) to always
      be an OpExecError, so that we always encode it properly. For the command
      line interface, the behaviour is the same, as any non-Ganeti errors get
      re-encoded as OpExecError anyway. For the RAPI clients, it only means
      that we always present the same type for results. The actual error value
      is the same, since err.args is str(original_error) either way; compare
      the original (which doesn't name the ValueError class):
      
        "opresult": [
          "invalid literal for int(): aa"
        ],
      
      with:
      
        "opresult": [
          [
            "OpExecError",
            [
              "invalid literal for int(): aa"
            ]
          ]
        ],
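The encoding rule can be sketched as follows; the two error classes here are simplified stand-ins for Ganeti's errors module, not its real definitions:

```python
# Sketch of the rule described above: Ganeti errors keep their class,
# everything else is re-wrapped as OpExecError so opresult always has
# the (class, [args]) shape. Simplified stand-in classes, not Ganeti's.

class GenericError(Exception):
    pass

class OpExecError(GenericError):
    pass

def encode_error(err):
    # Always encode as (class name, [args]) per the RAPI documentation
    if not isinstance(err, GenericError):
        err = OpExecError(str(err))  # wrap non-Ganeti errors
    return (err.__class__.__name__, list(err.args))

print(encode_error(ValueError("invalid literal for int(): aa")))
# ('OpExecError', ['invalid literal for int(): aa'])
```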
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>
    • workerpool: Change signature of AddTask function to not use *args · b2e8a4d9
      Michael Hanselmann authored
      
      By changing it to a normal parameter, which must be a sequence, we can
      start using keyword parameters.
      
      Before this patch, all arguments to “AddTask(self, *args)” were passed
      as arguments to the worker's “RunTask” method. Priorities, which should
      be optional and will be implemented in a future patch, must be passed
      as a keyword parameter. This means “*args” can no longer be used, since
      one can't combine *args and keyword parameters in a clean way:
      
      >>> def f(name=None, *args):
      ...   print "%r, %r" % (args, name)
      ...
      >>> f("p1", "p2", "p3", name="thename")
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
      TypeError: f() got multiple values for keyword argument 'name'
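The signature change itself can be sketched like this; WorkerPool internals are elided and the names are illustrative only:

```python
# Sketch of the change: args becomes a normal sequence parameter, so
# keyword parameters such as a priority can coexist with it.
# WorkerPool internals are elided; names are illustrative.

def add_task(args, priority=None):
    # args is a single sequence, forwarded to the worker's RunTask;
    # the priority travels separately as a keyword parameter
    return tuple(args), priority

print(add_task(("p1", "p2", "p3"), priority=10))
# (('p1', 'p2', 'p3'), 10)
```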
      
      Signed-off-by: Michael Hanselmann <hansmi@google.com>
      Reviewed-by: Iustin Pop <iustin@google.com>
  18. Jul 16, 2010
    • Implement lock names for debugging purposes · 7f93570a
      Iustin Pop authored
      
      This patch adds lock names to SharedLocks and LockSets, that can be used
      later for displaying the actual locks being held/used in places where we
      only have the lock, and not the entire context of the locking operation.
      
      Since I realized that the production code doesn't call LockSet with the
      proper members= syntax, but directly as positional parameters, I've
      converted this (and the arguments to GlobalLockManager) into positional
      arguments.
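A name attribute on a lock class might look like the following; this SharedLock is a simplified stand-in for illustration, not Ganeti's implementation:

```python
# Sketch of a named lock for debugging, as described above.
# Simplified stand-in, not Ganeti's SharedLock.

import threading

class SharedLock:
    def __init__(self, name):
        self.name = name              # used only for debugging output
        self._lock = threading.Lock()

    def acquire(self):
        self._lock.acquire()

    def release(self):
        self._lock.release()

    def __repr__(self):
        # The name lets debugging code show *which* lock is held when
        # only the lock object is available
        return "<SharedLock name=%r>" % self.name

lock = SharedLock("cluster-config")
lock.acquire()
print(repr(lock))  # <SharedLock name='cluster-config'>
lock.release()
```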
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>
  19. Jul 15, 2010
    • jqueue: Factorize code waiting for job changes · 989a8bee
      Michael Hanselmann authored
      
      By splitting the _WaitForJobChangesHelper class into multiple smaller
      classes, we gain in several places:
      
      - Simpler code, less interaction between functions and variables
      - Easier to unittest (close to 100% coverage)
      - Waiting for job changes no longer has direct knowledge of the queue
        (it no longer references queue functions, especially not private ones)
      - inotify is activated only if there was no change at the beginning
        (and the check is repeated right away to avoid race conditions)
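The last point, checking before and immediately after activating inotify, can be sketched as follows; a plain object stands in for the real inotify watcher, and all names are illustrative:

```python
# Sketch of the "check first, then activate the watcher" ordering
# described above. FakeWatcher stands in for real inotify; names are
# illustrative only.

class FakeWatcher:
    def wait(self):
        return "change-via-watcher"

def wait_for_change(check, activate_watcher):
    # Cheap check first: if a change already happened, inotify is
    # never set up at all
    result = check()
    if result is not None:
        return result
    watcher = activate_watcher()
    # Check again right away: a change may have slipped in between the
    # first check and the watcher becoming active (race condition)
    result = check()
    if result is not None:
        return result
    return watcher.wait()

# No change at first, so the watcher is activated and waited on:
print(wait_for_change(lambda: None, FakeWatcher))
# An immediate change short-circuits before the watcher is set up:
print(wait_for_change(lambda: "already-changed", FakeWatcher))
```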
      
      Signed-off-by: Michael Hanselmann <hansmi@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
  20. Jul 12, 2010
  21. Jul 09, 2010
  22. Jul 06, 2010
    • Fix opcode transition from WAITLOCK to RUNNING · 271daef8
      Iustin Pop authored
      
      With the recent changes in the job queue, an old bug surfaced: we never
      serialized the status change when in NotifyStart, thus a crash of the
      master would have left the job queue oblivious to the fact that the job
      was actually running.
      
      In the previous implementation, queries against the job status used
      the in-memory object, so they correctly saw and reported the running
      status. But the new implementation just looks at the on-disk version,
      and thus didn't see this transition.
      
      The patch also moves NotifyStart to a decorator-based version (like the
      other functions), which generates a lot of churn in the diff, sorry.
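The idea, persisting the status change as part of NotifyStart, can be sketched with a decorator like this; the class, the decorator, and the writes list are illustrative stand-ins for serializing the job file, not Ganeti's actual code:

```python
# Sketch of the decorator-based transition described above: the status
# change is written to disk inside NotifyStart, so a master crash can
# no longer leave the on-disk queue unaware that the job is running.
# All names are illustrative stand-ins.

writes = []

def serialize_after(fn):
    # Decorator: run the state change, then persist the job
    def wrapper(self, *args):
        result = fn(self, *args)
        writes.append(self.status)   # stand-in for writing the job file
        return result
    return wrapper

class Job:
    def __init__(self):
        self.status = "waitlock"

    @serialize_after
    def NotifyStart(self):
        assert self.status == "waitlock"
        self.status = "running"

job = Job()
job.NotifyStart()
print(job.status, writes)  # running ['running']
```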
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>