    jqueue: Keep jobs in “waitlock” while returning to queue · 5fd6b694
    Michael Hanselmann authored
    
    
    Iustin Pop reported that a job's file is updated many times while the
    job waits for locks held by other thread(s). An investigation
    concluded that the cause was a design decision made for job
    priorities: jobs were returned to the “queued” status if they couldn't
    acquire all locks. Changing a job's status or priority requires an
    update to permanent storage.
    
    In a high-level view this is what happens:
    1. Mark as waitlock
    2. Write to disk as permanent storage (jobs left in this state by a
       crashing master daemon are resumed on restart)
    3. Wait for lock (assume lock is held by another thread)
    4. Mark as queued
    5. Write to disk again
    6. Return to workerpool
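
    The old flow above can be sketched in a few lines of Python. All names
    here (Job, old_lock_attempt, write_to_disk) are illustrative stand-ins,
    not Ganeti's actual jqueue API; the point is that every failed lock
    attempt costs two disk writes:

    ```python
    # Hypothetical sketch of the old per-attempt flow; names are
    # illustrative, not Ganeti's real code.
    QUEUED, WAITLOCK = "queued", "waitlock"

    class Job:
        def __init__(self):
            self.status = QUEUED
            self.writes = 0  # count disk updates for illustration

        def write_to_disk(self):
            self.writes += 1  # stands in for serializing the job file

    def old_lock_attempt(job, acquire_locks):
        job.status = WAITLOCK
        job.write_to_disk()        # write #1: persist "waitlock"
        if not acquire_locks():    # lock held by another thread
            job.status = QUEUED
            job.write_to_disk()    # write #2: persist "queued" again
            return False           # job goes back to the worker pool
        return True

    job = Job()
    old_lock_attempt(job, lambda: False)  # one failed attempt
    print(job.writes)  # -> 2
    ```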
    
    Another option originally discussed was to leave the job in the
    “waitlock” status. Ignoring priority changes, this is what would
    happen (the step numbers parallel the list above):
    1. If not in waitlock
    1.1. Assert state == queued
    1.2. Mark as waitlock
    1.3. Set start_timestamp
    1.4. Write to disk as permanent storage
    3. Wait for locks (assume lock is held by another thread)
    4. Leave in waitlock
    5. Return to workerpool
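
    A minimal sketch of this failed-acquisition path, again with
    hypothetical names rather than Ganeti's real code: only the first
    attempt writes to disk, repeated attempts cause no further writes.

    ```python
    # Hypothetical sketch; names are illustrative.
    QUEUED, WAITLOCK = "queued", "waitlock"

    class Job:
        def __init__(self):
            self.status = QUEUED
            self.writes = 0

        def write_to_disk(self):
            self.writes += 1

    def new_lock_attempt(job, acquire_locks):
        if job.status != WAITLOCK:
            assert job.status == QUEUED
            job.status = WAITLOCK
            # start_timestamp would be set here
            job.write_to_disk()    # only written on the first attempt
        if not acquire_locks():
            return False           # stay in "waitlock", no extra write
        return True

    job = Job()
    for _ in range(3):             # three failed attempts
        new_lock_attempt(job, lambda: False)
    print(job.writes)  # -> 1
    ```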
    
    Now let's assume the lock is released by the other thread:
    […]
    3. Wait for locks and get them
    4. Assert state == waitlock
    5. Set state to running
    6. Set exec_timestamp
    7. Write to disk
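
    Both scenarios can be combined into one attempt function; this is an
    illustrative sketch, not Ganeti's actual implementation. A failed
    attempt writes at most once (on first entering “waitlock”), and a
    successful attempt writes once more when the job moves to “running”:

    ```python
    # Hypothetical combined sketch of the new flow; names are illustrative.
    QUEUED, WAITLOCK, RUNNING = "queued", "waitlock", "running"

    class Job:
        def __init__(self):
            self.status = QUEUED
            self.writes = 0

        def write_to_disk(self):
            self.writes += 1

    def attempt(job, acquire_locks):
        if job.status != WAITLOCK:
            assert job.status == QUEUED
            job.status = WAITLOCK
            job.write_to_disk()        # persist "waitlock" once
        if not acquire_locks():
            return False               # left in "waitlock"
        assert job.status == WAITLOCK
        job.status = RUNNING
        # exec_timestamp would be set here
        job.write_to_disk()            # persist "running"
        return True

    job = Job()
    attempt(job, lambda: False)    # other thread still holds the lock
    attempt(job, lambda: True)     # lock has been released
    print(job.status, job.writes)  # -> running 2
    ```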
    
    This change reduces the number of writes from two per lock acquisition
    attempt to two per opcode, plus one per priority increase (a job's
    priority is increased after 24 unsuccessful acquisition attempts, see
    mcpu._CalculateLockAttemptTimeouts, until the highest priority is
    reached). Here's the patch to implement it. Unit tests are updated.
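
    The savings can be illustrated with a small calculation. Assuming 24
    acquisition attempts per priority step (per
    mcpu._CalculateLockAttemptTimeouts) and, as an example value, 3
    priority increases before the locks are finally obtained:

    ```python
    # Back-of-the-envelope comparison of disk writes; the number of
    # priority increases is an assumed example value.
    ATTEMPTS_PER_PRIORITY = 24   # see mcpu._CalculateLockAttemptTimeouts
    priority_increases = 3       # hypothetical example

    attempts = ATTEMPTS_PER_PRIORITY * (priority_increases + 1)

    old_writes = 2 * attempts             # two writes per lock attempt
    new_writes = 2 + priority_increases   # two per opcode, one per increase

    print(old_writes, new_writes)  # -> 192 5
    ```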
    Signed-off-by: Michael Hanselmann <hansmi@google.com>
    Reviewed-by: Iustin Pop <iustin@google.com>