    Increase the lock timeouts before we block-acquire · d385a174
    Iustin Pop authored
    
    
    This has been observed to cause problems on real clusters via the
    following mechanism:
    
    - a long job (e.g. a replace-disks) is keeping an exclusive lock on an
      instance
    - the watcher starts and submits its query instances opcode which
      wants shared locks for all instances
    - after about an hour, the watcher job falls back to blocking acquire,
      after having acquired all other locks
    - any instance opcode that wants an exclusive lock for an instance
      cannot start until the watcher has finished, even though there's no
      actual operation on that instance
    
    To alleviate this problem, we simply increase the maximum timeout
    before a lock acquire falls back to blocking acquire or a priority
    increase. The timeout is computed such that we wait ~10 hours
    (instead of one) before this happens, which should be within the
    maximum lifetime of a reasonable opcode on a healthy cluster. The
    timeout also means that priority increases will happen every half hour.
    
    We also increase the max wait interval to 15 seconds, otherwise we'd
    have too many retries with the increased interval.
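    The schedule described above can be sketched roughly as follows. This
    is an illustrative model only: the function and constant names, the
    starting timeout, and the growth factor are assumptions for the sketch,
    not the actual values from Ganeti's constants.py.

```python
# Sketch of a lock-acquire timeout schedule: per-attempt waits grow
# geometrically up to a 15-second cap, and the blocking-acquire fallback
# only triggers once a ~10 hour total budget is exhausted.
# All names and numbers here are illustrative assumptions.

MAX_WAIT = 15.0           # per-attempt cap (seconds), per the commit message
MAX_TOTAL = 10 * 3600.0   # ~10 hour budget before falling back to blocking


def timeout_schedule(start=1.0, factor=1.05, max_wait=MAX_WAIT):
    """Yield per-attempt wait times, growing geometrically up to max_wait."""
    wait = start
    while True:
        yield min(wait, max_wait)
        wait *= factor


def attempts_until_blocking(max_total=MAX_TOTAL):
    """Count retries and total wait time before the budget is used up."""
    total = 0.0
    attempts = 0
    for wait in timeout_schedule():
        if total + wait > max_total:
            break
        total += wait
        attempts += 1
    return attempts, total
```

    With a 15-second cap, the ~10 hour budget still allows on the order of
    a couple of thousand retries, which is why the commit also raises the
    maximum wait interval: a smaller cap would mean far more retries over
    the longer window.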

    Signed-off-by: Iustin Pop <iustin@google.com>
    Reviewed-by: Michael Hanselmann <hansmi@google.com>