Skip to content
Snippets Groups Projects
Commit d385a174 authored by Iustin Pop's avatar Iustin Pop
Browse files

Increase the lock timeouts before we block-acquire


This has been observed to cause problems on real clusters via the
following mechanism:

- a long job (e.g. a replace-disks) is keeping an exclusive lock on an
  instance
- the watcher starts and submits its query instances opcode which
  wants shared locks for all instances
- after about an hour, the watcher job falls back to blocking acquire,
  after having acquired all other locks
- any instance opcode that wants an exclusive lock for an instance
  cannot start until the watcher has finished, even though there's no
  actual operation on that instance

In order to alleviate this problem, we simply increase the max timeout
until lock acquires are sent back to either blocking acquire or
priority increase. The timeout is computed such that we wait ~10 hours
(instead of one) for this to happen, which should be within the
maximum lifetime of a reasonable opcode on a healthy cluster. The
timeout also means that priority increases will happen every half hour.

We also increase the max wait interval to 15 seconds, otherwise we'd
have too many retries with the increased interval.

Signed-off-by: default avatarIustin Pop <iustin@google.com>
Reviewed-by: default avatarMichael Hanselmann <hansmi@google.com>
parent fe295df3
No related branches found
No related tags found
Loading
Loading
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment