Commit fdbd668d authored by Iustin Pop

Reduce the chance of DRBD errors with stale primaries

This patch is a first step towards reducing the chance of DRBD
activation failures when the primary node does not have up-to-date data.

This issue is seen more often with DRBD 8, which has an 'outdated'
state that devices can enter more easily. But it can (and before this
patch, usually would) happen with both DRBD 7 and 8 whenever the
primary has data to sync.

The error comes from the fact that, before this patch, we activated the
primary DRBD device and immediately (i.e. as soon as we could run
another shell command) tried to make it primary. This could fail, since
the primary knows it has data to catch up on, but we ignored the error
condition. The failure only became visible later, either as md failing
to activate on top of read-only storage or as the instance failing to
start.
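
Schematically, the pre-patch flow amounted to the following per node
(an illustrative sketch built around the patch's own RPC call, not a
verbatim excerpt):

  # old behaviour: assemble and promote in one step, for each node
  is_primary = (node == instance.primary_node)
  result = rpc.call_blockdev_assemble(node, node_disk, instance.name,
                                      is_primary)
  # on the primary this immediately tried to switch the fresh DRBD
  # device to primary; a stale primary could refuse, and the failure
  # was only logged, never reported back to the master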

The patch has two parts. The first affects bdev.py and changes failures
in BlockDev.Open() from returning False to raising
errors.BlockDeviceError. No one (except a generic method inside
bdev.py) checked this return value: we logged it, but the master never
learned about it. Now all classes raise an error from Open() on
failure.
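
The generic assembly code now has to catch the exception and clean up,
in the pattern of (simplified from the first bdev.py hunk below):

  try:
    child.Open()
  except errors.BlockDeviceError:
    for child in self._children:
      child.Shutdown()   # undo the partial activation
    raise                # propagate, so the master finally sees it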

The other part, affecting cmdlib.py, changes the activation sequence
from:
  - activate on the primary node as primary and on the secondary node
    as secondary, in whatever order the node list happens to be returned
to the following:
  - activate all drives as secondaries, on both the primary and the
    secondary nodes of the instance
  - after that, on the primary node, re-activate the device stack as
    primary
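
In simplified form (the real change is in the cmdlib.py hunk below;
error handling omitted):

  # 1st pass: assemble everything as secondary, on all nodes
  for inst_disk in instance.disks:
    for node, node_disk in inst_disk.ComputeNodeTree(instance.primary_node):
      rpc.call_blockdev_assemble(node, node_disk, iname, False)
  # 2nd pass: re-assemble as primary, on the primary node only
  for inst_disk in instance.disks:
    for node, node_disk in inst_disk.ComputeNodeTree(instance.primary_node):
      if node == instance.primary_node:
        rpc.call_blockdev_assemble(node, node_disk, iname, True)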

This is done to give DRBD the chance to connect and complete its
handshake. As noted in the comments, it only increases the chances of a
successful handshake/connect; it does not fix the problem entirely.
However, it is a good first step: it passes all tests of starting with
stale (either fully or partially stale) primaries, with both DRBD 7 and
8, and it also passes a burnin.
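
The proper fix, as the in-code comments note, would be to wait (with
some limit) until the device leaves WFConnection before promoting it.
A rough sketch of such a wait, assuming the /proc/drbd status format
and a hypothetical standalone helper (not part of this patch):

  import time

  def _WaitForDrbdConnect(minor, timeout=15):
    """Poll /proc/drbd until device 'minor' reaches a network-connected
    state; returns False if it is still unconnected at the deadline."""
    deadline = time.time() + timeout
    while time.time() < deadline:
      for line in open("/proc/drbd"):
        line = line.strip()
        if not line.startswith("%d:" % minor):
          continue
        if ("cs:Connected" in line or "cs:SyncTarget" in line or
            "cs:SyncSource" in line):
          return True
      time.sleep(1)
    return False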

Note that the patch might make device activation a little slower, but
this is a reasonable trade-off.

Reviewed-by: imsnah
parent b08d5a87
@@ -122,7 +122,13 @@ class BlockDev(object):
       status = status and child.Assemble()
       if not status:
         break
-      status = status and child.Open()
+      try:
+        child.Open()
+      except errors.BlockDeviceError:
+        for child in self._children:
+          child.Shutdown()
+        raise
+
     if not status:
       for child in self._children:
         child.Shutdown()
@@ -502,7 +508,7 @@ class LogicalVolume(BlockDev):
     This is a no-op for the LV device type.

     """
-    return True
+    pass

   def Close(self):
     """Notifies that the device will no longer be used for I/O.
@@ -510,7 +516,7 @@ class LogicalVolume(BlockDev):
     This is a no-op for the LV device type.

     """
-    return True
+    pass

   def Snapshot(self, size):
     """Create a snapshot copy of an lvm block device.
@@ -954,7 +960,7 @@ class MDRaid1(BlockDev):
     the 2.6.18's new array_state thing.

     """
-    return True
+    pass

   def Close(self):
     """Notifies that the device will no longer be used for I/O.
@@ -963,7 +969,7 @@ class MDRaid1(BlockDev):
     `Open()`.

     """
-    return True
+    pass


 class BaseDRBD(BlockDev):
@@ -1456,9 +1462,9 @@ class DRBDev(BaseDRBD):
       cmd.append("--do-what-I-say")
     result = utils.RunCmd(cmd)
     if result.failed:
-      logger.Error("Can't make drbd device primary: %s" % result.output)
-      return False
-    return True
+      msg = ("Can't make drbd device primary: %s" % result.output)
+      logger.Error(msg)
+      raise errors.BlockDeviceError(msg)

   def Close(self):
     """Make the local state secondary.
@@ -1471,8 +1477,10 @@ class DRBDev(BaseDRBD):
       raise errors.BlockDeviceError("Can't find device")
     result = utils.RunCmd(["drbdsetup", self.dev_path, "secondary"])
     if result.failed:
-      logger.Error("Can't switch drbd device to secondary: %s" % result.output)
-      raise errors.BlockDeviceError("Can't switch drbd device to secondary")
+      msg = ("Can't switch drbd device to"
+             " secondary: %s" % result.output)
+      logger.Error(msg)
+      raise errors.BlockDeviceError(msg)

   def SetSyncSpeed(self, kbytes):
     """Set the speed of the DRBD syncer.
@@ -2068,9 +2076,9 @@ class DRBD8(BaseDRBD):
       cmd.append("-o")
     result = utils.RunCmd(cmd)
     if result.failed:
-      logger.Error("Can't make drbd device primary: %s" % result.output)
-      return False
-    return True
+      msg = ("Can't make drbd device primary: %s" % result.output)
+      logger.Error(msg)
+      raise errors.BlockDeviceError(msg)

   def Close(self):
     """Make the local state secondary.
@@ -2083,8 +2091,10 @@ class DRBD8(BaseDRBD):
       raise errors.BlockDeviceError("Can't find device")
     result = utils.RunCmd(["drbdsetup", self.dev_path, "secondary"])
     if result.failed:
-      logger.Error("Can't switch drbd device to secondary: %s" % result.output)
-      raise errors.BlockDeviceError("Can't switch drbd device to secondary")
+      msg = ("Can't switch drbd device to"
+             " secondary: %s" % result.output)
+      logger.Error(msg)
+      raise errors.BlockDeviceError(msg)

   def Attach(self):
     """Find a DRBD device which matches our config and attach to it.
@@ -1860,23 +1860,41 @@ def _AssembleInstanceDisks(instance, cfg, ignore_secondaries=False):
   """
   device_info = []
   disks_ok = True
+  iname = instance.name
+  # With the two passes mechanism we try to reduce the window of
+  # opportunity for the race condition of switching DRBD to primary
+  # before handshaking occured, but we do not eliminate it
+
+  # The proper fix would be to wait (with some limits) until the
+  # connection has been made and drbd transitions from WFConnection
+  # into any other network-connected state (Connected, SyncTarget,
+  # SyncSource, etc.)
+
+  # 1st pass, assemble on all nodes in secondary mode
   for inst_disk in instance.disks:
-    master_result = None
     for node, node_disk in inst_disk.ComputeNodeTree(instance.primary_node):
       cfg.SetDiskID(node_disk, node)
-      is_primary = node == instance.primary_node
-      result = rpc.call_blockdev_assemble(node, node_disk,
-                                          instance.name, is_primary)
+      result = rpc.call_blockdev_assemble(node, node_disk, iname, False)
       if not result:
         logger.Error("could not prepare block device %s on node %s"
-                     " (is_primary=%s)" %
-                     (inst_disk.iv_name, node, is_primary))
-        if is_primary or not ignore_secondaries:
+                     " (is_primary=False, pass=1)" % (inst_disk.iv_name, node))
+        if not ignore_secondaries:
           disks_ok = False
-      if is_primary:
-        master_result = result
-    device_info.append((instance.primary_node, inst_disk.iv_name,
-                        master_result))
+
+  # FIXME: race condition on drbd migration to primary
+
+  # 2nd pass, do only the primary node
+  for inst_disk in instance.disks:
+    for node, node_disk in inst_disk.ComputeNodeTree(instance.primary_node):
+      if node != instance.primary_node:
+        continue
+      cfg.SetDiskID(node_disk, node)
+      result = rpc.call_blockdev_assemble(node, node_disk, iname, True)
+      if not result:
+        logger.Error("could not prepare block device %s on node %s"
+                     " (is_primary=True, pass=2)" % (inst_disk.iv_name, node))
+        disks_ok = False
+    device_info.append((instance.primary_node, inst_disk.iv_name, result))

   # leave the disks configured for the primary node
   # this is a workaround that would be fixed better by