    Reduce the chance of DRBD errors with stale primaries · fdbd668d
    Iustin Pop authored
    This patch is a first step in reducing the chance of DRBD activation
    failures when the primary node has stale or incomplete data.
    The issue is seen more often with DRBD8, which has an 'outdate' state
    (and devices end up in it more often). But it can (and, before this
    patch, usually will) happen with both 7 and 8 whenever the primary has
    data to sync.
    The error comes from the fact that, before this patch, we activate the
    primary DRBD device and immediately (i.e. as soon as we can run another
    shell command) we try to make it primary. This might fail - since the
    primary knows it has some data to catch up to - but we ignored this
    error condition. The failure only became visible later, either as md
    failing to activate over read-only storage or as the instance failing
    to start.
    The patch has two parts. The first, affecting bdev.py, changes failures
    in BlockDev.Open() from returning False to raising
    errors.BlockDeviceError. No one (except a generic method inside bdev.py)
    checked this return value, so the failure was logged but the master
    never learned of it. Now all classes raise errors from Open() if they
    fail.
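    As a rough illustration of the bdev.py change (not the actual Ganeti
    code), the sketch below shows an Open() that raises instead of
    returning False. The names FakeDrbdDevice and open_or_report and the
    can_become_primary flag are hypothetical; BlockDeviceError here is a
    local stand-in for errors.BlockDeviceError.

```python
class BlockDeviceError(Exception):
    """Local stand-in for errors.BlockDeviceError (assumption)."""


class FakeDrbdDevice:
    """Hypothetical device used only for illustration."""

    def __init__(self, can_become_primary):
        # Simulates whether switching to primary would succeed.
        self._can_become_primary = can_become_primary
        self.is_open = False

    def Open(self):
        # Previously a failure would have been signalled by returning
        # False, which callers did not check; now it raises.
        if not self._can_become_primary:
            raise BlockDeviceError("cannot make device primary")
        self.is_open = True


def open_or_report(device):
    """Caller-side handling: the failure can no longer be ignored silently."""
    try:
        device.Open()
    except BlockDeviceError as err:
        return "failed: %s" % err
    return "ok"
```

    With an exception, a caller that forgets to handle the failure gets a
    traceback instead of a silently read-only device.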
    The other part, affecting cmdlib.py, changes the activation sequence from:
      - activate on primary node as primary and secondary as secondary, in
        whatever order a function returns the nodes
    to the following:
      - activate all drives as secondaries, on both the primary and the
        secondary nodes of the instance
      - after that, on the primary node, re-activate the device stack as
        primary
    This gives DRBD a chance to connect and complete the handshake. As
    noted in the comments, this only increases the chances of a
    handshake/connect; it does not fix the problem entirely. However, it is a
    good first step and it passes all tests of starting with stale (either
    full or partial) primaries, with both drbd 7 and 8, and also passes a
    Note that the patch might make the device activation a little bit
    slower, but it is a reasonable trade-off.
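    The two-pass activation order described above can be sketched roughly
    as follows. This is not the cmdlib.py code: assemble_disks and the
    assemble(node, disk, as_primary) callback are hypothetical names, and
    the callback is assumed to return True on success.

```python
def assemble_disks(disks, primary_node, secondary_nodes, assemble):
    """Activate disks in two passes; return True if every step succeeded.

    `assemble` is a hypothetical callback taking (node, disk, as_primary).
    """
    ok = True

    # Pass 1: bring every disk up as secondary on all nodes, including
    # the instance's primary node, so DRBD gets a chance to connect.
    for disk in disks:
        for node in [primary_node] + list(secondary_nodes):
            if not assemble(node, disk, False):
                ok = False

    # Pass 2: only on the primary node, re-activate each disk as primary.
    for disk in disks:
        if not assemble(primary_node, disk, True):
            ok = False

    return ok
```

    The extra pass is where the slightly slower activation mentioned above
    comes from: each disk is assembled twice on the primary node.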
    Reviewed-by: imsnah