    Wait for a while in failed resyncs · fbafd7a8
    Iustin Pop authored
    This patch is an attempt at fixing some very rare occurrences of messages like:
      - "There are some degraded disks for this instance", or:
      - "Cannot resync disks on node node3.example.com: [True, 100]"
    What I believe happens is that drbd has finished syncing, but not all
    fields are updated in 'Connected' state; maybe it's in WFBitmap[ST], or
    in some other transient state we don't handle well.
    The patch will change the _WaitForSync method to recheck up to a
    hardcoded number of times if we're finished syncing but we're degraded
    (using the same condition as the 'break' clause of the loop).
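A minimal sketch of the recheck idea described above, not the actual cmdlib.py code. The names `wait_for_sync`, `poll`, `degraded_retries`, and `delay` are hypothetical, and the retry count is a parameter because the hardcoded value is not stated in this message. `poll` stands in for whatever queries the node and is assumed to return a `(done, degraded)` pair.

```python
import time


def wait_for_sync(poll, degraded_retries=10, delay=1.0, sleep=time.sleep):
    """Wait until poll() reports the sync as done; return True if healthy.

    poll() is a hypothetical callable returning (done, degraded).
    degraded_retries is an assumed default, not the value used in the patch.
    """
    while True:
        done, degraded = poll()
        if done:
            # Sync reports finished but the disks still look degraded:
            # this may be a transient DRBD state, so recheck a bounded
            # number of times before giving up.
            if degraded and degraded_retries > 0:
                degraded_retries -= 1
                sleep(delay)
                continue
            break
        # Sync still in progress: keep polling.
        sleep(delay)
    return not degraded
```

With this shape, a disk that is truly degraded costs at most `degraded_retries * delay` extra seconds before the wait aborts, which is the trade-off the message goes on to describe.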
    The downside of this change is that a genuinely degraded state, due to
    network or disk failure, will incur an extra delay before the wait aborts.
    I'm happy to choose other values for the retry count if that is a concern.
    A better, long-term fix is to handle more DRBD states correctly (see the
    bdev.DRBD8Status class).
    Signed-off-by: Iustin Pop <iustin@google.com>
    Reviewed-by: Guido Trotter <ultrotter@google.com>
cmdlib.py 245 KB