Wait for a while in failed resyncs

This patch is an attempt at fixing some very rare occurrences of
messages like:

  - "There are some degraded disks for this instance", or:
  - "Cannot resync disks on node node3.example.com: [True, 100]"

What I believe happens is that drbd has finished syncing, but not all
fields are updated in 'Connected' state; maybe it's in WFBitmap[ST], or
in some other transient state we don't handle well.

The patch will change the _WaitForSync method to recheck up to a
hardcoded number of times if we're finished syncing but we're degraded
(using the same condition as the 'break' clause of the loop).

The downside of this change is that a normal, really-degraded state due
to network or disk failure will cause an extra delay before it aborts.
For this, I'm happy to choose other values.

A better, long-term fix is to handle more DRBD states correctly (see the
bdev.DRBD8Status class).

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
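
The recheck described above can be illustrated with a minimal sketch. This is not the actual _WaitForSync code: the function name wait_for_sync, the poll_status callback, and the degraded_retries and retry_delay parameters are all hypothetical, chosen only to show the idea of retrying a bounded number of times when the sync reports "done" while the disks still look degraded.

```python
import time


def wait_for_sync(poll_status, degraded_retries=10, retry_delay=1.0,
                  sleep=time.sleep):
    """Wait until disks are synced; return True if they end up healthy.

    poll_status is a hypothetical callback returning (done, degraded).
    All names and default values here are assumptions for illustration.
    """
    retries_left = degraded_retries
    while True:
        done, degraded = poll_status()
        if done and not degraded:
            # Sync finished and the disks are healthy.
            return True
        if done and degraded:
            # Sync claims to be finished but the disks look degraded.
            # This may be a transient state, so recheck a bounded
            # number of times before giving up.
            if retries_left <= 0:
                return False
            retries_left -= 1
        # Either still syncing or rechecking a degraded result.
        sleep(retry_delay)
```

With these assumed defaults, a genuinely degraded disk adds roughly degraded_retries * retry_delay seconds before the failure is reported, which is the extra delay mentioned above.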