Apollon Oikonomopoulos authored
DrbdAttachNet supports both normal primary/secondary node operation and
(during live migration) dual-primary operation. When resources are newly
attached, we poll until we find all of them in connected or syncing
operation. Although aggressive, this is enough for primary/secondary
operation, because the primary/secondary role is not changed from within
DrbdAttachNet.

However, in the dual-primary ("multimaster") case, both peers are
subsequently upgraded to the primary role. If, for unspecified reasons,
the disks are not both UpToDate, then a resync may be triggered after
both peers have switched to primary, causing the resource to disconnect:

  kernel: [1465514.164009] block drbd2: I shall become SyncTarget, but I am primary!
  kernel: [1465514.171562] block drbd2: ASSERT( os.conn == C_WF_REPORT_PARAMS ) in /build/linux-rrsxby/linux-3.2.51/drivers/block/drbd/drbd_receiver.c:3245

This seems to be extremely racy and is possibly triggered by some
underlying network issues (e.g. high latency), but it has been observed
in the wild. By logging the DRBD resource state on the old secondary, we
managed to see a resource getting promoted to primary while it was in
the following state:

  WFSyncUUID Secondary/Primary Outdated/UpToDate

We fix this by explicitly waiting for the "Connected" cstate and
"UpToDate/UpToDate" disks, as advised in [1]:

  "For this purpose and scenario, you only want to promote once you are
  Connected UpToDate/UpToDate."

[1] http://lists.linbit.com/pipermail/drbd-user/2013-July/020173.html

Signed-off-by: Apollon Oikonomopoulos <apoikos@gmail.com>
Signed-off-by: Michele Tartara <mtartara@google.com>
Reviewed-by: Michele Tartara <mtartara@google.com>
Reviewed-by: Klaus Aehlig <aehlig@google.com>
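For illustration only, the wait condition described above can be sketched
roughly as follows. This is not the code from this change: the names
parse_status, safe_to_promote and wait_until_safe are hypothetical, and
the sketch assumes the resource state is available as a /proc/drbd-style
status line containing "cs:", "ro:" and "ds:" fields.

```python
import re
import time

# Hypothetical helpers, not Ganeti's actual API. Assumes a /proc/drbd-style
# status line such as:
#   " 2: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----"
_STATUS_RE = re.compile(
    r"cs:(?P<cstate>\S+)\s+"
    r"ro:(?P<local_role>\w+)/(?P<remote_role>\w+)\s+"
    r"ds:(?P<local_disk>\w+)/(?P<remote_disk>\w+)"
)


def parse_status(line):
    """Extract connection state, roles and disk states from a status line."""
    match = _STATUS_RE.search(line)
    if match is None:
        raise ValueError("unrecognised DRBD status line: %r" % line)
    return match.groupdict()


def safe_to_promote(status):
    """True only for Connected with UpToDate/UpToDate disks."""
    return (
        status["cstate"] == "Connected"
        and status["local_disk"] == "UpToDate"
        and status["remote_disk"] == "UpToDate"
    )


def wait_until_safe(read_status, timeout=60.0, interval=1.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll read_status() until promotion is safe or the timeout expires.

    read_status is a caller-supplied callable returning one status line.
    Returns True if the safe state was seen, False on timeout.
    """
    deadline = clock() + timeout
    while True:
        if safe_to_promote(parse_status(read_status())):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

With this check, the state observed on the old secondary (WFSyncUUID,
Outdated/UpToDate) would not be treated as safe for promotion, whereas
the looser "connected or syncing" poll lets it through.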
73e15b5e