Commit 425f0f54 authored by Iustin Pop

Add a delay in master failover

I have seen rare errors where, it seems, the master IP address is
still live for a short while after it is removed from the old master, so
the new master fails during startup when adding its own IP address.

To guard against this, we add a delay/retry before proceeding if the
IP is still reachable.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
parent ed14ed48
@@ -560,7 +560,19 @@ def MasterFailover(no_voting=False):
    logging.error("Could not disable the master role on the old master"
                  " %s, please disable manually: %s", old_master, msg)
  master_ip = sstore.GetMasterIP()
  total_timeout = 30
  # Here we have a phase where no master should be running
  def _check_ip():
    if utils.TcpPing(master_ip, constants.DEFAULT_NODED_PORT):
      raise utils.RetryAgain()
  try:
    utils.Retry(_check_ip, (1, 1.5, 5), total_timeout)
  except utils.RetryTimeout:
    logging.warning("The master IP is still reachable after %s seconds,"
                    " continuing but activating the master on the current"
                    " node will probably fail", total_timeout)
  # instantiate a real config writer, as we now know we have the
  # configuration data
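
For readers without the Ganeti tree at hand, the retry pattern in this hunk can be sketched in standalone Python. This is an illustration only, not the code in `utils`: the names `tcp_ping`, `retry` and `wait_until_unreachable` are hypothetical stand-ins, and the sketch assumes the delay tuple `(1, 1.5, 5)` is read as (start delay, growth factor, maximum delay) in seconds.

```python
import socket
import time


class RetryAgain(Exception):
    """Raised by the checked function to request another attempt."""


class RetryTimeout(Exception):
    """Raised by retry() once the total timeout has elapsed."""


def tcp_ping(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def retry(fn, delay, timeout, sleep=time.sleep, clock=time.monotonic):
    """Call fn until it stops raising RetryAgain or the timeout expires.

    delay is assumed to be (start, factor, limit): the first pause lasts
    `start` seconds, each later pause is multiplied by `factor`, and no
    pause exceeds `limit` or the time remaining before the deadline.
    """
    start, factor, limit = delay
    deadline = clock() + timeout
    current = start
    while True:
        try:
            return fn()
        except RetryAgain:
            pass
        remaining = deadline - clock()
        if remaining <= 0:
            raise RetryTimeout()
        sleep(min(current, remaining))
        current = min(current * factor, limit)


def wait_until_unreachable(ip, port, total_timeout=30):
    """Return True once ip:port stops answering, False on timeout."""
    def _check_ip():
        # A successful connection means the old address is still live.
        if tcp_ping(ip, port):
            raise RetryAgain()

    try:
        retry(_check_ip, (1, 1.5, 5), total_timeout)
        return True
    except RetryTimeout:
        return False
```

With these assumptions, a caller would warn and continue when `wait_until_unreachable(master_ip, port)` returns `False`, which mirrors the `logging.warning` branch in the hunk. The `sleep` and `clock` parameters exist only so the backoff can be exercised without real waiting.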