Commit d2cd6944 authored by Iustin Pop

RPC: mark jobqueue functions as URGENT


Recently, we've seen more and more cases of a specific breakage
pattern in Ganeti: master candidates which are semi-alive (they
respond to ping and can complete a TCP/SSL handshake, but their
root filesystem is broken) cause lots of confusion within masterd.

My analysis shows that waiting up to 5 minutes for a reply from such a
broken master candidate is too long, and this long wait breaks other
timeouts (e.g. the Luxi timeout), making standard recovery from this
situation very hard. It's much easier to kill the master daemon,
manually edit the config file to mark the node as regular, then
restart the master daemon.

The proposal is therefore to reduce the timeout for the job queue
functions to TMO_URGENT (1 minute), which should strike a better
balance between a working but overloaded node and a broken node.
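
As a rough illustration of the idea, here is a minimal sketch of a decorator that attaches a timeout to an RPC-style function. It is not the actual Ganeti code: `rpc_timeout`, `read_timeout` and the stand-in `call_jobqueue_update` are hypothetical names, and the 5-minute value for the previous timeout is an assumption inferred from the "up to 5 minutes" wait described above.

```python
# Assumed values, in seconds. The commit message gives 1 minute for the
# urgent timeout; the 5-minute value is inferred, not confirmed.
TMO_URGENT = 60
TMO_FAST = 5 * 60


def rpc_timeout(secs):
    """Hypothetical decorator that records a timeout on an RPC function."""
    def decorator(fn):
        # Store the timeout as an attribute so a caller can look it up
        # before dispatching the request.
        fn.timeout = secs
        return fn
    return decorator


@rpc_timeout(TMO_URGENT)  # previously this would have been TMO_FAST
def call_jobqueue_update(node_list, file_name, content):
    """Stand-in for the real call; returns a summary of what it would send."""
    return (node_list, file_name, len(content))


def read_timeout(fn):
    """Return the timeout recorded on fn, or None if it has none."""
    return getattr(fn, "timeout", None)
```

With a scheme like this, switching a function from the longer timeout to the shorter one is a one-line change at the decorator, which matches the shape of the diff in this commit.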

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
parent 362c5845
 #
 #
-# Copyright (C) 2006, 2007, 2008, 2009, 2010 Google Inc.
+# Copyright (C) 2006, 2007, 2008, 2009, 2010, 2011 Google Inc.
 #
 # This program is free software; you can redistribute it and/or modify
 # it under the terms of the GNU General Public License as published by
@@ -1401,7 +1401,7 @@ class RpcRunner(object):
         [old_file_storage_dir, new_file_storage_dir])
   @classmethod
-  @_RpcTimeout(_TMO_FAST)
+  @_RpcTimeout(_TMO_URGENT)
   def call_jobqueue_update(cls, node_list, address_list, file_name, content):
     """Update job queue.
@@ -1423,7 +1423,7 @@ class RpcRunner(object):
     return cls._StaticSingleNodeCall(node, "jobqueue_purge", [])
   @classmethod
-  @_RpcTimeout(_TMO_FAST)
+  @_RpcTimeout(_TMO_URGENT)
   def call_jobqueue_rename(cls, node_list, address_list, rename):
     """Rename a job queue file.