RPC: mark jobqueue functions as URGENT
Recently, we've seen more and more cases of a specific breakage pattern in Ganeti: master candidates which are semi-alive (they respond to ping and can complete a TCP/SSL handshake, but their root filesystem is otherwise broken) cause lots of confusion within masterd.

My analysis shows that waiting up to 5 minutes for a reply from such a broken master candidate is too long, and this long wait breaks other timeouts (e.g. the Luxi timeout), making standard recovery from this situation very hard. Currently it is much easier to kill the master daemon, manually edit the config file to mark the node as regular, and then restart the master daemon.

The proposal is therefore to reduce the timeout for the job queue functions to TMO_URGENT (1 minute), which should strike a better balance between a working but overloaded node and a broken node.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
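
For illustration only, the effect of the proposal can be sketched as a per-call timeout table. This is not the actual Ganeti code: the call names, the TMO_FAST name, the table layout and both helper functions are assumptions made for the example. Only the two durations (5 minutes before, 1 minute for TMO_URGENT) come from the message above.

```python
# Illustrative sketch; names and structure are assumptions, not Ganeti's code.
TMO_URGENT = 60      # 1 minute, the timeout proposed for job queue calls
TMO_FAST = 5 * 60    # 5 minutes, the wait described above as too long

# Hypothetical RPC names mapped to the timeout each call is allowed.
TIMEOUTS_BEFORE = {
    "jobqueue_update": TMO_FAST,
    "jobqueue_rename": TMO_FAST,
    "some_other_call": TMO_FAST,
}


def mark_jobqueue_urgent(timeouts, prefix="jobqueue_"):
    """Return a copy of the table with job queue calls set to TMO_URGENT."""
    return {
        name: (TMO_URGENT if name.startswith(prefix) else tmo)
        for name, tmo in timeouts.items()
    }


def worst_case_wait(timeouts, calls):
    """Upper bound on time spent waiting if every listed call times out
    one after another against a semi-alive node."""
    return sum(timeouts[name] for name in calls)


TIMEOUTS_AFTER = mark_jobqueue_urgent(TIMEOUTS_BEFORE)
```

Under these assumptions, two consecutive job queue calls to a broken master candidate can block for up to 600 seconds before the change and up to 120 seconds after it, while calls outside the job queue keep their previous timeout.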