Commit a6ab004b authored by Iustin Pop's avatar Iustin Pop
Browse files

LUDiagnoseOS: change locking and error handling

Since the “list OSes” call is exported via RAPI, this can be used pretty
easily to DOS the master daemon during long jobs.

The implementation of LUDiagnoseOS makes an RPC call to all nodes; we
lock nodes here in order to prevent node removal.

However, after closer examination, the worst case is:
  - we get the list of nodes from the config
  - another thread removes a node
  - our RPC queries reach the removed node

As this point, if ganeti-noded is stopped or doesn't accept our queries,
the RPC call will return failed, and in the current implementation all
OSes will become invalid.

If we change the ‘failed RPC’ handling to ignore such nodes, this allows
us to both remove locking, and to handle transient RPC failures better
(not invalidating all OSes).

This patch does both these things, with a single drawback: in gnt-os
diagnose, the down nodes do not appear at all. I think this is a small
drawback, and the alternative is to add them with status failed; this
works (3-line patch), but then the output of “list” and “diagnose” will
no longer be consistent. As such, my proposal is to not list the nodes.

Reviewed-by: ultrotter
parent ea9ddc07
......@@ -1682,9 +1682,11 @@ class LUDiagnoseOS(NoHooksLU):
selected=self.op.output_fields)
# Lock all nodes, in shared mode
# Temporary removal of locks, should be reverted later
# TODO: reintroduce locks when they are lighter-weight
self.needed_locks = {}
self.share_locks[locking.LEVEL_NODE] = 1
self.needed_locks[locking.LEVEL_NODE] = locking.ALL_SET
#self.share_locks[locking.LEVEL_NODE] = 1
#self.needed_locks[locking.LEVEL_NODE] = locking.ALL_SET
def CheckPrereq(self):
"""Check prerequisites.
......@@ -1708,6 +1710,11 @@ class LUDiagnoseOS(NoHooksLU):
"""
all_os = {}
# we build here the list of nodes that didn't fail the RPC (at RPC
# level), so that nodes with a non-responding node daemon don't
# make all OSes invalid
good_nodes = [node_name for node_name in rlist
if not rlist[node_name].failed]
for node_name, nr in rlist.iteritems():
if nr.failed or not nr.data:
continue
......@@ -1716,7 +1723,7 @@ class LUDiagnoseOS(NoHooksLU):
# build a list of nodes for this os containing empty lists
# for each node in node_list
all_os[os_obj.name] = {}
for nname in node_list:
for nname in good_nodes:
all_os[os_obj.name][nname] = []
all_os[os_obj.name][node_name].append(os_obj)
return all_os
......@@ -1725,9 +1732,7 @@ class LUDiagnoseOS(NoHooksLU):
"""Compute the list of OSes.
"""
node_list = self.acquired_locks[locking.LEVEL_NODE]
valid_nodes = [node for node in self.cfg.GetOnlineNodeList()
if node in node_list]
valid_nodes = [node for node in self.cfg.GetOnlineNodeList()]
node_data = self.rpc.call_os_diagnose(valid_nodes)
if node_data == False:
raise errors.OpExecError("Can't gather the list of OSes")
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment