LUDiagnoseOS: change locking and error handling

Since the “list OSes” call is exported via RAPI, this can be used pretty
easily to DoS the master daemon during long jobs.

The implementation of LUDiagnoseOS makes an RPC call to all nodes; we
lock nodes here in order to prevent node removal. However, after closer
examination, the worst case is:
- we get the list of nodes from the config
- another thread removes a node
- our RPC queries reach the removed node

At this point, if ganeti-noded is stopped or doesn't accept our queries,
the RPC call will return failed, and in the current implementation all
OSes will become invalid.

If we change the ‘failed RPC’ handling to ignore such nodes, this allows
us both to remove locking and to handle transient RPC failures better
(not invalidating all OSes).

This patch does both these things, with a single drawback: in gnt-os
diagnose, the down nodes do not appear at all. I think this is a small
drawback. The alternative is to add them with status failed; this works
(3-line patch), but then the output of “list” and “diagnose” will no
longer be consistent. As such, my proposal is to not list the nodes.

Reviewed-by: ultrotter
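
For illustration only, the "ignore failed nodes" handling described above could look roughly like the sketch below. This is not the code from the patch: NodeResult and collect_os_by_node are hypothetical names, and the real Ganeti RPC result type and LUDiagnoseOS internals may differ.

```python
from dataclasses import dataclass, field


@dataclass
class NodeResult:
    """Hypothetical stand-in for one node's RPC reply (not Ganeti's real type)."""
    failed: bool
    os_names: list = field(default_factory=list)


def collect_os_by_node(rpc_results):
    """Map each OS name to the nodes that reported it, skipping failed nodes.

    A node whose RPC failed (removed, down, or not answering) is ignored
    instead of being treated as evidence that every OS is invalid.
    """
    os_to_nodes = {}
    for node_name, result in rpc_results.items():
        if result.failed:
            continue  # skip the node entirely; it will not appear in the output
        for os_name in result.os_names:
            os_to_nodes.setdefault(os_name, []).append(node_name)
    return os_to_nodes


# Example: node2 was removed or is down while the query was in flight.
results = {
    "node1": NodeResult(failed=False, os_names=["debian", "lenny"]),
    "node2": NodeResult(failed=True),
    "node3": NodeResult(failed=False, os_names=["debian"]),
}
os_map = collect_os_by_node(results)
```

Because a failed node is simply skipped, a node that disappears between reading the config and sending the RPC cannot invalidate the OSes reported by the remaining nodes, which is why the lock is no longer needed in this sketch.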