    LUDiagnoseOS: change locking and error handling · a6ab004b
    Iustin Pop authored
    Since the “list OSes” call is exported via RAPI, it can be used pretty
    easily to DoS the master daemon during long jobs.
    
    The implementation of LUDiagnoseOS makes an RPC call to all nodes; we
    lock nodes here in order to prevent node removal.
    
    However, after closer examination, the worst case is:
      - we get the list of nodes from the config
      - another thread removes a node
      - our RPC queries reach the removed node
    
    At this point, if ganeti-noded is stopped or doesn't accept our queries,
    the RPC call will return failed, and in the current implementation all
    OSes will become invalid.
    
    If we change the ‘failed RPC’ handling to ignore such nodes, we can
    both remove the locking and handle transient RPC failures better
    (not invalidating all OSes).
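
The idea of ignoring failed nodes can be sketched as below. This is only an illustration, not the code in cmdlib.py: the NodeResult class, the diagnose_os function and the rule "an OS is valid when every responding node reports it" are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class NodeResult:
    """Hypothetical stand-in for one node's RPC reply (not a Ganeti class)."""
    failed: bool
    os_names: list = field(default_factory=list)


def diagnose_os(rpc_results):
    """Return {os_name: is_valid}, skipping nodes whose RPC call failed.

    Assumed rule for this sketch: an OS is valid when every node that
    answered reports it. Failed nodes are left out of the result entirely,
    so a single unreachable node no longer invalidates every OS.
    """
    # Keep only the nodes that answered; failed nodes are ignored.
    responding = {name: res for name, res in rpc_results.items()
                  if not res.failed}

    # Collect, for each OS, the set of responding nodes that have it.
    nodes_by_os = {}
    for node_name, res in responding.items():
        for os_name in res.os_names:
            nodes_by_os.setdefault(os_name, set()).add(node_name)

    all_responding = set(responding)
    return {os_name: nodes == all_responding
            for os_name, nodes in nodes_by_os.items()}


# node3 was removed (or its daemon is down) between reading the config
# and sending the RPC, so its call failed.
results = {
    "node1": NodeResult(failed=False, os_names=["debian"]),
    "node2": NodeResult(failed=False, os_names=["debian", "fedora"]),
    "node3": NodeResult(failed=True),
}
status = diagnose_os(results)
```

In this sketch the failed node3 does not appear in the computation at all, which mirrors the drawback described in this message: down nodes are not listed.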
    
    This patch does both these things, with a single drawback: in gnt-os
    diagnose, the down nodes do not appear at all. I think this is a small
    drawback, and the alternative is to add them with status failed; this
    works (3-line patch), but then the output of “list” and “diagnose” will
    no longer be consistent. As such, my proposal is to not list the nodes.
    
    Reviewed-by: ultrotter
cmdlib.py 242 KB