diff --git a/doc/design-hroller.rst b/doc/design-hroller.rst
new file mode 100644
index 0000000000000000000000000000000000000000..632531bb91f26da6a5e74be5f64694f7598e928b
--- /dev/null
+++ b/doc/design-hroller.rst
@@ -0,0 +1,154 @@
+============
+HRoller tool
+============
+
+.. contents:: :depth: 4
+
+This is a design document detailing the cluster maintenance scheduler,
+HRoller.
+
+
+Current state and shortcomings
+==============================
+
+To enable automating cluster-wide reboots, a new htool called HRoller
+was added to Ganeti starting from version 2.7. This tool helps
+parallelize cluster offline maintenances by calculating which nodes do
+not hold both the primary and the secondary of any DRBD instance, and
+thus can be rebooted at the same time, once all instances are down.
+
+The way this is done is documented in the :manpage:`hroller(1)`
+manpage.
+
+We would now like to perform online maintenance on the cluster by
+rebooting nodes after evacuating their primary instances (rolling
+reboots).
+
+Proposed changes
+================
+
+
+Calculating rolling maintenances
+--------------------------------
+
+In order to perform rolling maintenance we need to migrate instances
+off the nodes before a reboot. How this can be done depends on the
+instance's disk template and status:
+
+Down instances
+++++++++++++++
+
+If an instance was shut down when the maintenance started, it will be
+ignored. This avoids needlessly moving its primary around, since the
+instance won't suffer any downtime anyway.
+
+
+DRBD
+++++
+
+Each node must have all of its primary instances migrated to their
+secondaries; it can then either be rebooted, or its secondary
+instances can be evacuated as well.
+
+Since currently doing a ``replace-disks`` on DRBD breaks redundancy,
+it's not any safer than temporarily rebooting a node that holds
+secondaries (citation needed). As such we'll implement, for now, just
+the "migrate+reboot" mode, and focus on replace-disks later.
+
+In order to do that we can use the following algorithm:
+
+1) Compute node sets that don't contain both the primary and the
+   secondary of any instance. This can already be done by the current
+   hroller graph coloring algorithm: nodes are in the same set (color)
+   if and only if no edge (instance) exists between them (see the
+   :manpage:`hroller(1)` manpage for more details).
+2) Inside each node set, calculate subsets that don't have any
+   secondary node in common (this can be done by creating a graph of
+   nodes that are connected if and only if an instance on both has the
+   same secondary node, and coloring that graph).
+3) All nodes in a subset computed at step 2 can then be migrated away
+   from in parallel, rebooted (or otherwise maintained), and have
+   their original primaries migrated back. Migrating the primaries
+   back allows the computation above to be reused for each following
+   subset without triggering N+1 failures, provided none were present
+   before. See below for the actual execution of the maintenance.
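+
+For illustration only, here is a minimal Python sketch of the two-step
+coloring described above. It is not the actual hroller implementation,
+and the input layout (a mapping from instance names to hypothetical
+``(primary, secondary)`` node pairs) is an assumption made purely for
+the example::
+
+  # Illustrative sketch only, not the hroller code: a greedy two-level
+  # coloring over hypothetical (primary, secondary) node pairs.
+  from collections import defaultdict
+
+  def greedy_coloring(nodes, edges):
+      """Give each node the smallest color unused by its neighbours."""
+      neighbours = defaultdict(set)
+      for a, b in edges:
+          neighbours[a].add(b)
+          neighbours[b].add(a)
+      colors = {}
+      for node in nodes:
+          used = {colors[n] for n in neighbours[node] if n in colors}
+          color = 0
+          while color in used:
+              color += 1
+          colors[node] = color
+      groups = defaultdict(list)
+      for node, color in colors.items():
+          groups[color].append(node)
+      return list(groups.values())
+
+  def rolling_sets(nodes, instances):
+      """instances maps an instance name to its (pri, sec) node pair."""
+      # Step 1: two nodes conflict if one holds the primary and the
+      # other the secondary of the same instance.
+      node_sets = greedy_coloring(nodes, list(instances.values()))
+      result = []
+      for node_set in node_sets:
+          # Step 2: within a set, two nodes conflict if instances on
+          # both of them share the same secondary node.
+          sec_to_pris = defaultdict(set)
+          for pri, sec in instances.values():
+              if pri in node_set:
+                  sec_to_pris[sec].add(pri)
+          edges = [(a, b) for pris in sec_to_pris.values()
+                   for a in pris for b in pris if a < b]
+          result.append(greedy_coloring(node_set, edges))
+      return result
+
+  nodes = ["node1", "node2", "node3", "node4"]
+  instances = {"inst1": ("node1", "node2"),
+               "inst2": ("node3", "node2"),
+               "inst3": ("node4", "node3")}
+  print(rolling_sets(nodes, instances))
+
+Each innermost list contains nodes whose primaries can be migrated
+away, and which can therefore be maintained, at the same time.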
+
+Non-DRBD
+++++++++
+
+All non-DRBD disk templates that can be migrated have no "secondary"
+concept. As such, instances can be migrated to any node (in the same
+nodegroup). In order to do the job we can either:
+
+- Perform migrations on one node at a time, perform the maintenance on
+  that node, and proceed (the node will then be targeted again to host
+  instances automatically, as hail chooses targets for the instances
+  among all nodes in a group). Nodes in different nodegroups can be
+  handled in parallel.
+- Perform migrations on one node at a time, but without waiting for
+  the first node to come back before proceeding. This allows us to
+  continue, at the cost of reduced cluster capacity, until no more
+  capacity is available in the nodegroup; at that point we have to
+  wait for some nodes to come back so that there is room again for the
+  last few nodes.
+- Pre-calculate sets of nodes that can be migrated together (probably
+  with a greedy algorithm) and parallelize within each set, using the
+  migrate-back approach discussed for DRBD so that the calculation
+  needs to be performed only once.
+
+Note that for non-DRBD disks that still use local storage (e.g. RBD
+and plain) redundancy might break anyway, and nothing except the first
+algorithm might be safe. This would perhaps be a good reason to
+consider better management of RBD pools, if those are implemented on
+top of node storage rather than on dedicated storage machines.
+
+Executing rolling maintenances
+------------------------------
+
+Hroller accepts commands that allow it to run the maintenance
+automatically. These commands are executed on the machine hroller runs
+on and take a node name as input. They then have to gain access to the
+target node (via ssh, restricted commands, or some other means) and
+perform their duty.
+
+1) A command (--check-cmd) will be called on all selected online nodes
+   to check whether a node needs maintenance. Hroller will proceed
+   only on nodes that respond positively to this invocation.
+   FIXME: decide about -D
+2) Hroller will evacuate the node of all primary instances.
+3) A command (--maint-cmd) will be called on a node to do the actual
+   maintenance operation. It should do any operation needed to perform
+   the maintenance, including triggering the actual reboot.
+4) A command (--verify-cmd) will be called to check that the operation
+   was successful. It has to wait until the target node is back up
+   (and decide after how long it should give up), and then perform the
+   verification. If the verification is not successful, hroller will
+   stop and not proceed with other nodes.
+5) The master node will be kept for last, but will not otherwise be
+   treated specially. If hroller is running on the master node itself,
+   care must be exercised, as its maintenance will interrupt the
+   software itself, and as such the verification step will not happen.
+   This will not be taken care of automatically in the first version.
+   An additional flag to simply skip the master node will be available
+   as well, in case that is preferred.
+
+
+Future work
+===========
+
+``replace-disks`` functionality for DRBD nodes should be implemented.
+Note that once we support a DRBD version that allows multiple
+secondaries, this can be done safely, without losing replication at
+any time, by adding a temporary secondary and dropping the previous
+one only when the sync has finished.
+
+If/when RBD pools can be managed inside Ganeti, care can be taken so
+that the pool is evacuated from a node as well before the node is put
+into maintenance. This is equivalent to evacuating DRBD secondaries.
+
+Master failovers during the maintenance should be performed by
+hroller. This requires RPC/RAPI support for master failover. Hroller
+should also be modified to better support running on the master itself
+and continuing on the new master.
+
+.. vim: set textwidth=72 :
+.. Local Variables:
+.. mode: rst
+.. fill-column: 72
+.. End: