diff --git a/Makefile.am b/Makefile.am
index 3e2eba4f7dc98003ece96c4acef94a65c0968dac..ee420b7d8a8a86b8ab151db776c3cf38025ba297 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -331,6 +331,7 @@ docrst = \
 	doc/design-network.rst \
 	doc/design-chained-jobs.rst \
 	doc/design-ovf-support.rst \
+	doc/design-autorepair.rst \
 	doc/design-resource-model.rst \
 	doc/cluster-merge.rst \
 	doc/design-shared-storage.rst \
diff --git a/doc/design-autorepair.rst b/doc/design-autorepair.rst
new file mode 100644
index 0000000000000000000000000000000000000000..480979bc135dd22e7afa033d105ee3158ffdc18b
--- /dev/null
+++ b/doc/design-autorepair.rst
@@ -0,0 +1,313 @@
+====================
+Instance auto-repair
+====================
+
+.. contents:: :depth: 4
+
+This is a design document detailing the implementation of self-repair
+and recreation of instances in Ganeti. It also discusses ideas that
+might be useful for future self-repair situations.
+
+Current state and shortcomings
+==============================
+
+Ganeti currently doesn't do any sort of self-repair or self-recreation
+of instances:
+
+- If a drbd instance is broken (its primary or secondary nodes go
+  offline or need to be drained) an admin or an external tool must fail
+  it over if necessary, and then trigger a disk replacement.
+- If a plain instance is broken (or both nodes of a drbd instance are)
+  an admin or an external tool must recreate its disk and reinstall it.
+
+Moreover, in an oversubscribed cluster the operations mentioned above
+might fail for lack of capacity until a node is repaired or a new one
+is added. In this case an external tool would also need to go through
+any "pending-recreate" or "pending-repair" instances and fix them.
+
+Proposed changes
+================
+
+We'd like to increase the self-repair capabilities of Ganeti, at least
+with regards to instances. In order to do so we plan to add mechanisms
+to mark an instance as "due for being repaired", and then have the
+relevant repair performed as soon as it's possible on the cluster.
+
+The self-repair code will be written as part of ganeti-watcher, or as
+an extra watcher component that is called less often.
+
+As a first version we'll only handle the case in which an instance
+lives on an offline or drained node. In the future we may add more
+self-repair capabilities for errors ganeti can detect.
+
+New attributes (or tags)
+------------------------
+
+In order to know when to perform a self-repair operation we need to
+know whether repairs are allowed by the cluster administrator.
+
+This can be implemented as either new attributes or tags. Tags could be
+acceptable as they would only be read and interpreted by the
+self-repair tool (part of the watcher), and not by the ganeti core
+opcodes and node rpcs. The following tags would be needed:
+
+ganeti:watcher:autorepair:<type>
+++++++++++++++++++++++++++++++++
+
+(instance/nodegroup/cluster)
+Allow repairs to happen on an instance that has the tag, or that lives
+in a cluster or nodegroup which does. Types of repair are in order of
+perceived risk, lower to higher, and each type includes allowing the
+operations in the lower ones:
+
+- ``fix-storage`` allows a disk replacement or another operation that
+  fixes the instance backend storage without affecting the instance
+  itself. This can for example recover from a broken drbd secondary,
+  but risks data loss if something is wrong on the primary but the
+  secondary was somehow recoverable.
+- ``migrate`` allows an instance migration. This can recover from a
+  drained primary, but can cause an instance crash in some cases
+  (bugs).
+- ``failover`` allows instance reboot on the secondary. This can
+  recover from an offline primary, but the instance will lose its
+  running state.
+- ``reinstall`` allows disks to be recreated and an instance to be
+  reinstalled. This can recover from primary&secondary both being
+  offline, or from an offline primary in the case of non-redundant
+  instances. It causes data loss.
+
+Each repair type allows all the operations in the previous types, in
+the order above, in order to ensure a repair can be completed fully. As
+such a repair of a lower type might not be able to proceed if it
+detects an error condition that requires a more risky or drastic
+solution, but never vice versa (if a worse solution is allowed then so
+is a better one).
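+
+This ordering can be implemented as a simple position comparison over
+the list of known repair types. A minimal sketch of such a check (the
+constant and function names are illustrative only, not part of the
+design)::
+
+  AUTOREPAIR_TYPES = ["fix-storage", "migrate", "failover", "reinstall"]
+
+  def TypeAllows(tag_type, op_type):
+    """Check whether a tag of tag_type permits an operation of op_type.
+
+    Each type implicitly allows everything permitted by the types
+    preceding it in AUTOREPAIR_TYPES.
+
+    """
+    return (AUTOREPAIR_TYPES.index(tag_type) >=
+            AUTOREPAIR_TYPES.index(op_type))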
+
+ganeti:watcher:autorepair:suspend[:<timestamp>]
++++++++++++++++++++++++++++++++++++++++++++++++
+
+(instance/nodegroup/cluster)
+If this tag is encountered no autorepair operations will start for the
+instance (or for any instance, if present at the cluster or group
+level). Any job which already started will be allowed to finish, but
+then the autorepair system will not proceed further until this tag is
+removed, or the timestamp passes (in which case the tag will be removed
+automatically by the watcher).
+
+Note that depending on how this tag is used there might still be race
+conditions related to it for an external tool that uses it
+programmatically, as no "lock tag" or tag "test-and-set" operation is
+present at this time. While this is known, we won't solve these race
+conditions in the first version.
+
+It might also be useful to have an easy way to tag all instances
+matching a filter on some characteristic. But again, this wouldn't be
+specific to this tag.
+
+ganeti:watcher:repair:pending:<type>:<id>:<timestamp>:<jobs>
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+
+(instance)
+If this tag is present a repair of type ``type`` is pending on the
+target instance. This means that either jobs are being run, or it's
+waiting for resource availability. ``id`` is the unique id identifying
+this repair, ``timestamp`` is the time when this tag was first applied
+to this instance for this ``id`` (we will "update" the tag by adding a
+"new copy" of it and removing the old version as we run more jobs, but
+the timestamp will never change for the same repair).
+
+``jobs`` is the list of jobs already run or being run to repair the
+instance. If the instance has just been put in pending state but no job
+has run yet, this list is empty.
+
+This tag will be set by ganeti if an equivalent autorepair tag is
+present and a repair is needed, or can be set by an external tool to
+request a repair as a "one-off".
+
+If multiple instances of this tag are present they will be handled in
+order of timestamp.
+
+ganeti:watcher:repair:result:<type>:<id>:<timestamp>:<result>:<jobs>
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+
+(instance)
+If this tag is present a repair of type ``type`` has been performed on
+the instance and has been completed by ``timestamp``. The result is
+either ``success``, ``failure`` or ``enoperm``, and ``jobs`` is a
+comma-separated list of jobs that were executed for this repair.
+
+An ``enoperm`` result is returned when the repair was pushed as far as
+possible, but the allowed repair type doesn't permit proceeding any
+further.
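+
+Since the ``pending`` and ``result`` tags pack several fields into one
+colon-separated string, the watcher will need a small parsing helper.
+A possible sketch, assuming the ``jobs`` field is always present even
+when empty (names and error handling are illustrative only)::
+
+  PENDING_PREFIX = "ganeti:watcher:repair:pending:"
+
+  def ParsePendingTag(tag):
+    """Return (type, id, timestamp, jobs) from a pending-repair tag.
+
+    The jobs field is a comma-separated list of job ids and may be
+    empty if no job has been submitted yet.
+
+    """
+    assert tag.startswith(PENDING_PREFIX)
+    rtype, rid, timestamp, jobs = tag[len(PENDING_PREFIX):].split(":", 3)
+    return (rtype, rid, float(timestamp),
+            [int(job) for job in jobs.split(",") if job])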
+
+Possible states and transitions
+-------------------------------
+
+At any point an instance can be in one of the following health states:
+
+Healthy
++++++++
+
+The instance lives on only online nodes. The autorepair system will
+never touch these instances. Any ``repair:pending`` tags will be
+removed and marked as ``success``, with no jobs attached to them.
+
+This state can transition to:
+
+- Needs-repair, repair disallowed (node offlined or drained, no
+  autorepair tag)
+- Needs-repair, autorepair allowed (node offlined or drained,
+  autorepair tag present)
+- Suspended (a suspend tag is added)
+
+Suspended
++++++++++
+
+Whenever an ``autorepair:suspend`` tag is added the autorepair code
+won't touch the instance until the timestamp on the tag has passed, if
+present. The tag will be removed afterwards (and the instance will
+transition to its correct state, depending on its health and other
+tags).
+
+Note that when an instance is suspended any pending repair is
+interrupted, but jobs which were submitted before the suspension are
+allowed to finish.
+
+Needs-repair, repair disallowed
++++++++++++++++++++++++++++++++
+
+The instance lives on an offline or drained node, but no autorepair tag
+is set, or the autorepair tag set is of a type not powerful enough to
+finish the repair. The autorepair system will never touch these
+instances, and they can transition to:
+
+- Healthy (manual repair)
+- Pending repair (a ``repair:pending`` tag is added)
+- Needs-repair, autorepair allowed (an autorepair tag is added)
+- Suspended (a suspend tag is added)
+
+Needs-repair, autorepair allowed
+++++++++++++++++++++++++++++++++
+
+A ``repair:pending`` tag is added, and the instance transitions to the
+Pending repair state. The autorepair tag is preserved.
+
+Of course if an ``autorepair:suspend`` tag is found no pending tag will
+be added, and the instance will instead transition to the Suspended
+state.
+
+Pending repair
+++++++++++++++
+
+When an instance is in this state the following will happen (a sketch
+of this decision procedure is given after the list below):
+
+If an ``autorepair:suspend`` tag is found the instance won't be touched
+and will be moved to the Suspended state. Any jobs which were already
+running will be left untouched.
+
+If there are still jobs running related to the instance and scheduled
+by this repair they will be given more time to run, and the instance
+will be checked again later. The state transitions to itself.
+
+If no jobs are running and the instance is detected to be healthy, the
+``repair:result`` tag will be added, and the current active
+``repair:pending`` tag will be removed. It will then transition to the
+Healthy state if there are no ``repair:pending`` tags, or to the
+Pending repair state otherwise: there, the instance being healthy,
+those tags will be resolved without any operation as well (note that
+this is the same as transitioning to the Healthy state, where
+``repair:pending`` tags would also be resolved).
+
+If no jobs are running and the instance still has issues:
+
+- if the last job(s) failed it can either be retried a few times, if
+  deemed to be safe, or the repair can transition to the Failed state.
+  The ``repair:result`` tag will be added, and the active
+  ``repair:pending`` tag will be removed (further ``repair:pending``
+  tags will not be able to proceed, as explained by the Failed state,
+  until the failure state is cleared)
+- if the last job(s) succeeded but there are not enough resources to
+  proceed, the state will transition to itself and no jobs are
+  scheduled. The tag is left untouched (and checked again later). This
+  basically just delays the repair: the current ``pending`` tag stays
+  active, and any others are untouched.
+- if the last job(s) succeeded but the repair type doesn't allow
+  proceeding any further the ``repair:result`` tag is added with an
+  ``enoperm`` result, and the current ``repair:pending`` tag is
+  removed. The instance is now back to "Needs-repair, repair
+  disallowed", "Needs-repair, autorepair allowed", or "Pending repair"
+  if there is already a future tag that can repair the instance.
+- if the last job(s) succeeded and the repair can continue new job(s)
+  can be submitted, and the ``repair:pending`` tag can be updated.
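+
+The decision procedure above can be summarised as a pure function from
+the observed conditions to the action to take. A sketch, with the
+inputs reduced to booleans for illustration (this is not a fixed
+interface)::
+
+  def NextAction(suspended, jobs_running, healthy, last_jobs_failed,
+                 enough_resources, type_allows_more):
+    """Return the action for one instance in the Pending repair state."""
+    if suspended:
+      return "move-to-suspended"   # running jobs are left alone
+    if jobs_running:
+      return "wait"                # state transitions to itself
+    if healthy:
+      return "tag-result-success"  # and drop the pending tag
+    if last_jobs_failed:
+      return "tag-result-failure"  # or retry, if deemed safe
+    if not enough_resources:
+      return "delay"               # tags are left untouched
+    if not type_allows_more:
+      return "tag-result-enoperm"  # and drop the pending tag
+    return "submit-more-jobs"      # and update the pending tag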
+
+Failed
+++++++
+
+If repairing an instance has failed a ``repair:result:failure`` tag is
+added. The presence of this tag is used to detect that an instance is
+in this state, and it will not be touched until the failure is
+investigated and the tag is removed.
+
+An external tool or person needs to investigate the state of the
+instance and remove this tag when they are sure the instance is
+repaired and can safely be returned to the normal autorepair system.
+
+(Alternatively we can use the suspended state (indefinitely or
+temporarily) to mark the instance as "do not touch" when we think a
+human needs to look at it. To be decided.)
+
+Repair operation
+----------------
+
+Possible repairs are:
+
+- Replace-disks (drbd, if the secondary is down), or other storage
+  specific fixes
+- Migrate (shared storage, rbd, drbd, if the primary is drained)
+- Failover (shared storage, rbd, drbd, if the primary is down)
+- Recreate disks + reinstall (all nodes down, plain, files or drbd)
+
+Note that more than one of these operations may need to happen before a
+full repair is completed (eg. if a drbd primary goes offline first a
+failover will happen, then a replace-disks).
+
+The self-repair tool will first take care of all needs-repair instances
+that can be brought into ``pending`` state, and transition them as
+described above.
+
+Then it will go through any ``repair:pending`` instances and handle
+them as described above.
+
+Note that the repair tool MAY "group" instances by performing common
+repair jobs for them (eg: node evacuate).
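+
+The mapping from these repair operations to the tag types that permit
+them could then reuse the ordering sketched earlier (the opcode names
+below follow recent Ganeti conventions, but the grouping is
+illustrative only, not a commitment)::
+
+  AUTOREPAIR_OPS = {
+    "fix-storage": ["OpInstanceReplaceDisks"],
+    "migrate": ["OpInstanceMigrate"],
+    "failover": ["OpInstanceFailover"],
+    "reinstall": ["OpInstanceRecreateDisks", "OpInstanceReinstall"],
+  }
+
+  def AllowedOps(tag_type):
+    """All operations permitted by a tag, including lower-risk types."""
+    idx = AUTOREPAIR_TYPES.index(tag_type)
+    return [op for rtype in AUTOREPAIR_TYPES[:idx + 1]
+            for op in AUTOREPAIR_OPS[rtype]]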
+
+Staging of work
+---------------
+
+- First version: recreate-disks + reinstall (2.6.1)
+- Second version: failover and migrate repairs (2.7)
+- Third version: replace disks repair (2.7 or 2.8)
+
+Future work
+===========
+
+One important piece of work will be reporting what the autorepair
+system is "thinking" and exporting this in a form that can be read by
+an outside user or system. In order to do this we need a better
+communication system than embedding this information into tags. This
+should be designed in an extensible way that can be used in general for
+Ganeti to provide "advisory" information about entities it manages, and
+for an external system to "advise" ganeti over what it can do, but in a
+less direct manner than submitting individual jobs.
+
+Note that cluster verify checks some errors that are actually
+instance-specific (eg. a missing backend disk on a drbd node) or
+node-specific (eg. an extra lvm device). If we were to split these into
+"instance verify", "node verify" and "cluster verify", then we could
+easily use this tool to perform some of those repairs as well.
+
+Finally, self-repairs could also be extended to the cluster level (for
+example concepts like "N+1 failures", missing master candidates, etc.)
+or to the node level for some specific types of errors.
+
+.. vim: set textwidth=72 :
+.. Local Variables:
+.. mode: rst
+.. fill-column: 72
+.. End:
diff --git a/doc/design-draft.rst b/doc/design-draft.rst
index 629a386c212f6010ddcd6225df68e3f1d4b514a7..33d5bac0211ba69b8b55ce011c629587fb12a825 100644
--- a/doc/design-draft.rst
+++ b/doc/design-draft.rst
@@ -15,6 +15,7 @@ Design document drafts
    design-resource-model.rst
    design-virtual-clusters.rst
    design-query-splitting.rst
+   design-autorepair.rst
 
 .. vim: set textwidth=72 :
 .. Local Variables: