From 68640987f07ae5991b9b284f823629c402fa7a03 Mon Sep 17 00:00:00 2001
From: Guido Trotter <ultrotter@google.com>
Date: Mon, 23 Jul 2012 16:09:36 +0100
Subject: [PATCH] Instance autorepair design

This design describes a tool that will perform automatic repairs on
instances when they are detected to be unhealthy (living on offline or
drained nodes, at the moment). These repairs can be scheduled
automatically or requested as a one-off by a tool or person.

Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Bernardo Dal Seno <bdalseno@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
---
 Makefile.am               |   1 +
 doc/design-autorepair.rst | 313 ++++++++++++++++++++++++++++++++++++++
 doc/design-draft.rst      |   1 +
 3 files changed, 315 insertions(+)
 create mode 100644 doc/design-autorepair.rst

diff --git a/Makefile.am b/Makefile.am
index f7b2a51a7..0f31795e1 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -335,6 +335,7 @@ docrst = \
 	doc/design-network.rst \
 	doc/design-chained-jobs.rst \
 	doc/design-ovf-support.rst \
+	doc/design-autorepair.rst \
 	doc/design-resource-model.rst \
 	doc/cluster-merge.rst \
 	doc/design-shared-storage.rst \
diff --git a/doc/design-autorepair.rst b/doc/design-autorepair.rst
new file mode 100644
index 000000000..480979bc1
--- /dev/null
+++ b/doc/design-autorepair.rst
@@ -0,0 +1,313 @@
+====================
+Instance auto-repair
+====================
+
+.. contents:: :depth: 4
+
+This is a design document detailing the implementation of self-repair and
+recreation of instances in Ganeti. It also discusses ideas that might be useful
+for further self-repair scenarios in the future.
+
+Current state and shortcomings
+==============================
+
+Ganeti currently doesn't do any sort of self-repair or self-recreate of
+instances:
+
+- If a drbd instance is broken (its primary or secondary node goes
+  offline or needs to be drained) an admin or an external tool must fail
+  it over if necessary, and then trigger a disk replacement.
+- If a plain instance is broken (or both nodes of a drbd instance are)
+  an admin or an external tool must recreate its disk and reinstall it.
+
+Moreover, in an oversubscribed cluster the operations mentioned above
+might fail for lack of capacity until a node is repaired or a new one is
+added. In this case an external tool would also need to go through any
+"pending-recreate" or "pending-repair" instances and fix them.
+
+Proposed changes
+================
+
+We'd like to increase the self-repair capabilities of Ganeti, at least
+with regard to instances. To do so we plan to add mechanisms to mark an
+instance as "due for being repaired", and then to perform the relevant
+repair as soon as that is possible on the cluster.
+
+The self-repair code will be written as part of ganeti-watcher or as an
+extra watcher component that is called less often.
+
+As the first version we'll only handle the case in which an instance
+lives on an offline or drained node. In the future we may add more
+self-repair capabilities for errors ganeti can detect.
+
+New attributes (or tags)
+------------------------
+
+In order to know when to perform a self-repair operation we need to know
+whether it is allowed by the cluster administrator.
+
+This can be implemented as either new attributes or tags. Tags could be
+acceptable as they would only be read and interpreted by the self-repair tool
+(part of the watcher), and not by the ganeti core opcodes and node rpcs. The
+following tags would be needed:
+
+ganeti:watcher:autorepair:<type>
+++++++++++++++++++++++++++++++++
+
+(instance/nodegroup/cluster)
+Allow repairs to happen on an instance that has the tag, or that lives
+in a cluster or nodegroup which does. Types of repair are in order of
+perceived risk, lower to higher, and each type includes allowing the
+operations in the lower ones:
+
+- ``fix-storage`` allows a disk replacement or another operation that
+  fixes the instance backend storage without affecting the instance
+  itself. This can for example recover from a broken drbd secondary, but
+  risks data loss if something is wrong on the primary but the secondary
+  was somehow recoverable.
+- ``migrate`` allows an instance migration. This can recover from a
+  drained primary, but can cause an instance crash in some cases (bugs).
+- ``failover`` allows instance reboot on the secondary. This can recover
+  from an offline primary, but the instance will lose its running state.
+- ``reinstall`` allows disks to be recreated and an instance to be
+  reinstalled. This can recover from the primary and secondary both
+  being offline, or from an offline primary in the case of non-redundant
+  instances. It causes data loss.
+
+Each repair type allows all the operations in the previous types, in the
+order above, in order to ensure a repair can be completed fully. As such
+a repair of a lower type might not be able to proceed if it detects an
+error condition that requires a more risky or drastic solution, but
+never vice versa (if a worse solution is allowed then so is a better
+one).
+
+ganeti:watcher:autorepair:suspend[:<timestamp>]
++++++++++++++++++++++++++++++++++++++++++++++++
+
+(instance/nodegroup/cluster)
+If this tag is encountered no autorepair operations will start for the
+instance (or for any instance, if present at the cluster or group
+level). Any job which already started will be allowed to finish, but
+then the autorepair system will not proceed further until this tag is
+removed, or the timestamp passes (in which case the tag will be removed
+automatically by the watcher).
+
+Note that depending on how this tag is used there might still be race
+conditions related to it for an external tool that uses it
+programmatically, as no "lock tag" or tag "test-and-set" operation is
+present at this time. While this is known we won't solve these race
+conditions in the first version.
+
+It might also be useful to have an operation that easily tags all
+instances matching a filter on some characteristic. But again, this
+wouldn't be specific to this tag.
+
+ganeti:watcher:repair:pending:<type>:<id>:<timestamp>:<jobs>
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+
+(instance)
+If this tag is present a repair of type ``type`` is pending on the
+target instance. This means that either jobs are being run, or it's
+waiting for resource availability. ``id`` is the unique id identifying
+this repair, ``timestamp`` is the time when this tag was first applied
+to this instance for this ``id`` (we will "update" the tag by adding a
+"new copy" of it and removing the old version as we run more jobs, but
+the timestamp will never change for the same repair).
+
+``jobs`` is the list of jobs already run or being run to repair the
+instance. If the instance has just been put in pending state but no job
+has run yet, this list is empty.
+
+This tag will be set by ganeti if an equivalent autorepair tag is
+present and a repair is needed, or can be set by an external tool to
+request a repair as a "one-off".
+
+If multiple instances of this tag are present they will be handled in
+order of timestamp.
+
+ganeti:watcher:repair:result:<type>:<id>:<timestamp>:<result>:<jobs>
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+
+(instance)
+If this tag is present a repair of type ``type`` has been performed on
+the instance and has been completed by ``timestamp``. The result is
+either ``success``, ``failure`` or ``enoperm``, and ``jobs`` is a
+comma-separated list of jobs that were executed for this repair.
+
+An ``enoperm`` result is returned when the repair was carried out as far
+as possible, but the repair type doesn't allow it to proceed further.
+
+Possible states, and transitions
+--------------------------------
+
+At any point an instance can be in one of the following health states:
+
+Healthy
++++++++
+
+The instance lives only on online, non-drained nodes. The autorepair
+system will never touch these instances. Any ``repair:pending`` tags
+will be removed and marked ``success`` with no jobs attached to them.
+
+This state can transition to:
+
+- Needs-repair, repair disallowed (node offlined or drained, no
+  autorepair tag)
+- Needs-repair, repair allowed always (node offlined or drained,
+  autorepair tag present)
+- Suspended (a suspend tag is added)
+
+Suspended
++++++++++
+
+Whenever an ``autorepair:suspend`` tag is added the autorepair code
+won't touch the instance until the timestamp on the tag, if present,
+has passed. The tag will be removed afterwards (and the instance will
+transition to its correct state, depending on its health and other
+tags).
+
+Note that when an instance is suspended any pending repair is
+interrupted, but jobs which were submitted before the suspension are
+allowed to finish.
+
+Needs-repair, repair disallowed
++++++++++++++++++++++++++++++++
+
+The instance lives on an offline or drained node, but no autorepair tag
+is set, or the autorepair tag set is of a type not powerful enough to
+finish the repair. The autorepair system will never touch these
+instances, and they can transition to:
+
+- Healthy (manual repair)
+- Pending repair (a ``repair:pending`` tag is added)
+- Needs-repair, repair allowed always (an autorepair tag is added)
+- Suspended (a suspend tag is added)
+
+Needs-repair, repair allowed always
++++++++++++++++++++++++++++++++++++
+
+A ``repair:pending`` tag is added, and the instance transitions to the
+Pending Repair state. The autorepair tag is preserved.
+
+If an ``autorepair:suspend`` tag is found, no pending tag will be
+added, and the instance will instead transition to the Suspended state.
+
+Pending repair
+++++++++++++++
+
+When an instance is in this stage the following will happen:
+
+If an ``autorepair:suspend`` tag is found the instance won't be touched
+and will be moved to the Suspended state. Any jobs which were already
+running will be left untouched.
+
+If there are still jobs running related to the instance and scheduled by
+this repair they will be given more time to run, and the instance will
+be checked again later.  The state transitions to itself.
+
+If no jobs are running and the instance is detected to be healthy, the
+``repair:result`` tag will be added, and the current active
+``repair:pending`` tag will be removed. It will then transition to the
+Healthy state if there are no ``repair:pending`` tags, or to the Pending
+state otherwise: there, the instance being healthy, those tags will be
+resolved without any operation as well (note that this is the same as
+transitioning to the Healthy state, where ``repair:pending`` tags would
+also be resolved).
+
+If no jobs are running and the instance still has issues:
+
+- if the last job(s) failed it can either be retried a few times, if
+  deemed to be safe, or the repair can transition to the Failed state.
+  The ``repair:result`` tag will be added, and the active
+  ``repair:pending`` tag will be removed (further ``repair:pending``
+  tags will not be able to proceed, as explained by the Failed state,
+  until the failure state is cleared).
+- if the last job(s) succeeded but there are not enough resources to
+  proceed, the state will transition to itself and no jobs are
+  scheduled. The tag is left untouched (and later checked again). This
+  basically just delays any repairs: the current ``pending`` tag stays
+  active, and any others are untouched.
+- if the last job(s) succeeded but the repair type does not allow
+  proceeding any further, the ``repair:result`` tag is added with an
+  ``enoperm`` result, and the current ``repair:pending`` tag is removed.
+  The instance is now back to "Needs-repair, repair disallowed",
+  "Needs-repair, repair allowed always", or "Pending repair" if there is
+  already a future tag that can repair the instance.
+- if the last job(s) succeeded and the repair can continue, new job(s)
+  can be submitted, and the ``repair:pending`` tag can be updated.
+
+Failed
+++++++
+
+If repairing an instance has failed a ``repair:result`` tag with a
+``failure`` result is added. The presence of this tag is used to detect
+that an instance is in this state, and it will not be touched until the
+failure is investigated and the tag is removed.
+
+An external tool or person needs to investigate the state of the
+instance and remove this tag once they are sure the instance is repaired
+and safe to hand back to the normal autorepair system.
+
+(Alternatively we can use the suspended state (indefinitely or
+temporarily) to mark the instance as "do not touch" when we think a
+human needs to look at it. To be decided).
+
+Repair operation
+----------------
+
+Possible repairs are:
+
+- Replace-disks (drbd, if the secondary is down), or other
+  storage-specific fixes
+- Migrate (shared storage, rbd, drbd, if the primary is drained)
+- Failover (shared storage, rbd, drbd, if the primary is down)
+- Recreate disks + reinstall (all nodes down, plain, files or drbd)
+
+Note that more than one of these operations may need to happen before a
+full repair is completed (eg. if a drbd primary goes offline, first a
+failover will happen, then a replace-disks).
+
+The self-repair tool will first take care of all needs-repair instances
+that can be brought into ``pending`` state, and transition them as
+described above.
+
+Then it will go through any ``repair:pending`` instances and handle them
+as described above.
+
+Note that the repair tool MAY "group" instances by performing common
+repair jobs for them (eg: node evacuate).
+
+Staging of work
+---------------
+
+- First version: recreate-disks + reinstall (2.6.1)
+- Second version: failover and migrate repairs (2.7)
+- Third version: replace disks repair (2.7 or 2.8)
+
+Future work
+===========
+
+One important piece of work will be reporting what the autorepair system
+is "thinking" and exporting this in a form that can be read by an
+outside user or system. In order to do this we need a better
+communication system than embedding this information into tags. This
+should be designed in an extensible way that can be used in general for
+Ganeti to provide "advisory" information about entities it manages, and
+for an external system to "advise" ganeti over what it can do, but in a
+less direct manner than submitting individual jobs.
+
+Note that cluster verify checks some errors that are actually
+instance-specific (eg. a missing backend disk on a drbd node) or
+node-specific (eg. an extra lvm device). If we were to split these into
+"instance verify", "node verify" and "cluster verify", then we could
+easily use this tool to perform some of those repairs as well.
+
+Finally, self-repairs could also be extended to the cluster level (for
+example to handle "N+1 failures" or missing master candidates), or to
+the node level for some specific types of errors.
+
+.. vim: set textwidth=72 :
+.. Local Variables:
+.. mode: rst
+.. fill-column: 72
+.. End:
diff --git a/doc/design-draft.rst b/doc/design-draft.rst
index 629a386c2..33d5bac02 100644
--- a/doc/design-draft.rst
+++ b/doc/design-draft.rst
@@ -15,6 +15,7 @@ Design document drafts
    design-resource-model.rst
    design-virtual-clusters.rst
    design-query-splitting.rst
+   design-autorepair.rst
 
 .. vim: set textwidth=72 :
 .. Local Variables:
-- 
GitLab