Dato Simó authored (commit 6d675203)
Commas are not valid characters in tags, hence they can't be used to separate the different job IDs; plus signs (+) are available, and not too bad.
Signed-off-by: Dato Simó <dato@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
Instance auto-repair
This is a design document detailing the implementation of self-repair and recreation of instances in Ganeti. It also discusses ideas that might be useful for further self-repair situations in the future.
Current state and shortcomings
Ganeti currently doesn't do any sort of self-repair or self-recreation of instances:
- If a drbd instance is broken (its primary or secondary node goes offline or needs to be drained), an admin or an external tool must fail it over if necessary, and then trigger a disk replacement.
- If a plain instance is broken (or both nodes of a drbd instance are), an admin or an external tool must recreate its disk and reinstall it.
Moreover, in an oversubscribed cluster the operations mentioned above might fail for lack of capacity until a node is repaired or a new one is added. In this case an external tool would also need to go through any "pending-recreate" or "pending-repair" instances and fix them.
Proposed changes
We'd like to increase the self-repair capabilities of Ganeti, at least with regard to instances. In order to do so we plan to add mechanisms to mark an instance as "due for being repaired", and then to have the relevant repair performed as soon as it is possible on the cluster.
The self-repair will be written as part of ganeti-watcher or as an extra watcher component that is called less often.
As the first version we'll only handle the case in which an instance lives on an offline or drained node. In the future we may add more self-repair capabilities for errors Ganeti can detect.
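As a rough illustration of the first-version scope, the sketch below maps an instance to the manual operation described under "Current state and shortcomings", based only on whether its nodes are offline or drained. It is a minimal sketch, not Ganeti code: the `Node`, `Instance`, and `Action` types and the `needed_action` function are hypothetical names introduced here, and treating a drained node the same way as an offline one is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    """Hypothetical labels for the manual operations described above."""
    NONE = "none"
    REPLACE_DISKS = "replace-disks"
    FAILOVER_AND_REPLACE_DISKS = "failover+replace-disks"
    RECREATE_AND_REINSTALL = "recreate+reinstall"


@dataclass(frozen=True)
class Node:
    """Hypothetical, simplified view of a node's state."""
    name: str
    offline: bool = False
    drained: bool = False

    @property
    def unusable(self) -> bool:
        # Assumption: offline and drained nodes are handled alike.
        return self.offline or self.drained


@dataclass(frozen=True)
class Instance:
    """Hypothetical, simplified view of an instance."""
    name: str
    disk_template: str  # "drbd" or "plain"
    primary: Node
    secondary: Optional[Node] = None  # only set for drbd instances


def needed_action(inst: Instance) -> Action:
    """Return the operation an admin would currently run by hand."""
    primary_bad = inst.primary.unusable
    secondary_bad = inst.secondary is not None and inst.secondary.unusable

    if not primary_bad and not secondary_bad:
        return Action.NONE

    if inst.disk_template == "drbd" and inst.secondary is not None:
        if primary_bad and secondary_bad:
            # Both nodes of a drbd instance are broken.
            return Action.RECREATE_AND_REINSTALL
        if primary_bad:
            # Fail over to the healthy secondary, then replace disks.
            return Action.FAILOVER_AND_REPLACE_DISKS
        # Only the secondary is affected: no failover is necessary.
        return Action.REPLACE_DISKS

    # A plain instance on a broken node.
    return Action.RECREATE_AND_REINSTALL


if __name__ == "__main__":
    healthy = Node("node1")
    dead = Node("node2", offline=True)
    inst = Instance("web1", "drbd", primary=dead, secondary=healthy)
    print(inst.name, needed_action(inst).value)
```

In an oversubscribed cluster any of these operations might still fail for lack of capacity, so a real tool would also need to remember the pending action and retry it later.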
New attributes (or tags)
In order to know when to perform a self-repair operation we need to know whether it is allowed by the cluster administrator.
This can be implemented as either new attributes or tags. Tags could be acceptable as they would only be read and interpreted by the self-repair tool (part of the watcher), and not by the Ganeti core opcodes and node RPCs. The following tags would be needed:
ganeti:watcher:autorepair:<type>
(instance/nodegroup/cluster) Allow repairs to happen on an instance that has the tag, or that lives in a cluster or nodegroup which does. Repair types are listed in order of perceived risk, from lowest to highest, and each type also allows the operations of the lower-risk types: