From 92921ea4ec3cbcbab7267491ade2153c0faa73b6 Mon Sep 17 00:00:00 2001
From: Iustin Pop <iustin@google.com>
Date: Fri, 10 Sep 2010 17:10:26 +0200
Subject: [PATCH] Add design for htools/Ganeti 2.3 sync

This is a work in progress; it will be modified along with the
progress of Ganeti 2.3.
---
 doc/design-ganeti-2.3.rst | 320 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 320 insertions(+)
 create mode 100644 doc/design-ganeti-2.3.rst

diff --git a/doc/design-ganeti-2.3.rst b/doc/design-ganeti-2.3.rst
new file mode 100644
index 000000000..a6beabd44
--- /dev/null
+++ b/doc/design-ganeti-2.3.rst
@@ -0,0 +1,320 @@
+====================================
+ Synchronising htools to Ganeti 2.3
+====================================
+
+Ganeti 2.3 introduces a number of new features that change the cluster
+internals significantly enough that the htools suite needs to be
+updated accordingly in order to function correctly.
+
+Shared storage support
+======================
+
+Currently, the htools algorithms presume a model where all of an
+instance's resources are served from within the cluster, more
+specifically from the nodes comprising the cluster. While this is
+usual for memory and CPU, deployments which use shared storage will
+invalidate this assumption for storage.
+
+To account for this, we need to move some assumptions from being
+implicit (and hardcoded) to being explicitly exported from Ganeti.
+
+
+New instance parameters
+-----------------------
+
+It is presumed that Ganeti will export for all instances a new
+``storage_type`` parameter that denotes either internal storage
+(e.g. *plain* or *drbd*) or external storage.
+
+Furthermore, a new ``storage_pool`` parameter will classify, for both
+internal and external storage, the pool out of which the storage is
+allocated. For internal storage, this will be either ``lvm`` (the pool
+that provides space to both ``plain`` and ``drbd`` instances) or
+``file`` (for file-storage-based instances). For external storage,
+this will be the respective NAS/SAN/cloud storage that backs the
+instance. Note that for htools, external storage pools are opaque; we
+only care that they have an identifier, so that we can distinguish
+between two different pools.
+
+If these two parameters are not present, the instances will be
+presumed to be ``internal/lvm``.
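+
+As a minimal sketch of how htools could model this (in Haskell,
+htools' implementation language; the type and field names here are
+illustrative, not actual htools definitions)::
+
+  data StorageType = Internal | External
+    deriving (Show, Eq, Ord)
+
+  -- Pools are opaque to htools; only their identity matters.
+  type StoragePool = String
+
+  data InstStorage = InstStorage
+    { stType :: StorageType
+    , stPool :: StoragePool
+    } deriving (Show, Eq)
+
+  -- Fallback when Ganeti doesn't export the two parameters.
+  defaultStorage :: InstStorage
+  defaultStorage = InstStorage Internal "lvm"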
+
+New node parameters
+-------------------
+
+For each node, it is expected that Ganeti will export which storage
+types it supports and which pools it has access to. So a classic 2.2
+cluster will have all nodes supporting ``internal/lvm`` and/or
+``internal/file``, whereas a new shared-storage-only 2.3 cluster could
+have ``external/my-nas`` storage.
+
+Whatever mechanism Ganeti uses internally to configure the
+associations between nodes and storage pools, we assume that two node
+attributes will be available inside htools: the lists of internal and
+external storage pools.
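+
+Continuing the illustrative sketch from above, the node side would
+then carry just the two pool lists::
+
+  data NodeStorage = NodeStorage
+    { nsInternal :: [StoragePool]  -- e.g. ["lvm", "file"]
+    , nsExternal :: [StoragePool]  -- e.g. ["my-nas"]
+    } deriving (Show, Eq)
+
+  -- A node can host an instance's disks iff it has access to the
+  -- instance's (storage type, pool) pair.
+  hasPool :: NodeStorage -> InstStorage -> Bool
+  hasPool ns (InstStorage Internal p) = p `elem` nsInternal ns
+  hasPool ns (InstStorage External p) = p `elem` nsExternal ns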
+
+External storage and instances
+------------------------------
+
+Currently, for an instance we allow one cheap move type (failover to
+the current secondary, if it is a healthy node) and four other
+“expensive” (as in, including data copies) moves that involve
+changing either the secondary or the primary node, or both.
+
+In the presence of an external storage type, the following things
+will change:
+
+- the disk-based moves will be disallowed; this is already a feature
+  in the algorithm, controlled by a boolean switch, so adapting to
+  external storage here will be trivial
+- instead of the current one secondary node, the secondaries will
+  become a list of potential secondaries, based on access to the
+  instance's storage pool
+
+Except for this, the basic move algorithm remains unchanged.
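+
+For illustration, reusing the hypothetical ``hasPool`` helper from
+the sketches above, the potential secondaries are a simple filter
+over the node list::
+
+  -- Every node other than the primary that can access the
+  -- instance's pool is a potential secondary; with internal storage
+  -- this degenerates to the single configured secondary node.
+  potentialSecondaries :: InstStorage -> Int -> [(Int, NodeStorage)]
+                       -> [Int]
+  potentialSecondaries st pri nodes =
+    [ idx | (idx, ns) <- nodes, idx /= pri, ns `hasPool` st ]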
+
+External storage and nodes
+--------------------------
+
+Two separate areas will have to change for nodes and external storage.
+
+First, when allocating instances (either as part of a move or a new
+allocation), if the instance is using external storage, then the
+internal disk metrics should be ignored (for both the primary and
+secondary cases).
+
+Second, the per-node metrics used in the cluster scoring must take
+into account that nodes might not have internal storage at all, and
+handle this as a well-balanced case (score 0).
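+
+A sketch of how the disk metric could be guarded; the
+standard-deviation scoring mimics htools' coefficient-of-variation
+approach, but the names are invented for this example::
+
+  -- Standard deviation, as used in the per-metric cluster scoring.
+  stdDev :: [Double] -> Double
+  stdDev xs = sqrt (sum [ (x - mean) ** 2 | x <- xs ] / n)
+    where n = fromIntegral (length xs)
+          mean = sum xs / n
+
+  -- Nodes without internal storage are skipped rather than producing
+  -- a meaningless 0/0 ratio; if no node has internal storage at all,
+  -- the disk dimension is perfectly balanced (score 0).
+  diskScore :: [(Double, Double)] -> Double  -- (free, total) per node
+  diskScore nodes =
+    case [ f / t | (f, t) <- nodes, t > 0 ] of
+      [] -> 0
+      fs -> stdDev fs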
+
+N+1 status
+----------
+
+Currently, computing the N+1 status of a node is simple:
+
+- group the current secondary instances by their primary node, and
+  compute the memory sum of each such group
+- choose the maximum sum, and check if it's smaller than the current
+  available memory on this node
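+
+A sketch of this per-node check (memory only, illustrative names)::
+
+  import qualified Data.Map as M
+
+  -- For each instance having this node as secondary we know its
+  -- primary node and memory; the node passes N+1 iff the largest
+  -- per-primary memory sum fits into the node's free memory, i.e. a
+  -- full failover from any single primary can be absorbed.
+  passesN1 :: [(Int, Int)]  -- (primary node, instance memory)
+           -> Int           -- free memory on this node
+           -> Bool
+  passesN1 secondaries freeMem =
+    all (<= freeMem) (M.elems (M.fromListWith (+) secondaries))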
+
+In effect, computing the N+1 status is a per-node matter. However,
+with shared storage, we don't have secondary nodes, just potential
+secondaries. Thus computing the N+1 status will be a cluster-level
+matter, and much more expensive.
+
+A simple version of the N+1 check would be to verify that, for each
+instance having a given node as primary, there is enough memory in the
+cluster for relocation. This means we would actually need to run
+allocation checks, and update the cluster status from within
+allocation on one node, while being careful not to recursively check
+the N+1 status during this relocation, which would be too expensive.
+
+However, the shared storage model has some properties that change the
+rules of the computation. Speaking broadly (and ignoring hard
+restrictions like tag based exclusion and CPU limits), the exact
+location of an instance in the cluster doesn't matter as long as
+memory is available. This results in two changes:
+
+- simply tracking the amount of free memory, cluster-wide, is enough
+  (see the sketch below)
+- moving an instance from one node to another would not change the N+1
+  status of any node, and only allocation needs to deal with N+1
+  checks
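+
+A sketch of this cheap cluster-wide variant, valid only under the
+simplifications above (names invented)::
+
+  -- With shared storage and no other placement restrictions, the
+  -- cluster passes N+1 iff the instance memory of any single node
+  -- can be re-hosted using the free memory of the other nodes.
+  clusterPassesN1 :: [(Int, Int)]  -- per node: (used, free) memory
+                  -> Bool
+  clusterPassesN1 nodes =
+    let totalFree = sum (map snd nodes)
+    in all (\(used, free) -> used <= totalFree - free) nodes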
+
+Unfortunately, this very cheap solution fails in the presence of any
+other exclusion or prevention factors.
+
+TODO: find a solution for N+1 checks.
+
+
+Node groups support
+===================
+
+The addition of node groups has a small impact on the actual
+algorithms, which will simply operate at node group level instead of
+cluster level, but it requires the addition of new algorithms for
+inter-node group operations.
+
+The following definitions will be used in the paragraphs below:
+
+local group
+  The local group refers to a node's own node group, or when speaking
+  about an instance, the node group of its primary node
+
+regular cluster
+  A cluster composed of a single node group, or a pre-2.3 cluster
+
+super cluster
+  This term refers to a cluster which comprises multiple node groups,
+  as opposed to a 2.2-or-earlier cluster with a single node group
+
+In all the below operations, it's assumed that Ganeti can gather the
+entire super cluster state cheaply.
+
+
+Balancing changes
+-----------------
+
+Balancing will move from cluster-level balancing to group
+balancing. In order to achieve a reasonable improvement in a super
+cluster, without needing to keep state about which groups have
+already been balanced, the balancing algorithm will run as follows:
+
+#. the cluster data is gathered
+#. if this is a regular cluster, as opposed to a super cluster,
+   balancing proceeds as before
+#. otherwise, compute the cluster scores for all groups
+#. choose the group with the worst score and see if we can improve it;
+   if not, choose the next-worst group, and so on (see the sketch
+   below)
+#. once a group has been identified, run the balancing for it
+
+Of course, explicit selection of a group will be allowed.
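+
+A sketch of the group-selection loop (the balancing step itself is
+opaque here; names are illustrative)::
+
+  import Data.List (sortBy)
+  import Data.Maybe (listToMaybe, mapMaybe)
+  import Data.Ord (Down (..), comparing)
+
+  -- Try groups in worst-score-first order and return the first one
+  -- for which a balancing attempt finds an improvement.
+  selectGroup :: (g -> Maybe g)  -- one balancing attempt
+              -> [(g, Double)]   -- (group, cluster score)
+              -> Maybe g
+  selectGroup tryBalance groups =
+    listToMaybe (mapMaybe (tryBalance . fst)
+                          (sortBy (comparing (Down . snd)) groups))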
+
+Super cluster operations
+++++++++++++++++++++++++
+
+Besides the regular group balancing, in a super cluster we have more
+operations.
+
+
+Redistribution
+^^^^^^^^^^^^^^
+
+In a regular cluster, once we run out of resources (offline nodes
+which can't be fully evacuated, N+1 failures, etc.) there is nothing
+we can do unless nodes are added or instances are removed.
+
+In a super cluster however, there might be resources available in
+another group, so there is the possibility of relocating instances
+between groups to re-establish N+1 success within each group.
+
+One difficulty in the presence of both super clusters and shared
+storage is that the move paths of instances are quite complicated;
+basically an instance can move inside its local group, and to any
+other groups which have access to the same storage type and storage
+pool pair. In effect, the super cluster is composed of multiple
+‘partitions’, each containing one or more groups, but a node is
+simultaneously present in multiple partitions, one for each storage
+type and storage pool it supports. As such, the interactions between
+the individual partitions are too complex for non-trivial clusters to
+assume we can compute a perfect solution: we might need to move some
+instances using shared storage pool ‘A’ in order to clear some more
+memory to accept an instance using local storage, which will further
+clear more VCPUs in a third partition, etc. As such, we'll limit
+ourselves to simple relocation steps within a single partition.
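+
+To make the partition notion concrete, here is a sketch (reusing the
+illustrative ``StorageType``/``StoragePool`` types from earlier) that
+groups node groups by the storage pairs they can access::
+
+  import qualified Data.Map as M
+
+  type PoolKey = (StorageType, StoragePool)
+
+  -- Each (storage type, pool) pair names a partition; a group shows
+  -- up in one partition per pair it can access, so partitions
+  -- overlap.
+  partitions :: [(String, [PoolKey])]  -- (group, accessible pairs)
+             -> M.Map PoolKey [String]
+  partitions groups =
+    M.fromListWith (++) [ (key, [grp]) | (grp, keys) <- groups
+                                       , key <- keys ]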
+
+Algorithm:
+
+#. read the super cluster data, and exit if the cluster doesn't allow
+   inter-group moves
+#. filter out any groups that are “alone” in their partition
+   (i.e. they share no storage method with any other group)
+#. determine the list of healthy versus unhealthy groups:
+
+  #. a group which contains offline nodes still hosting instances is
+     definitely not healthy
+  #. a group which has nodes failing N+1 is ‘weakly’ unhealthy
+
+#. if either list is empty, exit (no work to do, or no way to fix problems)
+#. for each unhealthy group:
+
+  #. compute the instances that are causing the problems: all
+     instances living on offline nodes, all instances living as
+     secondaries on N+1 failing nodes, all instances living as
+     primaries on N+1 failing nodes (in this order, as sketched
+     below)
+  #. remove instances, one by one, until the source group is healthy
+     again
+  #. try to run a standard allocation procedure for each instance on
+     all potential groups in its partition
+  #. if all instances were relocated successfully, it means we have a
+     solution for repairing the original group
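+
+A sketch of the instance-selection step, encoding just the priority
+order from above (the predicates themselves are opaque)::
+
+  -- Problem instances in priority order: on offline nodes first,
+  -- then secondaries on N+1 failing nodes, then primaries on such
+  -- nodes; each instance is listed at most once.
+  problemInstances :: (i -> Bool)  -- lives on an offline node?
+                   -> (i -> Bool)  -- secondary on a failing node?
+                   -> (i -> Bool)  -- primary on a failing node?
+                   -> [i] -> [i]
+  problemInstances offline sec pri insts =
+    filter offline insts
+    ++ filter (\i -> not (offline i) && sec i) insts
+    ++ filter (\i -> not (offline i) && not (sec i) && pri i) insts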
+
+Compression
+^^^^^^^^^^^
+
+In a super cluster which has had many instance reclamations, it is
+possible that while none of the groups is empty, overall there is
+enough empty capacity that an entire group could be removed.
+
+The algorithm for β€œcompressing” the super cluster is as follows:
+
+#. read super cluster data
+#. compute total *(memory, disk, cpu)*, and free *(memory, disk, cpu)*
+   for the super cluster
+#. compute per-group used and free *(memory, disk, cpu)*
+#. select candidate groups for evacuation:
+
+  #. they must be connected to other groups via a common storage type
+     and pool
+  #. they must have fewer used resources than the global free
+     resources (minus their own free resources), as sketched below
+
+#. for each of these groups, try to relocate all its instances to
+   connected peer groups
+#. report the list of groups that could be evacuated, or, if so
+   instructed, perform the evacuation of the group with the largest
+   free resources (i.e. in order to reclaim the most capacity)
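+
+A sketch of the candidate-selection step (resource triples and names
+are illustrative; the common-storage connectivity check is elided)::
+
+  type Res = (Double, Double, Double)  -- (memory, disk, cpu)
+
+  fitsIn :: Res -> Res -> Bool
+  fitsIn (m, d, c) (m', d', c') = m <= m' && d <= d' && c <= c'
+
+  minus :: Res -> Res -> Res
+  minus (m, d, c) (m', d', c') = (m - m', d - d', c - c')
+
+  -- A group qualifies iff its used resources fit into the free
+  -- resources of the rest of the super cluster (global free minus
+  -- its own free share).
+  candidates :: Res -> [(String, Res, Res)] -> [String]
+  candidates globalFree groups =
+    [ g | (g, used, free) <- groups
+        , used `fitsIn` (globalFree `minus` free) ]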
+
+Load balancing
+^^^^^^^^^^^^^^
+
+Assuming a super cluster using shared storage, where instance failover
+is cheap, it should be possible to do a load-based balancing across
+groups.
+
+As opposed to the normal balancing, where we want to balance on all
+node attributes, here we should look only at the load attributes; in
+other words, compare the available (total) node capacity with the
+(total) load generated by instances in a given group, compute such
+scores for all groups, and then look for any outliers.
+
+Once a reliable load-weighting method for groups exists, we can apply
+a modified version of the cluster scoring method to score not
+imbalances across nodes, but imbalances across groups, resulting in a
+load-related score for the super cluster.
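+
+A sketch of such a group-level score, reusing the illustrative
+``stdDev`` helper from earlier: each group contributes its
+load-to-capacity ratio, and the spread of these ratios exposes
+outliers::
+
+  -- (total instance load, total node capacity) per group; a low
+  -- score means the groups carry proportionally similar loads.
+  groupLoadScore :: [(Double, Double)] -> Double
+  groupLoadScore groups =
+    case [ load / cap | (load, cap) <- groups, cap > 0 ] of
+      []     -> 0
+      ratios -> stdDev ratios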
+
+Allocation changes
+------------------
+
+It is important to keep the allocation method across groups internal
+(in the Ganeti/IAllocator combination), instead of delegating it to an
+external party (e.g. a RAPI client). For this, the IAllocator protocol
+should be extended to provide proper group support.
+
+For htools, the new algorithm will work as follows:
+
+#. read/receive cluster data from Ganeti
+#. filter out any groups that do not support the requested storage
+   method
+#. for remaining groups, try allocation and compute scores after
+   allocation
+#. sort valid allocation solutions accordingly and return the entire
+   list to Ganeti
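+
+A sketch of this allocation loop (the per-group allocation itself is
+opaque; names are illustrative)::
+
+  import Data.List (sortBy)
+  import Data.Ord (comparing)
+
+  -- Try the request in every group that supports the requested
+  -- storage method and return all valid solutions, best (lowest
+  -- post-allocation score) first.
+  allocateAcross :: (g -> Bool)              -- storage supported?
+                 -> (g -> Maybe (g, Double)) -- allocate + new score
+                 -> [g] -> [(g, Double)]
+  allocateAcross supports allocate groups =
+    sortBy (comparing snd)
+           [ sol | g <- groups, supports g, Just sol <- [allocate g] ]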
+
+The rationale for returning the entire group list, and not only the
+best choice, is that we have the list anyway, and Ganeti might have
+other criteria (e.g. the best group might be busy/locked down, etc.),
+so even if a group is the best choice from the point of view of
+resources, it might not be the overall best one.
+
+Node evacuation changes
+-----------------------
+
+While the basic concept of the ``multi-evac`` iallocator mode remains
+unchanged (it's a simple local group issue), when evacuation fails in
+a super cluster we could have resources available elsewhere in the
+cluster for evacuation.
+
+The algorithm for computing this will be the same as the one for super
+cluster compression and redistribution, except that the list of
+instances is fixed to the ones living on the nodes to be evacuated.
+
+If the inter-group relocation is successful, the result to Ganeti will
+not be a local group evacuation target, but instead (for each
+instance) a pair *(remote group, nodes)*. Ganeti itself will have to
+decide (based on user input) whether to continue with inter-group
+evacuation or not.
+
+If Ganeti doesn't provide complete cluster data, but just the local
+group, the inter-group relocation won't be attempted.
-- 
GitLab