From 92921ea4ec3cbcbab7267491ade2153c0faa73b6 Mon Sep 17 00:00:00 2001
From: Iustin Pop <iustin@google.com>
Date: Fri, 10 Sep 2010 17:10:26 +0200
Subject: [PATCH] Add design for htools/Ganeti 2.3 sync

This is a work in progress, will be modified along with the progress
of Ganeti 2.3.
---
 doc/design-ganeti-2.3.rst | 320 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 320 insertions(+)
 create mode 100644 doc/design-ganeti-2.3.rst

diff --git a/doc/design-ganeti-2.3.rst b/doc/design-ganeti-2.3.rst
new file mode 100644
index 000000000..a6beabd44
--- /dev/null
+++ b/doc/design-ganeti-2.3.rst
@@ -0,0 +1,320 @@

====================================
 Synchronising htools to Ganeti 2.3
====================================

Ganeti 2.3 introduces a number of new features that change the cluster
internals significantly enough that the htools suite needs to be
updated accordingly in order to function correctly.

Shared storage support
======================

Currently, the htools algorithms presume a model where all of an
instance's resources are served from within the cluster, more
specifically from the nodes comprising the cluster. While this is
usual for memory and CPU, deployments which use shared storage
invalidate this assumption for storage.

To account for this, we need to move some assumptions from being
implicit (and hardcoded) to being explicitly exported from Ganeti.


New instance parameters
-----------------------

It is presumed that Ganeti will export for all instances a new
``storage_type`` parameter, which will denote either internal storage
(e.g. *plain* or *drbd*) or external storage.

Furthermore, a new ``storage_pool`` parameter will classify, for both
internal and external storage, the pool out of which the storage is
allocated. For internal storage, this will be either ``lvm`` (the pool
that provides space to both ``plain`` and ``drbd`` instances) or
``file`` (for file-storage-based instances). For external storage,
this will be the respective NAS/SAN/cloud storage that backs the
instance. Note that for htools, external storage pools are opaque: we
only care that they have an identifier, so that we can distinguish
between two different pools.

If these two parameters are not present, the instance will be
presumed to be ``internal/lvm``.

New node parameters
-------------------

For each node, it is expected that Ganeti will export which storage
types it supports and which pools it has access to. So a classic 2.2
cluster will have all nodes supporting ``internal/lvm`` and/or
``internal/file``, whereas a new shared-storage-only 2.3 cluster could
have ``external/my-nas`` storage.

Whatever mechanism Ganeti uses internally to configure the
associations between nodes and storage pools, we assume that two node
attributes will be available inside htools: the lists of internal and
external storage pools.
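As an illustration only (the real htools code is written in Haskell,
and all names below are hypothetical), the following Python sketch
models the two new instance parameters with their ``internal/lvm``
default, the two per-node pool lists, and the resulting check of which
nodes can access an instance's storage:

.. code-block:: python

   from dataclasses import dataclass, field

   # Assumed storage type values: "internal" covers plain/drbd/file,
   # "external" covers shared (NAS/SAN/cloud) storage pools.
   INTERNAL = "internal"
   EXTERNAL = "external"

   @dataclass
   class Instance:
       name: str
       # Per the design: missing parameters default to internal/lvm.
       storage_type: str = INTERNAL
       storage_pool: str = "lvm"

   @dataclass
   class Node:
       name: str
       # The two node attributes the design expects htools to receive.
       internal_pools: set = field(default_factory=set)  # e.g. {"lvm", "file"}
       external_pools: set = field(default_factory=set)  # e.g. {"my-nas"}

       def can_access(self, inst):
           """True if this node has access to the instance's storage pool."""
           pools = (self.internal_pools if inst.storage_type == INTERNAL
                    else self.external_pools)
           return inst.storage_pool in pools

   def potential_secondaries(inst, nodes, primary):
       """All nodes sharing the instance's storage pool, except the primary."""
       return [n for n in nodes
               if n.name != primary.name and n.can_access(inst)]

   # Example: a shared-storage instance can use any node attached to "my-nas".
   if __name__ == "__main__":
       nodes = [Node("node1", {"lvm"}, {"my-nas"}),
                Node("node2", set(), {"my-nas"}),
                Node("node3", {"lvm", "file"}, set())]
       inst = Instance("web1", EXTERNAL, "my-nas")
       print([n.name for n in potential_secondaries(inst, nodes, nodes[0])])
       # -> ['node2']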
External storage and instances
------------------------------

Currently, for an instance we allow one cheap move type: failover to
the current secondary, if it is a healthy node, and four other
"expensive" (as in, including data copies) moves that involve changing
either the secondary or the primary node or both.

In the presence of an external storage type, the following things will
change:

- the disk-based moves will be disallowed; this is already a feature
  in the algorithm, controlled by a boolean switch, so adapting it for
  external storage will be trivial
- instead of the current single secondary node, the secondaries will
  become a list of potential secondaries, based on access to the
  instance's storage pool

Except for this, the basic move algorithm remains unchanged.

External storage and nodes
--------------------------

Two separate areas will have to change for nodes and external storage.

First, when allocating instances (either as part of a move or a new
allocation), if the instance is using external storage, then the
internal disk metrics should be ignored (for both the primary and
secondary cases).

Second, the per-node metrics used in the cluster scoring must take
into account that nodes might not have internal storage at all, and
handle this as a well-balanced case (score 0).

N+1 status
----------

Currently, computing the N+1 status of a node is simple:

- group the current secondary instances by their primary node, and
  compute the sum of each group's memory
- take the maximum such sum, and check whether it is smaller than the
  memory currently available on this node

In effect, computing the N+1 status is a per-node matter. However,
with shared storage we don't have fixed secondary nodes, just
potential secondaries. Thus computing the N+1 status becomes a
cluster-level matter, and much more expensive.

A simple version of the N+1 check would be to verify that, for each
instance having the node in question as primary, there is enough
memory in the cluster for relocation. This means we would actually
need to run allocation checks, and update the cluster status from
within the allocation for one node, while being careful not to
recursively check the N+1 status during this relocation, which would
be too expensive.

However, the shared storage model has some properties that change the
rules of the computation. Broadly speaking (and ignoring hard
restrictions like tag-based exclusion and CPU limits), the exact
location of an instance in the cluster doesn't matter as long as
memory is available. This results in two changes:

- simply tracking the amount of free memory, cluster-wide, is enough
- moving an instance from one node to another would not change the N+1
  status of any node, and only allocation needs to deal with N+1
  checks

Unfortunately, this very cheap solution fails in the presence of any
other exclusion or prevention factors.

TODO: find a solution for N+1 checks.
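For reference, the current per-node check described at the start of
this section can be sketched as follows (illustrative Python with
assumed field names, not the actual Haskell htools code):

.. code-block:: python

   from collections import namedtuple

   Node = namedtuple("Node", "name free_mem")
   Inst = namedtuple("Inst", "name memory primary secondary")

   def passes_n1(node, instances):
       """Per-node N+1 check: group the instances having this node as
       secondary by their primary node, and verify that the largest
       such group still fits in this node's free memory, i.e. that we
       survive the failure of any single primary."""
       per_primary = {}
       for inst in instances:
           if inst.secondary == node.name:
               per_primary[inst.primary] = \
                   per_primary.get(inst.primary, 0) + inst.memory
       worst = max(per_primary.values(), default=0)
       return worst <= node.free_mem

   # Example: node3 is secondary for 1536 MiB worth of instances from
   # node1 and 2048 MiB from node2; with 3000 MiB free it passes N+1.
   insts = [Inst("i1", 1024, "node1", "node3"),
            Inst("i2", 512, "node1", "node3"),
            Inst("i3", 2048, "node2", "node3")]
   print(passes_n1(Node("node3", 3000), insts))  # True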
Node groups support
===================

The addition of node groups has a small impact on the actual
algorithms, which will simply operate at node group level instead of
cluster level, but it requires the addition of new algorithms for
inter-node-group operations.

The following definitions will be used in the paragraphs below:

local group
  The local group refers to a node's own node group, or when speaking
  about an instance, the node group of its primary node

regular cluster
  A cluster composed of a single node group, i.e. a pre-2.3 cluster

super cluster
  This term refers to a cluster which comprises multiple node groups,
  as opposed to a 2.2 and earlier cluster with a single node group

In all the operations below, it is assumed that Ganeti can gather the
entire super cluster state cheaply.


Balancing changes
-----------------

Balancing will move from cluster-level balancing to group
balancing. In order to achieve a reasonable improvement in a super
cluster, without needing to keep state about which groups have already
been balanced, the balancing algorithm will run as follows:

#. the cluster data is gathered
#. if this is a regular cluster, as opposed to a super cluster,
   balancing will proceed normally, as before
#. otherwise, compute the cluster scores for all groups
#. choose the group with the worst score and see if we can improve it;
   if not, choose the next-worst group, and so on
#. once a group has been identified, run the balancing for it

Of course, explicit selection of a group will be allowed.

Super cluster operations
++++++++++++++++++++++++

Besides the regular group balancing, in a super cluster we have more
operations.


Redistribution
^^^^^^^^^^^^^^

In a regular cluster, once we run out of resources (offline nodes
which can't be fully evacuated, N+1 failures, etc.) there is nothing
we can do unless nodes are added or instances are removed.

In a super cluster, however, there might be resources available in
another group, so there is the possibility of relocating instances
between groups to re-establish N+1 success within each group.

One difficulty in the presence of both super clusters and shared
storage is that the move paths of instances are quite complicated;
basically an instance can move inside its local group, and to any
other group which has access to the same storage type and storage
pool pair. In effect, the super cluster is composed of multiple
"partitions", each containing one or more groups, but a node is
simultaneously present in multiple partitions, one for each storage
type and storage pool it supports. As such, the interactions between
the individual partitions are too complex for non-trivial clusters to
assume we can compute a perfect solution: we might need to move some
instances using shared storage pool "A" in order to free some more
memory to accept an instance using local storage, which would in turn
free more VCPUs in a third partition, and so on. We will therefore
limit ourselves to simple relocation steps within a single partition.

Algorithm (a sketch in code follows the list):

#. read super cluster data, and exit if the cluster doesn't allow
   inter-group moves
#. filter out any groups that are "alone" in their partition
   (i.e. no other group shares at least one storage method with them)
#. determine the list of healthy versus unhealthy groups:

   #. a group which contains offline nodes still hosting instances is
      definitely not healthy
   #. a group which has nodes failing N+1 is "weakly" unhealthy

#. if either list is empty, exit (no work to do, or no way to fix the
   problems)
#. for each unhealthy group:

   #. compute the instances that are causing the problems: all
      instances living on offline nodes, all instances living as
      secondaries on N+1 failing nodes, all instances living as
      primaries on N+1 failing nodes (in this order)
   #. remove these instances, one by one, until the source group is
      healthy again
   #. try to run a standard allocation procedure for each instance on
      all potential groups in its partition
   #. if all instances were relocated successfully, it means we have a
      solution for repairing the original group
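A minimal Python sketch of these steps, assuming hypothetical group
objects and a ``try_allocate`` helper that stands in for the standard
allocation procedure (the real implementation would be part of the
Haskell htools code base):

.. code-block:: python

   def redistribute(groups, cluster_allows_intergroup_moves, try_allocate):
       """Illustrative version of the redistribution algorithm above.

       Assumptions: each group exposes storage_methods (a set of
       (storage type, storage pool) pairs), is_healthy(), and
       problem_instances() returning instances in the priority order
       given above; try_allocate(group, instance) returns True if the
       standard allocation procedure succeeds in that group."""
       if not cluster_allows_intergroup_moves:
           return {}

       def peers(group):
           # Other groups sharing at least one storage method, i.e. the
           # rest of the group's "partition".
           return [g for g in groups
                   if g is not group
                   and g.storage_methods & group.storage_methods]

       healthy = [g for g in groups if g.is_healthy()]
       unhealthy = [g for g in groups if not g.is_healthy() and peers(g)]
       if not healthy or not unhealthy:
           return {}  # nothing to do, or no way to fix the problems

       solutions = {}
       for group in unhealthy:
           moves = []
           for inst in group.problem_instances():
               target = next((g for g in peers(group)
                              if try_allocate(g, inst)), None)
               if target is None:
                   break  # this group cannot be fully repaired
               moves.append((inst, target))
           else:
               # All problem instances found a home: we have a solution
               # for repairing the original group.
               solutions[group] = moves
       return solutions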
Compression
^^^^^^^^^^^

In a super cluster which has had many instance reclamations, it is
possible that, while none of the groups is empty, overall there is
enough empty capacity that an entire group could be removed.

The algorithm for "compressing" the super cluster is as follows:

#. read super cluster data
#. compute the total *(memory, disk, cpu)* and the free *(memory,
   disk, cpu)* for the super cluster
#. compute the per-group used and free *(memory, disk, cpu)*
#. select candidate groups for evacuation:

   #. they must be connected to other groups via a common storage type
      and pool
   #. they must have fewer used resources than the global free
      resources (minus their own free resources)

#. for each of these groups, try to relocate all its instances to
   connected peer groups
#. report the list of groups that could be evacuated, or, if
   instructed to do so, perform the evacuation of the group with the
   largest free resources (i.e. in order to reclaim the most capacity)

Load balancing
^^^^^^^^^^^^^^

Assuming a super cluster using shared storage, where instance failover
is cheap, it should be possible to do load-based balancing across
groups.

As opposed to normal balancing, where we want to balance on all node
attributes, here we should look only at the load attributes; in other
words, compare the available (total) node capacity with the (total)
load generated by the instances in a given group, compute such scores
for all groups, and check whether there are any outliers.

Once a reliable load-weighting method for groups exists, we can apply
a modified version of the cluster scoring method to score not
imbalances across nodes, but imbalances across groups, resulting in a
super cluster load-related score.

Allocation changes
------------------

It is important to keep the allocation method across groups internal
(in the Ganeti/Iallocator combination), instead of delegating it to an
external party (e.g. a RAPI client). For this, the IAllocator protocol
should be extended to provide proper group support.

For htools, the new algorithm will work as follows:

#. read/receive cluster data from Ganeti
#. filter out any groups that do not support the requested storage
   method
#. for the remaining groups, try the allocation and compute the scores
   after allocation
#. sort the valid allocation solutions accordingly and return the
   entire list to Ganeti

The rationale for returning the entire group list, and not only the
best choice, is that we have the list anyway, and Ganeti might have
other criteria (e.g. the best group might be busy or locked down), so
even if a group is the best choice from a resource point of view, it
might not be the overall best one.
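The same flow, as an illustrative Python sketch (the ``request``
object and the two helper callables are assumptions; the real logic
belongs to the htools/IAllocator implementation):

.. code-block:: python

   def allocate_across_groups(groups, request, try_allocate, group_score):
       """Group-aware allocation: filter groups by storage method, try
       the allocation in each remaining group, score the results and
       return the whole sorted list (not just the best entry)."""
       # Keep only the groups that support the requested storage method.
       candidates = [g for g in groups
                     if request.storage_method in g.storage_methods]

       # Try the allocation in each candidate group and score the result.
       solutions = []
       for group in candidates:
           solution = try_allocate(group, request)
           if solution is not None:
               solutions.append((group_score(group), group, solution))

       # Ganeti receives the full sorted list, since it may have extra
       # criteria (e.g. a group being busy or locked down).
       return sorted(solutions, key=lambda item: item[0])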
Node evacuation changes
-----------------------

While the basic concept in the ``multi-evac`` iallocator mode remains
unchanged (it is a simple local group issue), when evacuation fails
and we are running in a super cluster, there could be resources
available elsewhere in the cluster for the evacuation.

The algorithm for computing this will be the same as the one for super
cluster compression and redistribution, except that the list of
instances is fixed to the ones living on the nodes to be evacuated.

If the inter-group relocation is successful, the result returned to
Ganeti will not be a local group evacuation target, but instead (for
each instance) a pair *(remote group, nodes)*. Ganeti itself will then
have to decide (based on user input) whether to continue with the
inter-group evacuation or not.

In case Ganeti doesn't provide complete cluster data, but just the
local group, the inter-group relocation won't be attempted.
--
GitLab