=============
HSqueeze tool
=============

.. contents:: :depth: 4

This is a design document detailing the node-freeing scheduler, HSqueeze.


Current state and shortcomings
==============================

Externally-mirrored instances can be moved between nodes at low
cost. Therefore, it is attractive to free up nodes and power them down
at times of low usage, even for small periods of time, like nights or
weekends.

Currently, the best way to find a suitable set of nodes to shut down
is to exploit the fact that our balancedness metric moves instances
away from drained nodes. One would manually drain more and more
nodes and check whether `hbal` can find a solution that frees up all
those drained nodes.


Proposed changes
================

We propose the addition of a new htool command-line tool, called
`hsqueeze`, that aims at keeping resource usage at a constant high
level by evacuating and powering down nodes, or powering up nodes and
rebalancing, as appropriate. By default, only externally-mirrored
instances are moved, but options are provided to additionally take
DRBD instances (which can be moved without downtime), or even all
instances into consideration.

Tagging of standby nodes
------------------------

Powering down nodes that are technically healthy effectively creates a
new node state: nodes on standby. To avoid further state
proliferation, and because only `hsqueeze` uses this information,
it is recorded in node tags. `hsqueeze` will assume
that offline nodes having a tag with prefix `htools:standby:` can
easily be powered on at any time.
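
As a minimal sketch of this convention, a node can be classified as
standby if it is offline and carries at least one tag with the reserved
prefix. The `Node` record and the `is_standby` helper are hypothetical
names used for illustration only; they are not part of htools.

.. code-block:: python

   from dataclasses import dataclass, field

   # Tag prefix reserved for standby nodes, as described above.
   STANDBY_PREFIX = "htools:standby:"

   @dataclass
   class Node:
       """Hypothetical, simplified view of a node."""
       name: str
       offline: bool
       tags: set = field(default_factory=set)

   def is_standby(node):
       """True if the node is offline and has a tag with the standby prefix."""
       return node.offline and any(
           tag.startswith(STANDBY_PREFIX) for tag in node.tags
       )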

Minimum available resources
---------------------------

To keep the squeezed cluster functional, a minimal amount of resources
will be left available on every node. While the precise amount will
be specifiable via command-line options, a sensible default is chosen,
such as enough resources to start an additional instance at standard
allocation on each node. If the available resources fall below this
limit, `hsqueeze` will instead try to power on more nodes until
enough resources are available or all standby nodes are online.

To avoid flapping behavior, a second, higher amount of reserve
resources can be specified; `hsqueeze` will only power down nodes
if this higher amount of reserve resources is still available after
the power-down.
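
The interplay of the two limits can be sketched as follows. This is an
illustration under simplifying assumptions: resources are collapsed into
a single number, and the function name `decide` and its parameters are
hypothetical, not an actual `hsqueeze` interface.

.. code-block:: python

   def decide(free_now, free_after_power_down, low_reserve, high_reserve):
       """Pick an action from the two reserve limits (illustrative only).

       free_now: resources currently available.
       free_after_power_down: resources that would remain available
           after a candidate power-down.
       """
       if low_reserve > high_reserve:
           raise ValueError("the second limit must not be below the first")
       if free_now < low_reserve:
           # Below the lower limit: more nodes are needed.
           return "power-up"
       if free_after_power_down >= high_reserve:
           # Even after powering down, the higher reserve remains.
           return "power-down"
       # Between the limits: do nothing, which avoids flapping.
       return "keep"

The gap between the two limits is what prevents a node from being
powered down and then immediately powered up again.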

Computation of the set to free up
---------------------------------

To determine which nodes can be powered down, `hsqueeze` basically
follows the same algorithm as the manual process. It greedily goes
through all non-master nodes and checks whether the algorithm used by
`hbal` would find a solution (with the appropriate move restriction)
that frees up the extended set of nodes to be drained, while keeping
enough resources free. Because `hsqueeze` is based on the algorithm
used by `hbal`, it also respects all restrictions respected by `hbal`,
in particular memory reservation for N+1 redundancy.
The order in which the nodes are tried is chosen by a
suitable heuristic, like trying the nodes in order of increasing
number of instances; the hope is that this reduces the number of
instances that actually have to be moved.
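
The greedy selection described above can be sketched as follows. The
balancing algorithm itself is not modelled; it is represented by a
hypothetical callback `can_free` that reports whether a given set of
nodes can be freed while keeping enough resources. All names here are
assumptions made for illustration.

.. code-block:: python

   def greedy_power_down(nodes, instance_count, can_free):
       """Greedily grow the set of nodes to drain (illustrative only).

       nodes: candidate (non-master) node names.
       instance_count: mapping from node name to number of instances.
       can_free: hypothetical stand-in for the hbal-based check; takes a
           list of node names and returns True if all can be freed.
       """
       chosen = []
       # Heuristic order: nodes with fewer instances are tried first.
       for node in sorted(nodes, key=lambda n: instance_count[n]):
           trial = chosen + [node]
           if can_free(trial):
               # Keep the extended set only if a solution exists.
               chosen = trial
       return chosen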

If the amount of free resources has fallen below the lower limit,
`hsqueeze` will determine the set of nodes to power up in a similar
way; it will hypothetically add more and more of the standby
nodes (in some suitable order) until the algorithm used by `hbal`
balances the cluster in such a way that enough resources are available,
or until all standby nodes are online.
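
A corresponding sketch for the power-up direction is given below. As
before, the balancing step is hidden behind a hypothetical callback,
here called `enough_after_balancing`, and the order of the standby nodes
is taken as given; neither name comes from htools.

.. code-block:: python

   def nodes_to_power_up(standby, enough_after_balancing):
       """Add standby nodes one by one until resources suffice.

       standby: standby node names, already in the preferred order.
       enough_after_balancing: hypothetical stand-in for the hbal-based
           check; takes the list of hypothetically added nodes and
           returns True if enough resources would be available.
       """
       added = []
       for node in standby:
           if enough_after_balancing(added):
               break
           added.append(node)
       # Either the check succeeded or all standby nodes were added.
       return added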


Instance moves and execution
----------------------------

Once the final set of nodes to power down is determined, the instance
moves are determined by the algorithm used by `hbal`. If
requested by the `-X` option, the nodes freed up are drained, and the
instance moves are executed in the same way as `hbal` does. Finally,
those of the freed-up nodes that do not already have a
`htools:standby:` tag are tagged as `htools:standby:auto`, and all
freed-up nodes are marked as offline and powered down via the
:doc:`design-oob`.

Similarly, if it is determined that nodes need to be added, then first
the nodes are powered up via the :doc:`design-oob`, then they are marked
as online, and finally
the cluster is balanced in the same way as `hbal` would do. For the
newly powered up nodes, the `htools:standby:auto` tag, if present, is
removed, but no other tags are removed (including other
`htools:standby:` tags).
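
The tag handling in both directions can be sketched as two small
functions on tag sets. The function names are hypothetical and only
illustrate the rules stated above; they do not describe how `hsqueeze`
talks to the cluster.

.. code-block:: python

   STANDBY_PREFIX = "htools:standby:"
   AUTO_TAG = "htools:standby:auto"

   def tags_after_power_down(tags):
       """Add the auto tag unless some standby tag is already present."""
       result = set(tags)
       if not any(tag.startswith(STANDBY_PREFIX) for tag in result):
           result.add(AUTO_TAG)
       return result

   def tags_after_power_up(tags):
       """Remove only the auto tag; other standby tags are kept."""
       return set(tags) - {AUTO_TAG}

Keeping manually set `htools:standby:` tags on power-up means that a
node an administrator marked as standby stays marked as such.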


Design choices
==============

The proposed algorithm builds on top of the already present balancing
algorithm, instead of greedily packing nodes as full as possible. The
reason is that, in the end, a balanced cluster is needed anyway;
therefore, building on the balancing algorithm reduces the number of
instance moves. Additionally, the final configuration will also
benefit from all improvements to the balancing algorithm, like taking
dynamic CPU data into account.

We decided to have a separate program instead of adding an option to
`hbal` to keep the interfaces, especially that of `hbal`, cleaner. It is
not unlikely that, over time, additional `hsqueeze`-specific options
might be added, specifying, e.g., which nodes to prefer for
shutdown. With the approach of the `htools` of having a single binary
showing different behaviors, having an additional program also does not
introduce significant additional cost.

We decided to have a whole prefix instead of a single tag reserved
for marking standby nodes (we consider all tags starting with
`htools:standby:` as serving only this purpose). This is not only in
accordance with the tag
reservations for other tools, but it also allows for further extension
(like specifying priorities on which nodes to power up first) without
changing name spaces.