From d85f01e740047c985061fe7b91b1305ac5ab6727 Mon Sep 17 00:00:00 2001
From: Iustin Pop <iustin@google.com>
Date: Tue, 20 Sep 2011 13:39:58 +0900
Subject: [PATCH] Add design doc for the resource model changes

This is not complete, but is as close as I can get it for now. I
expect people actually implementing the various changes to extend the
design doc.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
---
 Makefile.am                   |   1 +
 doc/design-draft.rst          |   1 +
 doc/design-resource-model.rst | 899 ++++++++++++++++++++++++++++++++++
 3 files changed, 901 insertions(+)
 create mode 100644 doc/design-resource-model.rst

diff --git a/Makefile.am b/Makefile.am
index 3d855aad9..9cc5b8094 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -306,6 +306,7 @@ docrst = \
 	doc/design-network.rst \
 	doc/design-chained-jobs.rst \
 	doc/design-ovf-support.rst \
+	doc/design-resource-model.rst \
 	doc/cluster-merge.rst \
 	doc/design-shared-storage.rst \
 	doc/design-node-state-cache.rst \
diff --git a/doc/design-draft.rst b/doc/design-draft.rst
index 6ac771e5f..2f0510371 100644
--- a/doc/design-draft.rst
+++ b/doc/design-draft.rst
@@ -12,6 +12,7 @@ Design document drafts
    design-ovf-support.rst
    design-network.rst
    design-node-state-cache.rst
+   design-resource-model.rst
 
 .. vim: set textwidth=72 :
 .. Local Variables:
diff --git a/doc/design-resource-model.rst b/doc/design-resource-model.rst
new file mode 100644
index 000000000..c4bbc11a8
--- /dev/null
+++ b/doc/design-resource-model.rst
@@ -0,0 +1,899 @@
+========================
+ Resource model changes
+========================
+
+
+Introduction
+============
+
+In order to manage virtual machines across the cluster, Ganeti needs to
+understand the resources present on the nodes, the hardware and software
+limitations of the nodes, and how much can be allocated safely on each
+node. Some of these decisions are delegated to IAllocator plugins, for
+easier site-level customisation.
+
+Similarly, the HTools suite has an internal model that simulates the
+hardware resource changes in response to Ganeti operations, in order to
+provide both an iallocator plugin and cluster balancing functionality.
+
+While currently the HTools model is much more advanced than Ganeti's,
+neither one is flexible enough and both are heavily geared toward a
+specific Xen model; they fail to work well with (e.g.) KVM or LXC, or
+with Xen when :term:`tmem` is enabled. Furthermore, the set of metrics
+contained in the models is limited to historic requirements and fails to
+account for (e.g.) heterogeneity in the I/O performance of the nodes.
+
+Current situation
+=================
+
+Ganeti
+------
+
+At this moment, Ganeti itself doesn't do any static modelling of the
+cluster resources. It only does some runtime checks:
+
+- when creating instances, for the (current) free disk space
+- when starting instances, for the (current) free memory
+- during cluster verify, for enough N+1 memory on the secondaries, based
+  on the (current) free memory
+
+Basically this model is a pure :term:`SoW` one, and it works well when
+there are other instances/LVs on the nodes, as it allows Ganeti to deal
+with ‘orphan’ resource usage, but on the other hand it has many issues,
+described below.
+
+HTools
+------
+
+Since HTools does a pure in-memory modelling of the cluster changes as
+it executes the balancing or allocation steps, it had to introduce a
+static (:term:`SoR`) cluster model.
+
+The model is constructed based on the node properties received from
+Ganeti (hence it is basically constrained by what Ganeti can export).
+
+Disk
+~~~~
+
+For disk it consists of just the total (``tdsk``) and the free disk
+space (``fdsk``); we don't directly track the used disk space. On top of
+this, we compute and warn if the sum of disk sizes used by instances
+does not match ``tdsk - fdsk``, but otherwise we do not track this
+separately.
+
+Memory
+~~~~~~
+
+For memory, the model is more complex and tracks some variables that
+Ganeti itself doesn't compute. We start from the total (``tmem``), free
+(``fmem``) and node memory (``nmem``) as supplied by Ganeti, and
+additionally we track:
+
+instance memory (``imem``)
+    the total memory used by primary instances on the node, computed
+    as the sum of instance memory
+
+reserved memory (``rmem``)
+    the memory reserved by peer nodes for N+1 redundancy; this memory is
+    tracked per peer-node, and the maximum value out of the peer memory
+    lists is the node's ``rmem``; when not using DRBD, this will be
+    equal to zero
+
+unaccounted memory (``xmem``)
+    memory that cannot be accounted for via the Ganeti model; this is
+    computed at startup as::
+
+        tmem - imem - nmem - fmem
+
+    and is presumed to remain constant irrespective of any instance
+    moves
+
+available memory (``amem``)
+    this is simply ``fmem - rmem``, so unless we use DRBD, this will be
+    equal to ``fmem``
+
+``tmem``, ``nmem`` and ``xmem`` are presumed constant during the
+instance moves, whereas the ``fmem``, ``imem``, ``rmem`` and ``amem``
+values are updated according to the executed moves.
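+
+As a rough illustration, the derived values can be computed as in the
+following Python sketch (not actual HTools code; the helper name and the
+``peer_mem`` representation are assumptions made for this example)::
+
+    def node_memory_view(tmem, fmem, nmem, instances, peer_mem):
+        """Sketch of the HTools memory bookkeeping for one node.
+
+        tmem/fmem/nmem are the values reported by Ganeti; instances is
+        the list of memory sizes of primary instances; peer_mem maps
+        each DRBD peer node to the memory its instances would need here
+        on failover.
+        """
+        imem = sum(instances)             # memory used by primaries
+        xmem = tmem - imem - nmem - fmem  # unaccounted, fixed at startup
+        rmem = max(peer_mem.values()) if peer_mem else 0  # N+1 reservation
+        amem = fmem - rmem                # available for new allocations
+        return {"imem": imem, "xmem": xmem, "rmem": rmem, "amem": amem}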
+
+CPU
+~~~
+
+The CPU model is different from the disk/memory models, since it's the
+only one where:
+
+#. we do oversubscribe physical CPUs
+#. and there is no natural limit for the number of VCPUs we can allocate
+
+We therefore track the total number of VCPUs used on the node and the
+number of physical CPUs, and we cap the vcpu-to-cpu ratio in order to
+make this somewhat more similar to the other resources which are
+limited.
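+
+A minimal sketch of such a check follows (the function and attribute
+names are illustrative assumptions, not existing Ganeti code)::
+
+    def vcpu_ratio_ok(used_vcpus, new_vcpus, physical_cpus, max_ratio):
+        """Check that adding new_vcpus keeps the node under the cap."""
+        # max_ratio corresponds to the vcpu_ratio policy parameter
+        # proposed later in this document (currently 64 in htools).
+        return (used_vcpus + new_vcpus) <= physical_cpus * max_ratio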
+
+Dynamic load
+~~~~~~~~~~~~
+
+There is also a model that deals with *dynamic load* values in
+htools. As far as we know, it is not currently used with actual load
+values, but it is active by default with unitary values for all
+instances; it currently tracks these metrics:
+
+- disk load
+- memory load
+- cpu load
+- network load
+
+Even though we do not assign real values to these load values, the fact
+that we at least sum them means that the algorithm tries to equalise
+these loads, and especially the network load, which is otherwise not
+tracked at all. The practical result (due to a combination of these four
+metrics) is that the number of secondaries will be balanced.
+
+Limitations
+-----------
+
+
+There are unfortunately many limitations to the current model.
+
+Memory
+~~~~~~
+
+The memory model doesn't work well in the case of KVM. For Xen, the
+memory for the node (i.e. ``dom0``) can be static or dynamic; we don't
+support the latter case, but for the former case, the static value is
+configured on the Xen/kernel command line and can be queried from Xen
+itself. Therefore, Ganeti can query the hypervisor for the memory used
+for the node; the same model was adopted for the chroot/KVM/LXC
+hypervisors, but in these cases there's no natural value for the memory
+used by the base OS/kernel, and we currently try to compute a value for
+the node memory based on current consumption. This, being variable,
+breaks the assumptions in both Ganeti and HTools.
+
+This problem also shows for the free memory: if the free memory on the
+node is not constant (Xen with :term:`tmem` auto-ballooning enabled), or
+if the node and instance memory are pooled together (Linux-based
+hypervisors like KVM and LXC), the current value of the free memory is
+meaningless and cannot be used for instance checks.
+
+A separate issue related to the free memory tracking is that since we
+don't track memory use but rather memory availability, an instance that
+is temporarily down changes Ganeti's understanding of the memory status of
+the node. This can lead to problems such as:
+
+.. digraph:: "free-mem-issue"
+
+  node  [shape=box];
+  inst1 [label="instance1"];
+  inst2 [label="instance2"];
+
+  node  [shape=note];
+  nodeA [label="fmem=0"];
+  nodeB [label="fmem=1"];
+  nodeC [label="fmem=0"];
+
+  node  [shape=ellipse, style=filled, fillcolor=green]
+
+  {rank=same; inst1 inst2}
+
+  stop    [label="crash!", fillcolor=orange];
+  migrate [label="migrate/ok"];
+  start   [style=filled, fillcolor=red, label="start/fail"];
+  inst1   -> stop -> start;
+  stop    -> migrate -> start [style=invis, weight=0];
+  inst2   -> migrate;
+
+  {rank=same; inst1 inst2 nodeA}
+  {rank=same; stop nodeB}
+  {rank=same; migrate nodeC}
+
+  nodeA -> nodeB -> nodeC [style=invis, weight=1];
+
+The behaviour here is wrong; the migration of *instance2* to the node in
+question will succeed or fail depending on whether *instance1* is
+running or not. And for *instance1*, it can lead to cases where, if it
+crashes, it cannot be restarted anymore.
+
+Finally, not a problem but rather an important missing feature is
+support for memory over-subscription: both Xen and KVM have supported
+memory ballooning, even automatic memory ballooning, for a while
+now. The entire memory model is based on a fixed memory size for
+instances, and if memory ballooning is enabled, it will “break” the
+HTools algorithm. Even the fact that KVM instances do not use all their
+memory from the start creates problems (although not as severe, since
+usage will grow and stabilise in the end).
+
+Disks
+~~~~~
+
+Because we currently only track disk space, if we have a cluster of
+``N`` otherwise identical nodes but half of them have 10 drives of size
+``X`` and the other half 2 drives of size ``5X``, HTools will consider
+them exactly the same. However, at least in the case of mechanical
+drives, the I/O performance will differ significantly based on spindle
+count, and a “fair” load distribution should take this into account (a
+similar comment can be made about processor/memory/network speed).
+
+Another problem related to the spindle count is the LVM allocation
+algorithm. Currently, the algorithm always creates (or tries to create)
+striped volumes, with the stripe count being hard-coded to the
+``./configure`` parameter ``--with-lvm-stripecount``. This creates
+problems like:
+
+- when installing from a distribution package, all clusters will be
+  either limited or overloaded due to this fixed value
+- it is not possible to mix heterogeneous nodes (even in different node
+  groups) and have optimal settings for all nodes
+- the striping value applies both to LVM/DRBD data volumes (which are on
+  the order of gigabytes to hundreds of gigabytes) and to DRBD metadata
+  volumes (whose size is always fixed at 128MB); when striping such
+  small volumes over many PVs, their size will increase needlessly (and
+  this can confuse HTools' disk computation algorithm), as illustrated
+  below
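+
+To illustrate the last point: a striped LV is rounded up so that each
+stripe holds a whole number of extents. The following sketch (simplified
+LVM behaviour, assuming the default 4 MiB extent size) shows the effect
+on a 128MB DRBD metadata volume::
+
+    import math
+
+    def striped_lv_size(requested_mib, stripes, extent_mib=4):
+        """Size of a striped LV after rounding up to the stripe boundary."""
+        extents_per_stripe = math.ceil(requested_mib / (stripes * extent_mib))
+        return extents_per_stripe * stripes * extent_mib
+
+    print(striped_lv_size(128, 1))   # 128 - no waste without striping
+    print(striped_lv_size(128, 12))  # 144 - ~12% larger over 12 PVs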
+
+Moreover, the allocation currently proceeds based on a ‘most free
+space’ algorithm. This balances the free space usage on disks, but on
+the other hand it tends to mix the data and metadata volumes of
+different instances rather badly. For example, it cannot do the
+following:
+
+- keep DRBD data and metadata volumes on the same drives, in order to
+  reduce exposure to drive failure in a many-drives system
+- keep DRBD data and metadata volumes on different drives, to reduce
+  performance impact of metadata writes
+
+Additionally, while Ganeti supports setting the volume group separately
+for data and metadata volumes at instance creation, there are no
+defaults for this setting.
+
+Similar to the above stripe count problem (which is about insufficient
+customisation of Ganeti's behaviour), we have limited pass-through
+customisation of the various options of our storage backends; while LVM
+has a system-wide configuration file that can be used to tweak some of
+its behaviours, for DRBD we don't use the :command:`drbdadm` tool, but
+instead call :command:`drbdsetup` directly, with a fixed/restricted set
+of options; so, for example, one cannot tweak the buffer sizes.
+
+Another current problem is that the support for shared storage in HTools
+is still limited, but that problem is outside the scope of this design
+document.
+
+Locking
+~~~~~~~
+
+A further problem generated by the “current free” model is that during a
+long operation which affects resource usage (e.g. disk replaces,
+instance creations) we have to keep the respective objects locked
+(sometimes even in exclusive mode), since we don't want any concurrent
+modifications to the *free* values.
+
+A classic example of the locking problem is the following:
+
+.. digraph:: "iallocator-lock-issues"
+
+  rankdir=TB;
+
+  start [style=invis];
+  node  [shape=box,width=2];
+  job1  [label="add instance\niallocator run\nchoose A,B"];
+  job1e [label="finish add"];
+  job2  [label="add instance\niallocator run\nwait locks"];
+  job2s [label="acquire locks\nchoose C,D"];
+  job2e [label="finish add"];
+
+  job1  -> job1e;
+  job2  -> job2s -> job2e;
+  edge [style=invis,weight=0];
+  start -> {job1; job2}
+  job1  -> job2;
+  job2  -> job1e;
+  job1e -> job2s [style=dotted,label="release locks"];
+
+In the above example, the second IAllocator run will wait for locks for
+nodes ``A`` and ``B``, even though in the end the second instance will
+be placed on another set of nodes (``C`` and ``D``). This wait shouldn't
+be needed, since right after the first IAllocator run has finished,
+:command:`hail` knows the status of the cluster after the allocation,
+and it could answer the question for the second run too; however, Ganeti
+doesn't have such visibility into the cluster state and thus it is
+forced to make the second job wait.
+
+Similar examples can be made about replace disks (another long-running
+opcode).
+
+.. _label-policies:
+
+Policies
+~~~~~~~~
+
+For most of the resources, we have metrics defined by policy: e.g. the
+over-subscription ratio for CPUs, the amount of space to reserve,
+etc. Furthermore, although Ganeti has no definitions such as
+minimum/maximum instance size, a real deployment will need to have
+them, especially in a fully-automated workflow where end-users can
+request instances via an automated interface (that talks to the cluster
+via RAPI, LUXI or the command line). However, such an automated
+interface will also need to take into account cluster capacity, and if
+the :command:`hspace` tool is used for the capacity computation, it
+needs to be told the maximum instance size; moreover, it has a built-in
+minimum instance size which is not customisable.
+
+It is clear that this situation leads to duplicate definition of
+resource policies which makes it hard to easily change per-cluster (or
+globally) the respective policies, and furthermore it creates
+inconsistencies if such policies are not enforced at the source (i.e. in
+Ganeti).
+
+Balancing algorithm
+~~~~~~~~~~~~~~~~~~~
+
+The balancing algorithm, as documented in the HTools ``README`` file,
+tries to minimise the cluster score; this score is based on a set of
+metrics that describe both exceptional conditions and how spread the
+instances are across the nodes. In order to achieve this goal, it moves
+the instances around, with a series of moves of various types:
+
+- disk replaces (for DRBD-based instances)
+- instance failover/migrations (for all types)
+
+However, the algorithm only looks at the cluster score, and not at the
+*“cost”* of the moves. In other words, the following can and will happen
+on a cluster:
+
+.. digraph:: "balancing-cost-issues"
+
+  rankdir=LR;
+  ranksep=1;
+
+  start     [label="score α", shape=hexagon];
+
+  node      [shape=box, width=2];
+  replace1  [label="replace_disks 500G\nscore α-3ε\ncost 3"];
+  replace2a [label="replace_disks 20G\nscore α-2ε\ncost 2"];
+  migrate1  [label="migrate\nscore α-ε\ncost 1"];
+
+  choose    [shape=ellipse,label="choose min(score)=α-3ε\ncost 3"];
+
+  start -> {replace1; replace2a; migrate1} -> choose;
+
+Even though a migration is much, much cheaper than a disk replace (in
+terms of network and disk traffic on the cluster), if the disk replace
+results in a score infinitesimally smaller, then it will be
+chosen. Similarly, between two disk replaces, one moving e.g. ``500GiB``
+and one moving ``20GiB``, the first one will be chosen if it results in
+a score smaller than the second one. Furthermore, even if the resulting
+scores are equal, the first computed solution will be kept, whichever it
+is.
+
+Fixing this algorithmic problem is doable, but currently Ganeti doesn't
+export enough information about nodes to make an informed decision; in
+the above example, if the ``500GiB`` move is between nodes having fast
+I/O (both disks and network), it can make sense to prefer it to a disk
+replace of ``100GiB`` between nodes with slow I/O, so simply looking at
+the properties of the move itself is not enough; we need more node
+information for cost computation.
+
+Allocation algorithm
+~~~~~~~~~~~~~~~~~~~~
+
+.. note:: This design document will not address this limitation, but it
+  is worth mentioning as it is directly related to the resource model.
+
+The current allocation/capacity algorithm works as follows (per
+node-group)::
+
+    repeat:
+        allocate instance without failing N+1
+
+This simple algorithm, and its use of the ``N+1`` criterion, has a
+built-in limit of one machine failure in the case of DRBD. This means
+the algorithm guarantees that, if using DRBD storage, there are enough
+resources to (re)start all affected instances in case of one machine
+failure. This relates mostly to memory; there is no accounting for CPU
+over-subscription (i.e. in case of failure, make sure we can failover
+while still not going over CPU limits), or for any other resource.
+
+In case of shared storage, there's not even the memory guarantee, as the
+N+1 protection doesn't work for shared storage.
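+
+For reference, the N+1 memory criterion itself can be sketched as
+follows (simplified Python with illustrative attribute names, not the
+actual HTools implementation)::
+
+    def n_plus_one_ok(nodes, primary_of, secondaries_on):
+        """Check that each node can absorb the failure of any one peer.
+
+        secondaries_on[node] lists the instances having node as their
+        secondary; primary_of[inst] is the instance's primary node.
+        """
+        for node in nodes:
+            needed = {}  # memory needed here, per failing peer node
+            for inst in secondaries_on[node]:
+                peer = primary_of[inst]
+                needed[peer] = needed.get(peer, 0) + inst.memory
+            if needed and max(needed.values()) > node.free_memory:
+                return False
+        return True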
+
+If a given cluster administrator wants to survive up to two machine
+failures, or wants to ensure CPU limits too for DRBD, there is no
+possibility to configure this in HTools (neither in :command:`hail` nor
+in :command:`hspace`). Current workaround employ for example deducting a
+certain number of instances from the size computed by :command:`hspace`,
+but this is a very crude method, and requires that instance creations
+are limited before Ganeti (otherwise :command:`hail` would allocate
+until the cluster is full).
+
+Proposed architecture
+=====================
+
+
+There are two main changes proposed:
+
+- changing the resource model from a pure :term:`SoW` to a hybrid
+  :term:`SoR`/:term:`SoW` one, where the :term:`SoR` component is
+  heavily emphasised
+- extending the resource model to cover additional properties,
+  filling the “holes” in the current coverage
+
+The second change is rather straightforward, but will add more
+complexity in the modelling of the cluster. The first change, however,
+represents a significant shift from the current model, which Ganeti has
+had since its beginnings.
+
+Lock-improved resource model
+----------------------------
+
+Hybrid SoR/SoW model
+~~~~~~~~~~~~~~~~~~~~
+
+The resources of a node can be characterised in two broad classes:
+
+- mostly static resources
+- dynamically changing resources
+
+In the first category, we have things such as total core count, total
+memory size, total disk size, number of network interfaces etc. In the
+second category we have things such as free disk space, free memory, CPU
+load, etc. Note that nowadays we no longer have fully-static
+resources: features like CPU and memory hot-plug, online disk replace,
+etc. mean that theoretically all resources can change (there are some
+practical limitations, of course).
+
+Even though the rate of change of the two resource types is wildly
+different, right now Ganeti handles both in the same way. Given that the
+interval of change of the semi-static ones is much bigger than most
+Ganeti operations, even more than lengthy sequences of Ganeti jobs, it
+makes sense to treat them separately.
+
+The proposal is then to move the following resources into the
+configuration and treat the configuration as the authoritative source
+for them (a :term:`SoR` model):
+
+- CPU resources:
+    - total core count
+    - node core usage (*new*)
+- memory resources:
+    - total memory size
+    - node memory size
+    - hypervisor overhead (*new*)
+- disk resources:
+    - total disk size
+    - disk overhead (*new*)
+
+Since these resources can nevertheless change at run-time, we will need
+functionality to update the recorded values.
+
+Pre-computing dynamic resource values
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Remember that the resource model used by HTools models the clusters as
+obeying the following equations:
+
+  disk\ :sub:`free` = disk\ :sub:`total` - ∑ disk\ :sub:`instances`
+
+  mem\ :sub:`free` = mem\ :sub:`total` - ∑ mem\ :sub:`instances` - mem\
+  :sub:`node` - mem\ :sub:`overhead`
+
+As this model worked fine for HTools, we can consider it valid and adopt
+it in Ganeti. Furthermore, note that all values on the right-hand side
+now come from the configuration:
+
+- the per-instance usage values were already stored in the configuration
+- the other values will be moved to the configuration as per the
+  previous section
+
+This means that we can now compute the free values without having to
+actually live-query the nodes, which brings a significant advantage.
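+
+A small sketch of such a computation, using only configuration data
+(the attribute names are illustrative)::
+
+    def computed_free(node, instances):
+        """Derive free disk/memory for a node from the configuration.
+
+        node carries the new static attributes; instances are the
+        configured instances charged to this node.
+        """
+        disk_free = node.disk_total - sum(i.disk_size for i in instances)
+        mem_free = (node.mem_total
+                    - sum(i.memory for i in instances)
+                    - node.mem_node
+                    - node.mem_hv)  # hypervisor overhead, per the new model
+        return disk_free, mem_free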
+
+There are a couple of caveats to this model though. First, as the
+run-time state of the instance is no longer taken into consideration, it
+means that we have to introduce a new *offline* state for an instance
+(similar to the node one). In this state, the instance's runtime
+resources (memory and VCPUs) are no longer reserved for it, and can be
+reused by other instances. Static resources like disk and MAC addresses
+are still reserved though. Transitioning into and out of this offline
+state will be more involved than simply stopping/starting the instance
+(e.g. de-offlining can fail due to missing resources). This complexity
+is compensated by the increased consistency of the guarantees we have
+in the stopped state (we always guarantee resource reservation), and the
+potential for management tools to restrict which users can transition
+into/out of this state separately from which users can stop/start the
+instance.
+
+Separating per-node resource locks
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Many of the current node locks in Ganeti exist in order to guarantee
+correct resource state computation, whereas others are designed to
+guarantee reasonable run-time performance of nodes (e.g. by not
+overloading the I/O subsystem). This is an unfortunate coupling, since
+it means for example that the following two operations conflict in
+practice even though they are orthogonal:
+
+- replacing an instance's disks on a node
+- computing node disk/memory free for an IAllocator run
+
+This conflict significantly increases the lock contention on a big/busy
+cluster and is at odds with the goal of increasing the cluster size.
+
+The proposal is therefore to add a new level of locking that is only
+used to prevent concurrent modification to the resource states (either
+node properties or instance properties) and not for long-term
+operations:
+
+- instance creation needs to acquire and keep this lock until adding the
+  instance to the configuration
+- instance modification needs to acquire and keep this lock until
+  updating the instance
+- node property changes will need to acquire this lock for the
+  modification
+
+The new lock level will sit before the instance level (right after BGL)
+and could either be single-valued (like the “Big Ganeti Lock”), in which
+case we won't be able to modify two nodes at the same time, or per-node,
+in which case the list of locks at this level needs to be synchronised
+with the node lock level. To be determined.
+
+Lock contention reduction
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Based on the above, the locking contention will be reduced as follows:
+IAllocator calls will no longer need the ``LEVEL_NODE: ALL_SET`` lock,
+only the resource lock (in exclusive mode). Hence allocating/computing
+evacuation targets will no longer conflict for longer than the time to
+compute the allocation solution.
+
+The remaining long-running locks will be the DRBD replace-disks ones
+(exclusive mode). These can also be removed, or changed into shared
+locks, but that is a separate design change.
+
+.. admonition:: FIXME
+
+  Need to rework instance console vs. instance replace disks. I don't
+  think we need exclusive locks for console and neither for replace
+  disk: it is safe to stop/start the instance while it's doing a replace
+  disks. Only modify would need exclusive, and only for transitioning
+  into/out of offline state.
+
+Instance memory model
+---------------------
+
+In order to support ballooning, the instance memory model needs to be
+changed from a “memory size” one to a “min/max memory size” one. This
+interacts with the new static resource model, however, and thus we need
+to declare a priori the expected oversubscription ratio on the cluster.
+
+The new minimum memory size parameter will be similar to the current
+memory size; the cluster will guarantee that in all circumstances, all
+instances will have available their minimum memory size. The maximum
+memory size will permit burst usage of more memory by instances, with
+the restriction that the sum of maximum memory usage will not be more
+than the free memory times the oversubscription factor:
+
+    ∑ memory\ :sub:`min` ≤ memory\ :sub:`available`
+
+    ∑ memory\ :sub:`max` ≤ memory\ :sub:`free` * oversubscription_ratio
+
+The hypervisor will have the possibility of adjusting the instance's
+memory size dynamically between these two boundaries.
+
+Note that the minimum memory is related to the available memory on the
+node, whereas the maximum memory is related to the free memory. On
+DRBD-enabled clusters, this will have the advantage of using the
+reserved memory for N+1 failover for burst usage, instead of having it
+completely idle.
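+
+A minimal sketch of the resulting admission check (illustrative names,
+not an actual Ganeti function), mirroring the two formulas above::
+
+    def memory_admission_ok(instances, mem_available, mem_free,
+                            oversubscription_ratio):
+        """Check the two constraints of the min/max memory model."""
+        min_ok = sum(i.mem_min for i in instances) <= mem_available
+        max_ok = (sum(i.mem_max for i in instances)
+                  <= mem_free * oversubscription_ratio)
+        return min_ok and max_ok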
+
+.. admonition:: FIXME
+
+  Need to document how Ganeti forces minimum size at runtime, overriding
+  the hypervisor, in cases of failover/lack of resources.
+
+New parameters
+--------------
+
+Unfortunately the design will add a significant number of new
+parameters, and change the meaning of some of the current ones.
+
+Instance size limits
+~~~~~~~~~~~~~~~~~~~~
+
+As described in :ref:`label-policies`, we currently lack a clear
+definition of the supported instance sizes (minimum, maximum and
+standard). As such, we will add the following structure to the cluster
+parameters:
+
+- ``min_ispec``, ``max_ispec``: minimum and maximum acceptable instance
+  specs
+- ``std_ispec``: standard instance size, which will be used for capacity
+  computations and for default parameters on the instance creation
+  request
+
+Ganeti will by default reject non-standard instance sizes (lower than
+``min_ispec`` or greater than ``max_ispec``), but as usual a ``--force``
+option on the command line or in the RAPI request will override these
+constraints. The ``std_ispec`` structure will be used to fill in missing
+instance specifications on create.
+
+Each of the ispec structures will be a dictionary, since the contents
+can change over time. Initially, we will define the following variables
+in these structures:
+
++---------------+----------------------------------+--------------+
+|Name           |Description                       |Type          |
++===============+==================================+==============+
+|mem_min        |Minimum memory size allowed       |int           |
++---------------+----------------------------------+--------------+
+|mem_max        |Maximum allowed memory size       |int           |
++---------------+----------------------------------+--------------+
+|cpu_count      |Allowed vCPU count                |int           |
++---------------+----------------------------------+--------------+
+|disk_count     |Allowed disk count                |int           |
++---------------+----------------------------------+--------------+
+|disk_size      |Allowed disk size                 |int           |
++---------------+----------------------------------+--------------+
+|nic_count      |Allowed NIC count                 |int           |
++---------------+----------------------------------+--------------+
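+
+As an illustration, the three structures could look like this (all
+values are only examples; memory/disk sizes assumed to be in MiB)::
+
+    min_ispec = {"mem_min": 128, "mem_max": 128, "cpu_count": 1,
+                 "disk_count": 1, "disk_size": 1024, "nic_count": 1}
+    max_ispec = {"mem_min": 32768, "mem_max": 65536, "cpu_count": 8,
+                 "disk_count": 16, "disk_size": 1024 * 1024, "nic_count": 8}
+    std_ispec = {"mem_min": 2048, "mem_max": 2048, "cpu_count": 1,
+                 "disk_count": 1, "disk_size": 10240, "nic_count": 1}
+
+    def ispec_in_range(spec, lo, hi):
+        """Check that every field of spec lies within [lo, hi]."""
+        return all(lo[k] <= spec[k] <= hi[k] for k in lo)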
+
+Inheritance
++++++++++++
+
+In a single-group cluster, the above structure is sufficient. However,
+on a multi-group cluster, it could be that the hardware specifications
+differ across node groups, and thus the following problem appears: how
+can Ganeti present unified specifications over RAPI?
+
+Since the set of instance specs is only partially ordered (as opposed to
+the sets of values of individual variables in the spec, which are totally
+ordered), it follows that we can't present unified specs. As such, the
+proposed approach is to allow the ``min_ispec`` and ``max_ispec`` to be
+customised per node-group (and export them as a list of specifications),
+and a single ``std_ispec`` at cluster level (exported as a single value).
+
+
+Allocation parameters
+~~~~~~~~~~~~~~~~~~~~~
+
+Besides the min/max instance size limits, there are other parameters
+related to capacity and allocation limits. These mostly relate to
+problems of over-allocation.
+
++-----------------+----------+---------------------------+----------+------+
+| Name            |Level(s)  |Description                |Current   |Type  |
+|                 |          |                           |value     |      |
++=================+==========+===========================+==========+======+
+|vcpu_ratio       |cluster,  |Maximum ratio of virtual to|64 (only  |float |
+|                 |node group|physical CPUs              |in htools)|      |
++-----------------+----------+---------------------------+----------+------+
+|spindle_ratio    |cluster,  |Maximum ratio of instances |none      |float |
+|                 |node group|to spindles; when the I/O  |          |      |
+|                 |          |model doesn't map directly |          |      |
+|                 |          |to spindles, another       |          |      |
+|                 |          |measure of I/O should be   |          |      |
+|                 |          |used instead               |          |      |
++-----------------+----------+---------------------------+----------+------+
+|max_node_failures|cluster,  |Cap allocation/capacity so |1         |int   |
+|                 |node group|that the cluster can       |(hardcoded|      |
+|                 |          |survive this many node     |in htools)|      |
+|                 |          |failures                   |          |      |
++-----------------+----------+---------------------------+----------+------+
+
+Since these are used mostly internally (in htools), they will be
+exported as-is from Ganeti, without explicit handling of the node-group
+grouping.
+
+Regarding ``spindle_ratio``, in this context spindles do not necessarily
+have to mean actual mechanical hard drives; it's rather a measure of
+I/O performance for internal storage.
+
+Disk parameters
+~~~~~~~~~~~~~~~
+
+The proposed model for new disk parameters is a simple free-form one
+based on dictionaries, indexed per disk level (template or logical disk)
+and type (which depends on the level). At JSON level, since the object
+key has to be a string, we can encode the keys via a separator
+(e.g. slash), or by having two dict levels.
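+
+For illustration, the two encodings could look like this (parameter
+values are only examples, taken from the table below)::
+
+    # Option 1: encode level and type in the key, via a separator
+    diskparams_flat = {
+        "dt/drbd": {"metavg": "xenvg-meta", "metastripes": 1},
+        "ld/drbd8": {"resync_rate": 60, "disk_barriers": "bf"},
+    }
+
+    # Option 2: two nested dictionary levels
+    diskparams_nested = {
+        "dt": {"drbd": {"metavg": "xenvg-meta", "metastripes": 1}},
+        "ld": {"drbd8": {"resync_rate": 60, "disk_barriers": "bf"}},
+    }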
+
++--------+-------------+-------------------------+---------------------+------+
+|Disk    |Name         |Description              |Current status       |Type  |
+|template|             |                         |                     |      |
++========+=============+=========================+=====================+======+
+|dt/plain|stripes      |How many stripes to use  |Configured at        |int   |
+|        |             |for newly created (plain)|./configure time, not|      |
+|        |             |logical volumes          |overridable at       |      |
+|        |             |                         |runtime              |      |
++--------+-------------+-------------------------+---------------------+------+
+|dt/drbd |stripes      |How many stripes to use  |Same as for lvm      |int   |
+|        |             |for data volumes         |                     |      |
++--------+-------------+-------------------------+---------------------+------+
+|dt/drbd |metavg       |Default volume group for |Same as the main     |string|
+|        |             |the metadata LVs         |volume group,        |      |
+|        |             |                         |overridable via      |      |
+|        |             |                         |'metavg' key         |      |
+|        |             |                         |                     |      |
++--------+-------------+-------------------------+---------------------+------+
+|dt/drbd |metastripes  |How many stripes to use  |Same as for lvm      |int   |
+|        |             |for meta volumes         |'stripes', suboptimal|      |
+|        |             |                         |as the meta LVs are  |      |
+|        |             |                         |small                |      |
++--------+-------------+-------------------------+---------------------+------+
+|ld/drbd8|disk_barriers|What kind of barriers to |Either all enabled or|string|
+|        |             |*disable* for disks;     |all disabled, per    |      |
+|        |             |either "n" or a string   |./configure time     |      |
+|        |             |containing a subset of   |option               |      |
+|        |             |"bfd"                    |                     |      |
++--------+-------------+-------------------------+---------------------+------+
+|ld/drbd8|meta_barriers|Whether barriers are     |Handled together with|bool  |
+|        |             |enabled or not for the   |disk_barriers        |      |
+|        |             |meta volume              |                     |      |
+|        |             |                         |                     |      |
++--------+-------------+-------------------------+---------------------+------+
+|ld/drbd8|resync_rate  |The (static) resync rate |Hardcoded in         |int   |
+|        |             |for drbd, when using the |constants.py, not    |      |
+|        |             |static syncer, in MiB/s  |changeable via Ganeti|      |
+|        |             |                         |                     |      |
+|        |             |                         |                     |      |
+|        |             |                         |                     |      |
++--------+-------------+-------------------------+---------------------+------+
+|ld/drbd8|disk_custom  |Free-form string that    |Not supported        |string|
+|        |             |will be appended to the  |                     |      |
+|        |             |drbdsetup disk command   |                     |      |
+|        |             |line, for custom options |                     |      |
+|        |             |not supported by Ganeti  |                     |      |
+|        |             |itself                   |                     |      |
++--------+-------------+-------------------------+---------------------+------+
+|ld/drbd8|net_custom   |Free-form string for     |                     |      |
+|        |             |custom net setup options |                     |      |
+|        |             |                         |                     |      |
+|        |             |                         |                     |      |
+|        |             |                         |                     |      |
+|        |             |                         |                     |      |
++--------+-------------+-------------------------+---------------------+------+
+
+Note that the DRBD8 parameters will change once we support DRBD 8.4,
+which has changed syntax significantly; new syncer modes will be added
+for that release.
+
+All the above parameters are at cluster and node group level; as in
+other parts of the code, the intention is that all nodes in a node group
+should be equal.
+
+Node parameters
+~~~~~~~~~~~~~~~
+
+For the new memory model, we'll add the following parameters, in a
+dictionary indexed by the hypervisor name (node attribute
+``hv_state``). The rationale is that, even though multi-hypervisor
+clusters are rare, they make sense sometimes, and thus we need to
+support multiple node states (one per hypervisor).
+
+Since usually only one of the multiple hypervisors is the 'main' one
+(and the others are used only sparingly), capacity computation will
+still only use the first hypervisor, and not all of them. Thus we avoid
+possible inconsistencies.
+
++----------+-----------------------------------+---------------+-------+
+|Name      |Description                        |Current state  |Type   |
+|          |                                   |               |       |
++==========+===================================+===============+=======+
+|mem_total |Total node memory, as discovered by|Queried at     |int    |
+|          |this hypervisor                    |runtime        |       |
++----------+-----------------------------------+---------------+-------+
+|mem_node  |Memory used by, or reserved for,   |Queried at     |int    |
+|          |the node itself; note that some    |runtime        |       |
+|          |hypervisors can report this in an  |               |       |
+|          |authoritative way, others not      |               |       |
++----------+-----------------------------------+---------------+-------+
+|mem_hv    |Memory used either by the          |Not used,      |int    |
+|          |hypervisor itself or lost due to   |htools computes|       |
+|          |instance allocation rounding;      |it internally  |       |
+|          |usually this cannot be precisely   |               |       |
+|          |computed, but only roughly         |               |       |
+|          |estimated                          |               |       |
++----------+-----------------------------------+---------------+-------+
+|cpu_total |Total node cpu (core) count;       |Queried at     |int    |
+|          |usually this can be discovered     |runtime        |       |
+|          |automatically                      |               |       |
+|          |                                   |               |       |
+|          |                                   |               |       |
+|          |                                   |               |       |
++----------+-----------------------------------+---------------+-------+
+|cpu_node  |Number of cores reserved for the   |Not used at all|int    |
+|          |node itself; this can either be    |               |       |
+|          |discovered or set manually. Only   |               |       |
+|          |used for estimating how many VCPUs |               |       |
+|          |are left for instances             |               |       |
+|          |                                   |               |       |
++----------+-----------------------------------+---------------+-------+
+
+Of the above parameters, only the ``_total`` ones are straightforward.
+The others sometimes have strange semantics:
+
+- Xen can report ``mem_node``, if configured statically (as we
+  recommend); but Linux-based hypervisors (KVM, chroot, LXC) do not, so
+  this value needs to be configured statically for them
+- ``mem_hv``, representing unaccounted for memory, is not directly
+  computable; on Xen, it can be seen that on an N GB machine, with 1 GB
+  for dom0 and N-2 GB for instances, there's just a few MB left, instead
+  of a full 1 GB of RAM; however, the exact value varies with the total
+  memory size (at least)
+- ``cpu_node`` only makes sense on Xen (currently), in the case when we
+  restrict dom0; for Linux-based hypervisors, the node itself cannot be
+  easily restricted, so it should be set as an estimate of how "heavy"
+  the node loads will be
+
+Since these two values cannot be auto-computed from the node, we need to
+be able to declare a default at cluster level (it is debatable how
+useful they are at node group level); the proposal is to do this via a
+cluster-level ``hv_state`` dict (per hypervisor).
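+
+As an illustration, a node's ``hv_state`` attribute and the
+corresponding cluster-level defaults could look like this (hypervisor
+names and all values are only examples)::
+
+    # Per-node state, one entry per enabled hypervisor
+    hv_state = {
+        "xen-pvm": {"mem_total": 49152, "mem_node": 1024, "mem_hv": 512,
+                    "cpu_total": 16, "cpu_node": 1},
+        "kvm": {"mem_total": 49152, "mem_node": 2048, "mem_hv": 512,
+                "cpu_total": 16, "cpu_node": 2},
+    }
+
+    # Cluster-level defaults for the values that cannot be auto-detected
+    cluster_hv_state = {
+        "kvm": {"mem_node": 2048, "mem_hv": 512, "cpu_node": 2},
+    }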
+
+Besides the per-hypervisor attributes, we also have disk attributes,
+which are queried directly on the node (without hypervisor
+involvement). They are stored in a separate attribute (``disk_state``),
+which is indexed per storage type and name; currently this will be just
+``LD_LV`` and the volume name as key.
+
++-------------+-------------------------+--------------------+--------+
+|Name         |Description              |Current state       |Type    |
+|             |                         |                    |        |
++=============+=========================+====================+========+
+|disk_total   |Total disk size          |Queried at runtime  |int     |
+|             |                         |                    |        |
++-------------+-------------------------+--------------------+--------+
+|disk_reserved|Reserved disk size; this |None used in Ganeti;|int     |
+|             |is a lower limit on the  |htools has a        |        |
+|             |free space, if such a    |parameter for this  |        |
+|             |limit is desired         |                    |        |
++-------------+-------------------------+--------------------+--------+
+|disk_overhead|Disk that is expected to |None used in Ganeti;|int     |
+|             |be used by other volumes |htools detects this |        |
+|             |(set via                 |at runtime          |        |
+|             |``reserved_lvs``);       |                    |        |
+|             |usually should be zero   |                    |        |
++-------------+-------------------------+--------------------+--------+
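+
+Similarly, the ``disk_state`` attribute could look like this, indexed by
+storage type and then by name (sizes are illustrative, assumed in MiB)::
+
+    disk_state = {
+        "LD_LV": {
+            "xenvg": {"disk_total": 1024 * 1024,
+                      "disk_reserved": 10 * 1024,
+                      "disk_overhead": 0},
+        },
+    }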
+
+
+Instance parameters
+~~~~~~~~~~~~~~~~~~~
+
+New instance parameters, needed especially for supporting the new memory
+model:
+
++--------------+----------------------------------+-----------------+------+
+|Name          |Description                       |Current status   |Type  |
+|              |                                  |                 |      |
++==============+==================================+=================+======+
+|offline       |Whether the instance is in        |Not supported    |bool  |
+|              |“permanent” offline mode; this is |                 |      |
+|              |stronger than the “admin_down”    |                 |      |
+|              |state, and is similar to the node |                 |      |
+|              |offline attribute                 |                 |      |
++--------------+----------------------------------+-----------------+------+
+|be/max_memory |The maximum memory the instance is|Non-existent, but|int   |
+|              |allowed to use                    |virtually        |      |
+|              |                                  |identical to     |      |
+|              |                                  |memory           |      |
++--------------+----------------------------------+-----------------+------+
+
+HTools changes
+--------------
+
+All the new parameters (node, instance, cluster, not so much disk) will
+need to be taken into account by HTools, both in balancing and in
+capacity computation.
+
+Since Ganeti's cluster model is much enhanced, Ganeti can also export
+its own reserved/overhead variables, and as such HTools can make fewer
+“guesses” as to the difference in values.
+
+.. admonition:: FIXME
+
+   Need to detail more the htools changes; the model is clear to me, but
+   need to write it down.
+
+.. vim: set textwidth=72 :
+.. Local Variables:
+.. mode: rst
+.. fill-column: 72
+.. End:
-- 
GitLab