Commit 5c0c1eeb authored by Iustin Pop

Combine the 2.0 design documents into one

This patch combines all the design documents for 2.0 except the
security one into a single document, in order to ease reading and reduce
duplication of information.

Future patches will start removing wrong pointers to old document names
and adding better integration between the sections.

Reviewed-by: imsnah
parent 0c55c24b
@@ -106,15 +106,7 @@ docsgml = \
docrst = \
doc/design-2.0-cluster-parameters.rst \
doc/design-2.0-commandline-parameters.rst \
doc/design-2.0-disk-handling.rst \
doc/design-2.0-index.rst \
doc/design-2.0-job-queue.rst \
doc/design-2.0-locking.rst \
doc/design-2.0-master-daemon.rst \
doc/design-2.0-os-interface.rst \
doc/design-2.0-rapi-changes.rst \
doc/design-2.0.rst \
doc_DATA = \
Ganeti 2.0 cluster parameters
.. contents::
We need to enhance the way attributes for instances and other clusters
parameters are handled internally within Ganeti in order to have
better flexibility in the following cases:
- introducing new parameters
- writing command line interfaces or APIs for these parameters
- supporting new 2.0 features
When the HVM hypervisor was introduced in Ganeti 1.2, the additional
instance parameters needed for it were simply added to the instance
namespace, as were additional parameters for the PVM hypervisor.
As a result, whether a particular parameter is valid for the
actual hypervisor could be guessed from its name, but only
really checked by following the code using it. Similar to this case,
other parameters are not valid in all cases, and were simply added to
the top-level instance objects.
Across all cluster configuration data, we have multiple classes of
parameters:
A. cluster-wide parameters (e.g. name of the cluster, the master);
these are the ones that we have today, and are unchanged from the
current model
#. node parameters
#. instance specific parameters, e.g. the name of disks (LV), that
cannot be shared with other instances
#. instance parameters, that are or can be the same for many
instances, but are not hypervisor related; e.g. the number of VCPUs,
or the size of memory
#. instance parameters that are hypervisor specific (e.g. kernel_path
or PAE mode)
Detailed Design
The following definitions for instance parameters will be used below:
hypervisor parameter
a hypervisor parameter (or hypervisor specific parameter) is defined
as a parameter that is interpreted by the hypervisor support code in
Ganeti and usually is specific to a particular hypervisor (like the
kernel path for PVM which makes no sense for HVM).
backend parameter
a backend parameter is defined as an instance parameter that can be
shared among a list of instances, and is either generic enough not
to be tied to a given hypervisor or cannot influence the
hypervisor behaviour at all.
For example: memory, vcpus, auto_balance
All these parameters will be encoded with the prefix "BE\_"
and the whole list of parameters will exist in the set "BES_PARAMETERS"
proper parameter
a parameter whose value is unique to the instance (e.g. the name of a LV,
or the MAC of a NIC)
As a general rule, for all kinds of parameters, “None” (or in
JSON-speak, “nil”) will no longer be a valid value for a parameter. As
such, only non-default parameters will be saved as part of objects in
the serialization step, reducing the size of the serialized format.
Cluster parameters
Cluster parameters remain as today, attributes at the top level of the
Cluster object. In addition, two new attributes at this level will
hold defaults for the instances:
- hvparams, a dictionary indexed by hypervisor type, holding default
values for hypervisor parameters that are not defined/overridden by
the instances of this hypervisor type
- beparams, a dictionary holding (for 2.0) a single element 'default',
which holds the default value for backend parameters
Node parameters
Node-related parameters are very few, and we will continue using the
same model for these as previously (attributes on the Node object).
Instance parameters
As described before, the instance parameters are split in three:
instance proper parameters, unique to each instance, instance
hypervisor parameters and instance backend parameters.
The “hvparams” and “beparams” are kept in two dictionaries at instance
level. Only non-default parameters are stored (but once customized, a
parameter will be kept, even with the same value as the default one,
until reset).
The names for hypervisor parameters in the instance.hvparams subtree
should be chosen as generic as possible, especially if specific
parameters could conceivably be useful for more than one hypervisor,
e.g. instance.hvparams.vnc_console_port instead of using both
instance.hvparams.hvm_vnc_console_port and
There are some special cases related to disks and NICs (for example):
a disk has both ganeti-related parameters (e.g. the name of the LV)
and hypervisor-related parameters (how the disk is presented to/named
in the instance). The former parameters remain as proper-instance
parameters, while the latter values are migrated to the hvparams
structure. In 2.0, we will have only globally-per-instance such
hypervisor parameters, and not per-disk ones (e.g. all NICs will be
exported as of the same type).
Starting from the 1.2 list of instance parameters, here is how they
will be mapped to the three classes of parameters:
- name (P)
- primary_node (P)
- os (P)
- hypervisor (P)
- status (P)
- memory (BE)
- vcpus (BE)
- nics (P)
- disks (P)
- disk_template (P)
- network_port (P)
- kernel_path (HV)
- initrd_path (HV)
- hvm_boot_order (HV)
- hvm_acpi (HV)
- hvm_pae (HV)
- hvm_cdrom_image_path (HV)
- hvm_nic_type (HV)
- hvm_disk_type (HV)
- vnc_bind_address (HV)
- serial_no (P)
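The mapping above can be written down as a small lookup table. The following is only an illustrative sketch: ``PARAM_CLASS`` and ``split_instance_params`` are hypothetical names invented for this example, not part of Ganeti.

```python
# Hypothetical illustration: classify 1.2-style flat instance attributes
# into proper (P), backend (BE) and hypervisor (HV) parameters.
PARAM_CLASS = {
    "name": "P", "primary_node": "P", "os": "P", "hypervisor": "P",
    "status": "P", "memory": "BE", "vcpus": "BE", "nics": "P",
    "disks": "P", "disk_template": "P", "network_port": "P",
    "kernel_path": "HV", "initrd_path": "HV", "hvm_boot_order": "HV",
    "hvm_acpi": "HV", "hvm_pae": "HV", "hvm_cdrom_image_path": "HV",
    "hvm_nic_type": "HV", "hvm_disk_type": "HV",
    "vnc_bind_address": "HV", "serial_no": "P",
}


def split_instance_params(flat):
    """Split a flat attribute dict into (proper, beparams, hvparams)."""
    result = {"P": {}, "BE": {}, "HV": {}}
    for key, value in flat.items():
        # An attribute missing from the table raises KeyError.
        result[PARAM_CLASS[key]][key] = value
    return result["P"], result["BE"], result["HV"]
```

Keeping the classification in one table means that moving a parameter between classes later (a risk the design acknowledges) would touch a single entry.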
Parameter validation
To support the new cluster parameter design, additional features will
be required from the hypervisor support implementations in Ganeti.
The hypervisor support implementation API will be extended with the
following features:
:PARAMETERS: class-level attribute holding the list of valid parameters
for this hypervisor
:CheckParamSyntax(hvparams): checks that the given parameters are
valid (as in the names are valid) for this hypervisor; usually just
comparing hvparams.keys() and cls.PARAMETERS; this is a class method
that can be called from within master code (i.e. cmdlib) and should
be safe to do so
:ValidateParameters(hvparams): verifies the values of the provided
parameters against this hypervisor; this is a method that will be
called on the target node, from code, and as such can
make node-specific checks (e.g. kernel_path checking)
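A minimal sketch of how these three features could fit together is given below. Only the names ``PARAMETERS``, ``CheckParamSyntax`` and ``ValidateParameters`` come from the text; the class names, the use of ``ValueError`` and the concrete kernel check are assumptions made for illustration.

```python
import os


class BaseHypervisor:
    """Hypothetical base class; not the actual Ganeti implementation."""

    PARAMETERS = []  # names of the parameters valid for this hypervisor

    @classmethod
    def CheckParamSyntax(cls, hvparams):
        """Name-only check, safe to call from master code."""
        unknown = set(hvparams) - set(cls.PARAMETERS)
        if unknown:
            raise ValueError("unknown hypervisor parameters: %s"
                             % ", ".join(sorted(unknown)))

    def ValidateParameters(self, hvparams):
        """Value check run on the target node; the base accepts everything."""


class XenPvmHypervisor(BaseHypervisor):
    PARAMETERS = ["kernel_path", "initrd_path"]

    def ValidateParameters(self, hvparams):
        # Node-specific check (assumed): the kernel must exist on this node.
        kernel = hvparams.get("kernel_path")
        if kernel is not None and not os.path.isfile(kernel):
            raise ValueError("kernel %s not found on this node" % kernel)
```

The split mirrors the text: the class method needs no node access, while the instance method may inspect the local filesystem.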
Default value application
The application of defaults to an instance is done in the Cluster
object, via two new methods as follows:
- ``Cluster.FillHV(instance)``, returns 'filled' hvparams dict, based on
instance's hvparams and cluster's ``hvparams[instance.hypervisor]``
- ``Cluster.FillBE(instance, be_type="default")``, which returns the
beparams dict, based on the instance and cluster beparams
The FillHV/BE transformations will be used, for example, in the RpcRunner
when sending an instance for activation/stop, and the sent instance
hvparams/beparams will have the final value (noded code doesn't know
about defaults).
LU code will need to self-call the transformation, if needed.
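The default application described above amounts to a dictionary merge in which instance values win over cluster defaults. The sketch below assumes simplified ``Cluster`` and ``Instance`` classes that carry only the attributes named in this document; everything else about them is hypothetical.

```python
class Instance:
    """Simplified stand-in holding only non-default parameters."""

    def __init__(self, hypervisor, hvparams=None, beparams=None):
        self.hypervisor = hypervisor
        self.hvparams = hvparams or {}
        self.beparams = beparams or {}


class Cluster:
    """Simplified stand-in holding the cluster-level defaults."""

    def __init__(self, hvparams, beparams):
        self.hvparams = hvparams  # {hypervisor type: {name: default}}
        self.beparams = beparams  # {"default": {name: default}}

    def FillHV(self, instance):
        # Start from the defaults of the instance's hypervisor type,
        # then overlay the instance's own customized values.
        filled = dict(self.hvparams.get(instance.hypervisor, {}))
        filled.update(instance.hvparams)
        return filled

    def FillBE(self, instance, be_type="default"):
        filled = dict(self.beparams.get(be_type, {}))
        filled.update(instance.beparams)
        return filled
```

Both methods return new dictionaries, so the stored instance keeps only its non-default values.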
Opcode changes
The parameter changes will have impact on the OpCodes, especially on
the following ones:
- OpCreateInstance, where the new hv and be parameters will be sent as
dictionaries; note that all hv and be parameters are now optional, as
the values can be instead taken from the cluster
- OpQueryInstances, where we have to be able to query these new
parameters; the syntax for names will be ``hvparam/$NAME`` and
``beparam/$NAME`` for querying an individual parameter out of one
dictionary, and ``hvparams``, respectively ``beparams``, for the whole
- OpModifyInstance, where the modified parameters are sent as
Additionally, we will need new OpCodes to modify the cluster-level
defaults for the be/hv sets of parameters.
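The query field syntax for OpQueryInstances could be resolved along the following lines. This is a sketch only: ``lookup_field`` is a hypothetical helper, and it assumes the caller passes in already-filled parameter dictionaries.

```python
def lookup_field(field, hvparams, beparams):
    """Resolve 'hvparams', 'beparams', 'hvparam/NAME' or 'beparam/NAME'."""
    if field == "hvparams":
        return hvparams
    if field == "beparams":
        return beparams
    prefix, sep, name = field.partition("/")
    if sep and prefix == "hvparam":
        return hvparams[name]
    if sep and prefix == "beparam":
        return beparams[name]
    raise ValueError("unknown field: %s" % field)
```

For example, ``lookup_field("beparam/memory", hv, be)`` returns a single value, while ``lookup_field("beparams", hv, be)`` returns the whole dictionary.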
One problem that might appear is that our classification is not
complete or not good enough, and we'll need to change this model. As
the last resort, we will need to roll back and keep 1.2 style.
Another problem is that classification of one parameter is unclear
(e.g. ``network_port``, is this BE or HV?); in this case we'll take
the risk of having to move parameters later between classes.
The only security issue that we foresee is if some new parameters will
have sensitive values. If so, we will need to have a way to export the
config data while purging the sensitive values.
E.g. for the drbd shared secrets, we could export these with the
values replaced by an empty string.
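One way such an export could work is a recursive copy that blanks selected keys. The sketch below is an assumption about the approach, not the planned implementation; ``purge_sensitive`` and the key name ``secret`` are hypothetical.

```python
def purge_sensitive(config, sensitive_keys):
    """Return a copy of config with the values of sensitive keys blanked."""
    if isinstance(config, dict):
        return {key: "" if key in sensitive_keys
                else purge_sensitive(value, sensitive_keys)
                for key, value in config.items()}
    if isinstance(config, list):
        return [purge_sensitive(item, sensitive_keys) for item in config]
    # Scalars are returned unchanged.
    return config
```

The original configuration is left untouched, so the same in-memory data can still be used after an export.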
Ganeti 2.0 commandline arguments
.. contents::
Ganeti 2.0 introduces several new features as well as new ways to
handle instance resources like disks or network interfaces. This
requires some noticeable changes in the way commandline arguments are
- extend and modify commandline syntax to support new features
- ensure consistent patterns in commandline arguments to reduce cognitive load
Ganeti 2.0 introduces several changes in handling instance resources
such as disks and network cards as well as some new features. Due to
these changes, the commandline syntax needs to be changed
significantly since the existing commandline syntax is not able to
cover the changes.
Design changes for Ganeti 2.0 that require changes for the commandline
syntax, in no particular order:
- flexible instance disk handling: support a variable number of disks
with varying properties per instance,
- flexible instance network interface handling: support a variable
number of network interfaces with varying properties per instance
- multiple hypervisors: multiple hypervisors can be active on the same
cluster, each supporting different parameters,
- support for device type CDROM (via ISO image)
Detailed Design
There are several areas of Ganeti where the commandline arguments will change:
- Cluster configuration
- cluster initialization
- cluster default configuration
- Instance configuration
- handling of network cards for instances,
- handling of disks for instances,
- handling of CDROM devices and
- handling of hypervisor specific options.
Notes about device removal/addition
To avoid problems with device location changes (e.g. second network
interface of the instance becoming the first or third and the like)
the list of network/disk devices is treated as a stack, i.e. devices
can only be added/removed at the end of the list of devices of each
class (disk or network) for each instance.
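The stack rule can be illustrated with a small container that only allows changes at the end of the list. ``DeviceList`` is a hypothetical name used for this sketch; it is not taken from the Ganeti code.

```python
class DeviceList:
    """Per-instance list of devices of one class (disk or network)."""

    def __init__(self, devices=None):
        self._devices = list(devices or [])

    def add(self, device):
        """Append a device and return its device number."""
        self._devices.append(device)
        return len(self._devices) - 1

    def remove(self):
        """Remove and return the last device; other positions never shift."""
        if not self._devices:
            raise IndexError("no device to remove")
        return self._devices.pop()

    def devices(self):
        return list(self._devices)
```

Because no device can be inserted or removed in the middle, the device number of every remaining device stays stable across modifications.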
gnt-instance commands
The commands for gnt-instance will be modified and extended to allow
for the new functionality:
- the add command will be extended to support the new device and
hypervisor options,
- the modify command continues to handle all modifications to
instances, but will be extended with new arguments for handling
Network Device Options
The generic format of the network device option is:
:$DEVNUM: device number, unsigned integer, starting at 0,
:$OPTION: device option, string,
:$VALUE: device option value, string.
Currently, the following device options will be defined (open to
further changes):
:mac: MAC address of the network interface, accepts either a valid
MAC address or the string 'auto'. If 'auto' is specified, a new MAC
address will be generated randomly. If the mac device option is not
specified, the default value 'auto' is assumed.
:bridge: network bridge the network interface is connected
to. Accepts either a valid bridge name (the specified bridge must
exist on the node(s)) as string or the string 'auto'. If 'auto' is
specified, the default bridge is used. If the bridge option is not
specified, the default value 'auto' is assumed.
Disk Device Options
The generic format of the disk device option is:
:$DEVNUM: device number, unsigned integer, starting at 0,
:$OPTION: device option, string,
:$VALUE: device option value, string.
Currently, the following device options will be defined (open to
further changes):
:size: size of the disk device, either a positive number, specifying
the disk size in mebibytes, or a number followed by a magnitude suffix
(M for mebibytes, G for gibibytes). Also accepts the string 'auto' in
which case the default disk size will be used. If the size option is
not specified, 'auto' is assumed. This option is not valid for all
disk layout types.
:access: access mode of the disk device, a single letter, valid values
- w: read/write access to the disk device or
- r: read-only access to the disk device.
If the access mode is not specified, the default mode of read/write
access will be configured.
:path: path to the image file for the disk device, string. No default
exists. This option is not valid for all disk layout types.
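The rules for the size option can be sketched as a small parser. ``parse_disk_size`` is a hypothetical helper; it assumes whole numbers only and takes the cluster default as an argument.

```python
def parse_disk_size(text, default_mib):
    """Return the size in mebibytes for 'auto', '512', '512M' or '2G'."""
    if text == "auto":
        return default_mib
    multiplier = 1
    number = text
    if text and text[-1] in ("M", "G"):
        # G means gibibytes, i.e. 1024 mebibytes; M leaves the value as is.
        multiplier = 1024 if text[-1] == "G" else 1
        number = text[:-1]
    if not number.isdigit() or int(number) <= 0:
        raise ValueError("invalid disk size: %r" % text)
    return int(number) * multiplier
```

Normalizing to mebibytes at parse time keeps the rest of the code free of unit handling.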
Adding devices
To add devices to an already existing instance, use the device type
specific option to gnt-instance modify. Currently, there are two
device type specific options supported:
:--net: for network interface cards
:--disk: for disk devices
The syntax to the device specific options is similar to the generic
device options, but instead of specifying a device number like for
gnt-instance add, you specify the magic string add. The new device
will always be appended at the end of the list of devices of this type
for the specified instance, e.g. if the instance has disk devices 0,1
and 2, the newly added disk device will be disk device 3.
Example: gnt-instance modify --net add:mac=auto test-instance
Removing devices
Removing devices from an instance is done via gnt-instance
modify. The same device specific options as for adding devices are
used. Instead of a device number and further device options, only the
magic string remove is specified. It will always remove the last
device in the list of devices of this type for the instance specified,
e.g. if the instance has disk devices 0, 1, 2 and 3, the disk device
number 3 will be removed.
Example: gnt-instance modify --net remove test-instance
Modifying devices
Modifying devices is also done with device type specific options to
the gnt-instance modify command. There are currently two device type
options supported:
:--net: for network interface cards
:--disk: for disk devices
The syntax to the device specific options is similar to the generic
device options. The device number you specify identifies the device to
be modified.
Example: gnt-instance modify --disk 2:access=r
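The three forms accepted by ``--net`` and ``--disk`` on gnt-instance modify (``add``, ``remove`` and a device number) could be told apart as sketched below. ``parse_device_arg`` is a hypothetical helper, and the comma-separated ``OPTION=VALUE`` list is assumed from the examples in this document.

```python
def parse_device_arg(text):
    """Parse 'add[:opts]', 'remove' or 'N:opts' into (action, index, options)."""
    head, _, rest = text.partition(":")
    options = {}
    for item in filter(None, rest.split(",")):
        name, sep, value = item.partition("=")
        if not sep:
            raise ValueError("expected OPTION=VALUE, got %r" % item)
        options[name] = value
    if head == "add":
        return ("add", None, options)
    if head == "remove":
        if options:
            raise ValueError("remove takes no options")
        return ("remove", None, options)
    if head.isdigit():
        return ("modify", int(head), options)
    raise ValueError("invalid device specifier: %r" % head)
```

With the examples from the text, ``add:mac=auto`` yields an add action, ``remove`` a remove action, and ``2:access=r`` a modification of device 2.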
Hypervisor Options
Ganeti 2.0 will support more than one hypervisor. Different
hypervisors have various options that only apply to a specific
hypervisor. Those hypervisor specific options are treated specially
via the --hypervisor option. The generic syntax of the hypervisor
option is as follows:
:$HYPERVISOR: symbolic name of the hypervisor to use, string,
has to match the supported hypervisors. Example: xen-pvm
:$OPTION: hypervisor option name, string
:$VALUE: hypervisor option value, string
The hypervisor option for an instance can be set on instance creation
time via the gnt-instance add command. If the hypervisor for an
instance is not specified upon instance creation, the default
hypervisor will be used.
Modifying hypervisor parameters
The hypervisor parameters of an existing instance can be modified
using the --hypervisor option of the gnt-instance modify command. However,
the hypervisor type of an existing instance cannot be changed, only
the particular hypervisor specific option can be changed. Therefore,
the format of the option parameters has been simplified to omit the
hypervisor name and only contain the comma separated list of
option-value pairs.
Example: gnt-instance modify --hypervisor
cdrom=/srv/boot.iso,boot_order=cdrom:network test-instance
gnt-cluster commands
The commands for gnt-cluster will be extended to allow setting and
changing the default parameters of the cluster:
- The init command will be extended to support the defaults option to
set the cluster defaults upon cluster initialization.
- The modify command will be added to modify the cluster
parameters. It will support the --defaults option to change the
cluster defaults.
Cluster defaults
The generic format of the cluster default setting option is:
:$OPTION: cluster default option, string,
:$VALUE: cluster default option value, string.
Currently, the following cluster default options are defined (open to
further changes):
:hypervisor: the default hypervisor to use for new instances,
string. Must be a valid hypervisor known to and supported by the
:disksize: the disksize for newly created instance disks, where
applicable. Must be either a positive number, in which case the unit
of megabyte is assumed, or a positive number followed by a supported
magnitude symbol (M for megabyte or G for gigabyte).
:bridge: the default network bridge to use for newly created instance
network interfaces, string. Must be a valid bridge name of a bridge
existing on the node(s).
Hypervisor cluster defaults
The generic format of the hypervisor clusterwide default setting option is:
--hypervisor-defaults $HYPERVISOR:$OPTION=$VALUE[,$OPTION=$VALUE]
:$HYPERVISOR: symbolic name of the hypervisor whose defaults you want
to set, string
:$OPTION: cluster default option, string,
:$VALUE: cluster default option value, string.
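A parser for the ``--hypervisor-defaults`` argument could look like the sketch below. ``parse_hypervisor_defaults`` is a hypothetical name; splitting on the first colon only is a design assumption that allows option values to contain colons themselves.

```python
def parse_hypervisor_defaults(text):
    """Parse '$HYPERVISOR:$OPTION=$VALUE[,$OPTION=$VALUE]'."""
    hypervisor, sep, rest = text.partition(":")
    if not sep or not hypervisor:
        raise ValueError("expected HYPERVISOR:OPTION=VALUE, got %r" % text)
    options = {}
    for item in rest.split(","):
        name, eq, value = item.partition("=")
        if not eq or not name:
            raise ValueError("expected OPTION=VALUE, got %r" % item)
        options[name] = value
    return hypervisor, options
```

The returned pair maps directly onto one entry of the cluster-level ``hvparams`` dictionary, keyed by hypervisor name.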
Ganeti 2.0 disk handling changes
Change the storage options available and the details of the
implementation such that we overcome some design limitations present
in Ganeti 1.x.
The storage options available in Ganeti 1.x were introduced based on
then-current software (DRBD 0.7 and later DRBD 8) and the estimated
usage patterns. However, experience has later shown that some
assumptions made initially are not true and that more flexibility is
One main assumption made was that disk failures should be treated as 'rare'
events, and that each of them needs to be manually handled in order to ensure
data safety; however, both these assumptions are false:
- disk failures can be a common occurrence, based on usage patterns or cluster
- our disk setup is robust enough (referring to DRBD8 + LVM) that we could
automate more of the recovery
Note that we still don't have fully-automated disk recovery as a goal, but our
goal is to reduce the manual work needed.
We plan the following main changes:
- DRBD8 is much more flexible and stable than its previous version (0.7),
such that removing the support for the ``remote_raid1`` template and
focusing only on DRBD8 is easier
- dynamic discovery of DRBD devices is not actually needed in a cluster
where the DRBD namespace is controlled by Ganeti; switching to a static
assignment (done at either instance creation time or change secondary time)
will change the disk activation time from O(n) to O(1), which on big
clusters is a significant gain
- remove the hard dependency on LVM (currently all available storage types are
ultimately backed by LVM volumes) by introducing file-based storage
Additionally, a number of smaller enhancements are also planned:
- support variable number of disks
- support read-only disks
Future enhancements in the 2.x series, which do not require base design
changes, might include:
- enhancement of the LVM allocation method in order to try to keep
all of an instance's virtual disks on the same physical
- add support for DRBD8 authentication at handshake time in
order to ensure each device connects to the correct peer
- remove the restrictions on failover only to the secondary
which creates very strict rules on cluster allocation
Detailed Design
DRBD minor allocation
Currently, when trying to identify or activate a new DRBD (or MD)
device, the code scans all in-use devices in order to see if we find
one that looks similar to our parameters and is already in the desired
state or not. Since this needs external commands to be run, it is very
slow when more than a few devices are already present.
Therefore, we will change the discovery model from dynamic to
static. When a new device is logically created (added to the
configuration) a free minor number is computed from the list of
devices that should exist on that node and assigned to that
At device activation, if the minor is already in use, we check if
it has our parameters; if not so, we just destroy the device (if
possible, otherwise we abort) and start it with our own
This means that we in effect take ownership of the minor space for
that device type; if there's a user-created drbd minor, it will be
automatically removed.
The change will have the effect of reducing the number of external
commands run per device from a constant number times the index of the
first free DRBD minor to just a constant number.
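The static allocation can be sketched as a pure function over the minors recorded in the configuration for a node, with no external commands involved. ``find_free_minor`` is a hypothetical helper, and picking the lowest free number is an assumption; the text only requires that a free minor be chosen.

```python
def find_free_minor(used_minors):
    """Return the lowest minor number not present in used_minors."""
    used = set(used_minors)
    minor = 0
    while minor in used:
        minor += 1
    return minor
```

For a node whose configured devices use minors 0, 1 and 3, this returns 2; for a node with no devices it returns 0.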
Removal of obsolete device types (md, drbd7)
We need to remove these device types because of two issues. First,
drbd7 has bad failure modes in case of dual failures (both network and
disk): it cannot propagate the error up the device stack and instead
just panics. Second, due to the asymmetry between primary and
secondary in md+drbd mode, we cannot do live failover (not even if we
had md+drbd8).
File-based storage support
This is covered by a separate design doc (*Vinales*) and
would allow us to get rid of the hard requirement for testing
clusters; it would also allow people who have SAN storage to do live
failover taking advantage of their storage solution.
Variable number of disks
In order to support high-security scenarios (for example read-only sda
and read-write sdb), we need to make a fully flexible disk
definition. This has less impact than it might look at first sight:
only the instance creation has hardcoded number of disks, not the disk