Commit 5c0c1eeb authored by Iustin Pop's avatar Iustin Pop

Combine the 2.0 design documents into one

This patch combines all the design documents for 2.0 except the
security one into a single document, in order to ease reading and reduce
duplication of information.

Future patches will start removing wrong pointers to old document names
and some better integration between the sections.

Reviewed-by: imsnah
parent 0c55c24b
@@ -106,15 +106,7 @@ docsgml = \
docrst = \
doc/design-2.0-cluster-parameters.rst \
doc/design-2.0-commandline-parameters.rst \
doc/design-2.0-disk-handling.rst \
doc/design-2.0-index.rst \
doc/design-2.0-job-queue.rst \
doc/design-2.0-locking.rst \
doc/design-2.0-master-daemon.rst \
doc/design-2.0-os-interface.rst \
doc/design-2.0-rapi-changes.rst \
doc/design-2.0.rst \
doc_DATA = \

Ganeti 2.0 cluster parameters
=============================

.. contents::
We need to enhance the way attributes for instances and other cluster
parameters are handled internally within Ganeti in order to have
better flexibility in the following cases:

- introducing new parameters
- writing command line interfaces or APIs for these parameters
- supporting new 2.0 features
When the HVM hypervisor was introduced in Ganeti 1.2, the additional
instance parameters needed for it were simply added to the instance
namespace, as were additional parameters for the PVM hypervisor.

As a result, whether a particular parameter is valid for the selected
hypervisor could at best be guessed from its name, and only really
checked by following the code using it. Similarly, other parameters
are not valid in all cases, and were simply added to the top-level
instance objects.
Across all cluster configuration data, we have multiple classes of
parameters:

A. cluster-wide parameters (e.g. the name of the cluster, the master);
   these are the ones that we have today, and are unchanged from the
   current model
#. node parameters
#. instance specific parameters, e.g. the name of disks (LV), that
   cannot be shared with other instances
#. instance parameters, that are or can be the same for many
   instances, but are not hypervisor related; e.g. the number of VCPUs,
   or the size of memory
#. instance parameters that are hypervisor specific (e.g. kernel_path
   or PAE mode)

Detailed Design
---------------

The following definitions for instance parameters will be used below:
hypervisor parameter
  a hypervisor parameter (or hypervisor specific parameter) is defined
  as a parameter that is interpreted by the hypervisor support code in
  Ganeti and usually is specific to a particular hypervisor (like the
  kernel path for PVM, which makes no sense for HVM).

backend parameter
  a backend parameter is defined as an instance parameter that can be
  shared among a list of instances, and is either generic enough not
  to be tied to a given hypervisor or cannot influence the hypervisor
  behaviour at all.

  For example: memory, vcpus, auto_balance

  All these parameters will be encoded with the prefix "BE\_" and the
  whole list of parameters will exist in the set "BES_PARAMETERS".

proper parameter
  a parameter whose value is unique to the instance (e.g. the name of
  an LV, or the MAC of a NIC)
As a general rule, for all kinds of parameters, “None” (or in
JSON-speak, “nil”) will no longer be a valid value for a parameter. As
such, only non-default parameters will be saved as part of objects in
the serialization step, reducing the size of the serialized format.

Cluster parameters
~~~~~~~~~~~~~~~~~~

Cluster parameters remain as today, attributes at the top level of the
Cluster object. In addition, two new attributes at this level will
hold defaults for the instances:

- hvparams, a dictionary indexed by hypervisor type, holding default
  values for hypervisor parameters that are not defined/overridden by
  the instances of this hypervisor type
- beparams, a dictionary holding (for 2.0) a single element 'default',
  which holds the default values for backend parameters

Node parameters
~~~~~~~~~~~~~~~

Node-related parameters are very few, and we will continue using the
same model for these as previously (attributes on the Node object).

Instance parameters
~~~~~~~~~~~~~~~~~~~

As described before, the instance parameters are split in three:
instance proper parameters, unique to each instance, instance
hypervisor parameters and instance backend parameters.
The “hvparams” and “beparams” are kept in two dictionaries at instance
level. Only non-default parameters are stored (but once customized, a
parameter will be kept, even with the same value as the default one,
until reset).
The names for hypervisor parameters in the instance.hvparams subtree
should be chosen to be as generic as possible, especially if specific
parameters could conceivably be useful for more than one hypervisor,
e.g. instance.hvparams.vnc_console_port instead of using both
instance.hvparams.hvm_vnc_console_port and
instance.hvparams.pvm_vnc_console_port.
There are some special cases related to disks and NICs (for example):
a disk has both Ganeti-related parameters (e.g. the name of the LV)
and hypervisor-related parameters (how the disk is presented to/named
in the instance). The former parameters remain as proper instance
parameters, while the latter values are migrated to the hvparams
structure. In 2.0, we will only have such hypervisor parameters
globally per instance, not per disk (e.g. all NICs will be exported
as being of the same type).
Starting from the 1.2 list of instance parameters, here is how they
will be mapped to the three classes of parameters:
- name (P)
- primary_node (P)
- os (P)
- hypervisor (P)
- status (P)
- memory (BE)
- vcpus (BE)
- nics (P)
- disks (P)
- disk_template (P)
- network_port (P)
- kernel_path (HV)
- initrd_path (HV)
- hvm_boot_order (HV)
- hvm_acpi (HV)
- hvm_pae (HV)
- hvm_cdrom_image_path (HV)
- hvm_nic_type (HV)
- hvm_disk_type (HV)
- vnc_bind_address (HV)
- serial_no (P)

Parameter validation
~~~~~~~~~~~~~~~~~~~~

To support the new cluster parameter design, additional features will
be required from the hypervisor support implementations in Ganeti.
The hypervisor support implementation API will be extended with the
following features:
:PARAMETERS: class-level attribute holding the list of valid parameters
  for this hypervisor
:CheckParamSyntax(hvparams): checks that the given parameters are
  valid (as in, the names are valid) for this hypervisor; usually just
  comparing hvparams.keys() and cls.PARAMETERS; this is a class method
  that can be called from within master code (i.e. cmdlib) and should
  be safe to do so
:ValidateParameters(hvparams): verifies the values of the provided
  parameters against this hypervisor; this is a method that will be
  called on the target node, from code, and as such can
  make node-specific checks (e.g. kernel_path checking)
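As an illustration, the two hooks could look roughly like the
following sketch (the class name and parameter list are illustrative
only, not the actual Ganeti implementation):

```python
import os.path


class XenPvmHypervisor:
    # Illustrative list of valid parameter names for this hypervisor
    PARAMETERS = ["kernel_path", "initrd_path", "root_path"]

    @classmethod
    def CheckParamSyntax(cls, hvparams):
        """Check that only known parameter names are used.

        Only compares names, so it is safe to call from master code.
        """
        invalid = set(hvparams) - set(cls.PARAMETERS)
        if invalid:
            raise ValueError("Invalid hypervisor parameters: %s" %
                             ", ".join(sorted(invalid)))

    def ValidateParameters(self, hvparams):
        """Verify parameter values; meant to run on the target node."""
        kernel = hvparams.get("kernel_path")
        if kernel is not None and not os.path.isfile(kernel):
            raise ValueError("Kernel %s not found on this node" % kernel)
```

Note how only ValidateParameters touches the node's filesystem, which
is why it must run on the target node rather than on the master.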

Default value application
~~~~~~~~~~~~~~~~~~~~~~~~~

The application of defaults to an instance is done in the Cluster
object, via two new methods as follows:
- ``Cluster.FillHV(instance)``, which returns a 'filled' hvparams dict,
  based on the instance's hvparams and the cluster's
  ``hvparams[instance.hypervisor]``
- ``Cluster.FillBE(instance, be_type="default")``, which returns the
  filled beparams dict, based on the instance and cluster beparams
The FillHV/BE transformations will be used, for example, in the RpcRunner
when sending an instance for activation/stop, and the sent instance
hvparams/beparams will have the final value (noded code doesn't know
about defaults).
LU code will need to self-call the transformation, if needed.
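The filling itself is a plain dictionary overlay; a minimal sketch,
with hypothetical parameter values:

```python
def FillDict(defaults, custom):
    """Overlay instance-level overrides on cluster-level defaults.

    Since None is no longer a valid parameter value, 'custom' contains
    only explicitly set parameters; everything else comes from defaults.
    """
    filled = defaults.copy()
    filled.update(custom)
    return filled


# Hypothetical cluster-level defaults and per-instance overrides:
cluster_hv_defaults = {"kernel_path": "/boot/vmlinuz",
                       "root_path": "/dev/xvda"}
instance_hvparams = {"root_path": "/dev/sda1"}

filled = FillDict(cluster_hv_defaults, instance_hvparams)
# filled now holds the cluster kernel_path and the instance root_path
```

This also shows why only non-default parameters need to be serialized:
the full view can always be reconstructed from the two dictionaries.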

Opcode changes
~~~~~~~~~~~~~~

The parameter changes will have impact on the OpCodes, especially on
the following ones:
- OpCreateInstance, where the new hv and be parameters will be sent as
  dictionaries; note that all hv and be parameters are now optional, as
  the values can instead be taken from the cluster
- OpQueryInstances, where we have to be able to query these new
  parameters; the syntax for names will be ``hvparam/$NAME`` and
  ``beparam/$NAME`` for querying an individual parameter out of one
  dictionary, and ``hvparams``, respectively ``beparams``, for the
  whole dictionaries
- OpModifyInstance, where the modified parameters are sent as
  dictionaries
Additionally, we will need new OpCodes to modify the cluster-level
defaults for the be/hv sets of parameters.
One problem that might appear is that our classification is not
complete or not good enough, and we'll need to change this model. As
a last resort, we would need to roll back and keep the 1.2 style.
Another problem is that classification of one parameter is unclear
(e.g. ``network_port``, is this BE or HV?); in this case we'll take
the risk of having to move parameters later between classes.
The only security issue that we foresee is if some new parameters
have sensitive values. If so, we will need a way to export the
config data while purging the sensitive values.

E.g. for the DRBD shared secrets, we could export these with the
values replaced by an empty string.

Ganeti 2.0 commandline arguments
================================

.. contents::
Ganeti 2.0 introduces several new features as well as new ways to
handle instance resources like disks or network interfaces. This
requires some noticeable changes in the way commandline arguments are
handled:

- extend and modify commandline syntax to support new features
- ensure consistent patterns in commandline arguments to reduce
  cognitive load
Ganeti 2.0 introduces several changes in handling instance resources
such as disks and network cards, as well as some new features. Due to
these changes, the commandline syntax needs to be changed
significantly, since the existing commandline syntax is not able to
cover the changes.
Design changes for Ganeti 2.0 that require changes to the commandline
syntax, in no particular order:

- flexible instance disk handling: support a variable number of disks
  with varying properties per instance,
- flexible instance network interface handling: support a variable
  number of network interfaces with varying properties per instance,
- multiple hypervisors: multiple hypervisors can be active on the same
  cluster, each supporting different parameters,
- support for device type CDROM (via ISO image)

Detailed Design
---------------

There are several areas of Ganeti where the commandline arguments will
change:

- Cluster configuration

  - cluster initialization
  - cluster default configuration

- Instance configuration

  - handling of network cards for instances,
  - handling of disks for instances,
  - handling of CDROM devices and
  - handling of hypervisor specific options.

Notes about device removal/addition
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To avoid problems with device location changes (e.g. second network
interface of the instance becoming the first or third and the like)
the list of network/disk devices is treated as a stack, i.e. devices
can only be added/removed at the end of the list of devices of each
class (disk or network) for each instance.
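In other words, each device class behaves like a stack. A minimal
sketch of these semantics (the class is illustrative, not Ganeti
code):

```python
class DeviceStack:
    """Per-class device list where devices are only added or removed
    at the end, so existing device numbers never change."""

    def __init__(self):
        self._devices = []

    def add(self, params):
        """Append a new device and return its device number."""
        self._devices.append(params)
        return len(self._devices) - 1

    def remove(self):
        """Remove and return the last (highest-numbered) device."""
        return self._devices.pop()


nics = DeviceStack()
nics.add({"mac": "auto"})        # becomes device 0
nics.add({"bridge": "xen-br0"})  # becomes device 1
nics.remove()                    # removes device 1; device 0 is untouched
```

Keeping additions and removals at the end of the list is what makes
device numbers stable identifiers for the remaining devices.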

gnt-instance commands
~~~~~~~~~~~~~~~~~~~~~

The commands for gnt-instance will be modified and extended to allow
for the new functionality:
- the add command will be extended to support the new device and
  hypervisor options,
- the modify command continues to handle all modifications to
  instances, but will be extended with new arguments for handling
  devices and hypervisor options.

Network Device Options
++++++++++++++++++++++

The generic format of the network device option is:

  --net $DEVNUM:$OPTION=$VALUE[,$OPTION=$VALUE]

:$DEVNUM: device number, unsigned integer, starting at 0,
:$OPTION: device option, string,
:$VALUE: device option value, string.
Currently, the following device options will be defined (open to
further changes):
:mac: MAC address of the network interface, accepts either a valid
  MAC address or the string 'auto'. If 'auto' is specified, a new MAC
  address will be generated randomly. If the mac device option is not
  specified, the default value 'auto' is assumed.
:bridge: network bridge the network interface is connected
  to. Accepts either a valid bridge name (the specified bridge must
  exist on the node(s)) as string or the string 'auto'. If 'auto' is
  specified, the default bridge is used. If the bridge option is not
  specified, the default value 'auto' is assumed.

Disk Device Options
+++++++++++++++++++

The generic format of the disk device option is:

  --disk $DEVNUM:$OPTION=$VALUE[,$OPTION=$VALUE]

:$DEVNUM: device number, unsigned integer, starting at 0,
:$OPTION: device option, string,
:$VALUE: device option value, string.
Currently, the following device options will be defined (open to
further changes):
:size: size of the disk device, either a positive number, specifying
  the disk size in mebibytes, or a number followed by a magnitude suffix
  (M for mebibytes, G for gibibytes). Also accepts the string 'auto' in
  which case the default disk size will be used. If the size option is
  not specified, 'auto' is assumed. This option is not valid for all
  disk layout types.
:access: access mode of the disk device, a single letter, valid values
  are:

  - w: read/write access to the disk device or
  - r: read-only access to the disk device.

  If the access mode is not specified, the default mode of read/write
  access will be configured.
:path: path to the image file for the disk device, string. No default
  exists. This option is not valid for all disk layout types.
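Both the network and the disk option share the
`$DEVNUM:$OPTION=$VALUE[,$OPTION=$VALUE]` shape, so a single parser
sketch covers them (a hypothetical helper, not actual Ganeti code):

```python
def parse_device_option(spec):
    """Split '$DEVNUM:$OPTION=$VALUE[,$OPTION=$VALUE]' into a device
    identifier and an option dictionary.

    The magic strings 'add' and 'remove' (used by gnt-instance modify)
    pass through unchanged; anything else must be a device number.
    """
    ident, _, opts = spec.partition(":")
    params = {}
    for item in filter(None, opts.split(",")):
        key, _, value = item.partition("=")
        params[key] = value
    if ident not in ("add", "remove"):
        ident = int(ident)  # plain device numbers are unsigned integers
    return ident, params
```

For example, `parse_device_option("0:mac=auto,bridge=xen-br0")` yields
`(0, {"mac": "auto", "bridge": "xen-br0"})`.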

Adding devices
++++++++++++++

To add devices to an already existing instance, use the device type
specific option to gnt-instance modify. Currently, there are two
device type specific options supported:
:--net: for network interface cards
:--disk: for disk devices
The syntax of the device specific options is similar to the generic
device options, but instead of specifying a device number like for
gnt-instance add, you specify the magic string add. The new device
will always be appended at the end of the list of devices of this type
for the specified instance, e.g. if the instance has disk devices 0, 1
and 2, the newly added disk device will be disk device 3.
Example: gnt-instance modify --net add:mac=auto test-instance

Removing devices
++++++++++++++++

Removing devices from an instance is done via gnt-instance
modify. The same device specific options as for adding devices are
used. Instead of a device number and further device options, only the
magic string remove is specified. It will always remove the last
device in the list of devices of this type for the instance specified,
e.g. if the instance has disk devices 0, 1, 2 and 3, disk device
number 3 will be removed.
Example: gnt-instance modify --net remove test-instance

Modifying devices
+++++++++++++++++

Modifying devices is also done with device type specific options to
the gnt-instance modify command. There are currently two device type
options supported:
:--net: for network interface cards
:--disk: for disk devices
The syntax of the device specific options is similar to the generic
device options. The device number you specify identifies the device to
be modified.
Example: gnt-instance modify --disk 2:access=r

Hypervisor Options
++++++++++++++++++

Ganeti 2.0 will support more than one hypervisor. Different
hypervisors have various options that only apply to a specific
hypervisor. Those hypervisor specific options are treated specially
via the --hypervisor option. The generic syntax of the hypervisor
option is as follows:

  --hypervisor $HYPERVISOR:$OPTION=$VALUE[,$OPTION=$VALUE]

:$HYPERVISOR: symbolic name of the hypervisor to use, string,
  has to match the supported hypervisors. Example: xen-pvm
:$OPTION: hypervisor option name, string
:$VALUE: hypervisor option value, string
The hypervisor option for an instance can be set on instance creation
time via the gnt-instance add command. If the hypervisor for an
instance is not specified upon instance creation, the default
hypervisor will be used.

Modifying hypervisor parameters
+++++++++++++++++++++++++++++++

The hypervisor parameters of an existing instance can be modified
using the --hypervisor option of the gnt-instance modify command.
However, the hypervisor type of an existing instance cannot be
changed, only the particular hypervisor specific options can be
changed. Therefore, the format of the option parameters has been
simplified to omit the hypervisor name and only contain the comma
separated list of option-value pairs.
Example: gnt-instance modify --hypervisor
cdrom=/srv/boot.iso,boot_order=cdrom:network test-instance

gnt-cluster commands
~~~~~~~~~~~~~~~~~~~~

The command for gnt-cluster will be extended to allow setting and
changing the default parameters of the cluster:
- The init command will be extended to support the --defaults option
  to set the cluster defaults upon cluster initialization.
- The modify command will be added to modify the cluster
  parameters. It will support the --defaults option to change the
  cluster defaults.

Cluster defaults
~~~~~~~~~~~~~~~~

The generic format of the cluster default setting option is:

  --defaults $OPTION=$VALUE[,$OPTION=$VALUE]

:$OPTION: cluster default option, string,
:$VALUE: cluster default option value, string.
Currently, the following cluster default options are defined (open to
further changes):
:hypervisor: the default hypervisor to use for new instances,
  string. Must be a valid hypervisor known to and supported by the
  cluster.
:disksize: the disksize for newly created instance disks, where
  applicable. Must be either a positive number, in which case the unit
  of megabyte is assumed, or a positive number followed by a supported
  magnitude symbol (M for megabyte or G for gigabyte).
:bridge: the default network bridge to use for newly created instance
  network interfaces, string. Must be a valid bridge name of a bridge
  existing on the node(s).

Hypervisor cluster defaults
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The generic format of the hypervisor clusterwide default setting
option is:

  --hypervisor-defaults $HYPERVISOR:$OPTION=$VALUE[,$OPTION=$VALUE]

:$HYPERVISOR: symbolic name of the hypervisor whose defaults you want
  to set, string
:$OPTION: cluster default option, string,
:$VALUE: cluster default option value, string.

Ganeti 2.0 design documents
===========================

The 2.x versions of Ganeti will constitute a rewrite of the 'core'
architecture, plus some additional features (however 2.0 is geared
toward the core changes).

Core changes
------------

The main changes will be switching from a per-process model to a
daemon based model, where the individual gnt-* commands will be
clients that talk to this daemon (see the design-2.0-master-daemon
document). This will allow us to get rid of the global cluster lock
for most operations, having instead a per-object lock (see
design-2.0-granular-locking). Also, the daemon will be able to queue
jobs, and this will allow the individual clients to submit jobs without
waiting for them to finish, and also see the results of old requests
(see design-2.0-job-queue).
Besides these major changes, another 'core' change, though less
visible to the users, will be changing the model of object attribute
storage, separating it into namespaces (such that a Xen PVM instance
will not have the Xen HVM parameters). This will allow future
flexibility in defining additional parameters. More details are in
the design-2.0-cluster-parameters document.
The various changes brought in by the master daemon model and the
read-write RAPI will require changes to the cluster security; we move
away from Twisted and use http(s) for intra- and extra-cluster
communications. For more details, see the security document in the
doc/ directory.

Functionality changes
---------------------

The disk storage will receive some changes, and support for the drbd7
and md disk types will be removed. See the design-2.0-disk-changes
document.
The configuration storage will be changed, with the effect that more
data will be available on the nodes for access from outside Ganeti
(e.g. from shell scripts) and that nodes will get slightly more
awareness of the cluster configuration.
The RAPI will enable modify operations (beside the read-only queries
that are available today), so in effect almost all the operations
available today via the ``gnt-*`` commands will be available via the
remote API.
A change in the hypervisor support area will be that we will support
multiple hypervisors in parallel in the same cluster, so one could run
Xen HVM side-by-side with Xen PVM on the same cluster.

New features
------------

There will be a number of minor feature enhancements targeted at
either 2.0 or subsequent 2.x releases:

- multiple disks, with custom properties (read-only/read-write,
  exportable, etc.)
- multiple NICs

These changes will require OS API changes, details are in the
design-2.0-os-interface document. They will also require many
command line changes, see the design-2.0-commandline-parameters
document.

Job Queue
=========

.. contents::
In Ganeti 1.2, operations in a cluster have to be done in a serialized way.
Virtually any operation locks the whole cluster by grabbing the global lock.
Other commands can't return before all work has been done.
By implementing a job queue and granular locking, we can lower the latency of
command execution inside a Ganeti cluster.

Detailed Design
---------------


Job execution—“Life of a Ganeti job”
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

#. Job gets submitted by the client. A new job identifier is generated
   and assigned to the job. The job is then automatically replicated
   [#replic]_ to all nodes in the cluster. The identifier is returned
   to the client.
#. A pool of worker threads waits for new jobs. If all are busy, the
   job has to wait and the first worker finishing its work will grab
   it. Otherwise any of the waiting threads will pick up the new job.
#. Client waits for job status updates by calling a waiting RPC
   function. Log messages may be shown to the user. Until the job is
   started, it can also be cancelled.
#. As soon as the job is finished, its final result and status can be
   retrieved from the server.
#. If the client archives the job, it gets moved to a history
   directory. There will be a method to archive all jobs older than a
   given age.
.. [#replic] We need replication in order to maintain the consistency
   across all nodes in the system; the master node only differs in the
   fact that now it is running the master daemon, but if it fails and
   we do a master failover, the jobs are still visible on the new
   master (even though they will be marked as failed).
Failures to replicate a job to other nodes will be only flagged as
errors in the master daemon log if more than half of the nodes failed,
otherwise we ignore the failure, and rely on the fact that the next
update (for still running jobs) will retry the update. For finished
jobs, it is less of a problem.
Future improvements will look into checking the consistency of the job
list and jobs themselves at master daemon startup.

Job storage
~~~~~~~~~~~

Jobs are stored in the filesystem as individual files, serialized
using JSON (standard serialization mechanism in Ganeti).
The choice of storing each job in its own file was made because:

- a file can be atomically replaced
- a file can easily be replicated to other nodes
- checking consistency across nodes can be implemented very easily,
  since all job files should be (at a given moment in time) identical
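The atomic-replacement property could be exploited as in this sketch
(the file naming and helper function are illustrative, not the actual
Ganeti code):

```python
import json
import os
import tempfile


def write_job_file(queue_dir, job_id, job_data):
    """Serialize a job to JSON and atomically replace its file.

    Writing to a temporary file first and renaming it over the target
    means readers never observe a partially written job file.
    """
    path = os.path.join(queue_dir, "job-%d" % job_id)
    fd, tmp = tempfile.mkstemp(dir=queue_dir)
    try:
        with os.fdopen(fd, "w") as fh:
            json.dump(job_data, fh)
        os.rename(tmp, path)  # atomic on POSIX within one filesystem
    except BaseException:
        os.unlink(tmp)
        raise
    return path
```

Because the temporary file lives in the same directory (and thus the
same filesystem) as the target, the rename is an atomic replacement.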
The other possible choices that were discussed and discounted were:

- a single big file with all job data: not feasible due to difficult
  updates
- in-process databases: hard to replicate the entire database to the
  other nodes, and replicating individual operations does not mean we
  keep consistency across nodes

Queue structure
~~~~~~~~~~~~~~~