diff --git a/NEWS b/NEWS index 20290a01de426131dfd71ffa37f107a7480dac68..c540562c9b4edde679be1b1acaa2a97f5ebec9aa 100644 --- a/NEWS +++ b/NEWS @@ -7,8 +7,8 @@ Version 2.0.3 - Added ``--ignore-size`` to the ``gnt-instance activate-disks`` command to allow using the pre-2.0.2 behaviour in activation, if any existing instances have mismatched disk sizes in the configuration -- Added ``gnt-cluster repair-disk-sizes`` command to check and update any - configuration mismatches for disk sizes +- Added ``gnt-cluster repair-disk-sizes`` command to check and update + any configuration mismatches for disk sizes - Added ``gnt-master cluste-failover --no-voting`` to allow master failover to work on two-node clusters - Fixed the “--net” option of ``gnt-backup import``, which was unusable @@ -61,9 +61,9 @@ Version 2.0.1 - the watcher now also restarts the node daemon and the rapi daemon if they died - fixed the watcher to handle full and drained queue cases -- hooks export more instance data in the environment, which helps if hook - scripts need to take action based on the instance's properties (no - longer need to query back into ganeti) +- hooks export more instance data in the environment, which helps if + hook scripts need to take action based on the instance's properties + (no longer need to query back into ganeti) - instance failovers when the instance is stopped do not check for free RAM, so that failing over a stopped instance is possible in low memory situations @@ -169,10 +169,10 @@ Version 2.0 beta 1 - all commands are executed by a daemon (``ganeti-masterd``) and the various ``gnt-*`` commands are just front-ends to it - - all the commands are entered into, and executed from a job queue, see - the ``gnt-job(8)`` manpage - - the RAPI daemon supports read-write operations, secured by basic HTTP - authentication on top of HTTPS + - all the commands are entered into, and executed from a job queue, + see the ``gnt-job(8)`` manpage + - the RAPI daemon supports read-write operations, secured by basic + HTTP authentication on top of HTTPS - DRBD version 0.7 support has been removed, DRBD 8 is the only supported version (when migrating from Ganeti 1.2 to 2.0, you need to migrate to DRBD 8 first while still running Ganeti 1.2) @@ -193,8 +193,8 @@ Version 1.2.7 - Change the default reboot type in ``gnt-instance reboot`` to "hard" - Reuse the old instance mac address by default on instance import, if the instance name is the same. -- Handle situations in which the node info rpc returns incomplete results - (issue 46) +- Handle situations in which the node info rpc returns incomplete + results (issue 46) - Add checks for tcp/udp ports collisions in ``gnt-cluster verify`` - Improved version of batcher: @@ -218,10 +218,10 @@ Version 1.2.6 - new ``--hvm-nic-type`` and ``--hvm-disk-type`` flags to control the type of disk exported to fully virtualized instances. - provide access to the serial console of HVM instances -- instance auto_balance flag, set by default. If turned off it will avoid - warnings on cluster verify if there is not enough memory to fail over - an instance. in the future it will prevent automatically failing it - over when we will support that. +- instance auto_balance flag, set by default. If turned off it will + avoid warnings on cluster verify if there is not enough memory to fail + over an instance. in the future it will prevent automatically failing + it over when we will support that.
- batcher tool for instance creation, see ``tools/README.batcher`` - ``gnt-instance reinstall --select-os`` to interactively select a new operating system when reinstalling an instance. @@ -347,8 +347,8 @@ Version 1.2.1 Version 1.2.0 ------------- -- Log the ``xm create`` output to the node daemon log on failure (to help - diagnosing the error) +- Log the ``xm create`` output to the node daemon log on failure (to + help diagnosing the error) - In debug mode, log all external commands output if failed to the logs - Change parsing of lvm commands to ignore stderr @@ -384,8 +384,8 @@ Version 1.2b2 reboots - Removed dependency on debian's patched fping that uses the non-standard ``-S`` option -- Now the OS definitions are searched for in multiple, configurable paths - (easier for distros to package) +- Now the OS definitions are searched for in multiple, configurable + paths (easier for distros to package) - Some changes to the hooks infrastructure (especially the new post-configuration update hook) - Other small bugfixes diff --git a/doc/admin.rst b/doc/admin.rst index 3c23ca4c3a5c5670e63b5ead11b54bf970ca6ae1..3d0003dd924e71dd7f04f87e9906ccc7c1e2b807 100644 --- a/doc/admin.rst +++ b/doc/admin.rst @@ -343,7 +343,8 @@ At this point, the machines are ready for a cluster creation; in case you want to remove Ganeti completely, you need to also undo some of the SSH changes and log directories: -- ``rm -rf /var/log/ganeti /srv/ganeti`` (replace with the correct paths) +- ``rm -rf /var/log/ganeti /srv/ganeti`` (replace with the correct + paths) - remove from ``/root/.ssh`` the keys that Ganeti added (check the ``authorized_keys`` and ``id_dsa`` files) - regenerate the host's SSH keys (check the OpenSSH startup scripts) diff --git a/doc/design-2.0.rst b/doc/design-2.0.rst index effeb1d8edb3ea006f959dbe12ee7f8b0c06eb33..c2c4591d837de55e77ddbd74758b0023615ef44c 100644 --- a/doc/design-2.0.rst +++ b/doc/design-2.0.rst @@ -30,7 +30,8 @@ following main scalability issues: - poor handling of node failures in the cluster - mixing hypervisors in a cluster not allowed -It also has a number of artificial restrictions, due to historical design: +It also has a number of artificial restrictions, due to historical +design: - fixed number of disks (two) per instance - fixed number of NICs @@ -55,8 +56,8 @@ operations. This has been painful at various times, for example: - It is impossible for two people to efficiently interact with a cluster (for example for debugging) at the same time. -- When batch jobs are running it's impossible to do other work (for example - failovers/fixes) on a cluster. +- When batch jobs are running it's impossible to do other work (for + example failovers/fixes) on a cluster. 
This poses scalability problems: as clusters grow in node and instance size it's a lot more likely that operations which one could conceive @@ -155,7 +156,8 @@ In Ganeti 2.0, we will have the following *entities*: The master-daemon related interaction paths are: -- (CLI tools/RAPI daemon) and the master daemon, via the so called *LUXI* API +- (CLI tools/RAPI daemon) and the master daemon, via the so called + *LUXI* API - the master daemon and the node daemons, via the node RPC There are also some additional interaction paths for exceptional cases: @@ -237,10 +239,10 @@ Responses will follow the same format, with the two fields being: There are two special value for the result field: - in the case that the operation failed, and this field is a list of - length two, the client library will try to interpret is as an exception, - the first element being the exception type and the second one the - actual exception arguments; this will allow a simple method of passing - Ganeti-related exception across the interface + length two, the client library will try to interpret is as an + exception, the first element being the exception type and the second + one the actual exception arguments; this will allow a simple method of + passing Ganeti-related exception across the interface - for the *WaitForChange* call (that waits on the server for a job to change status), if the result is equal to ``nochange`` instead of the usual result for this call (a list of changes), then the library will @@ -381,13 +383,14 @@ disadvantages to using it: - the more advanced granular locking that we want to implement would require, if written in the async-manner, deep integration with the Twisted stack, to such an extend that business-logic is inseparable - from the protocol coding; we felt that this is an unreasonable request, - and that a good protocol library should allow complete separation of - low-level protocol calls and business logic; by comparison, the threaded - approach combined with HTTPs protocol required (for the first iteration) - absolutely no changes from the 1.2 code, and later changes for optimizing - the inter-node RPC calls required just syntactic changes (e.g. - ``rpc.call_...`` to ``self.rpc.call_...``) + from the protocol coding; we felt that this is an unreasonable + request, and that a good protocol library should allow complete + separation of low-level protocol calls and business logic; by + comparison, the threaded approach combined with HTTPs protocol + required (for the first iteration) absolutely no changes from the 1.2 + code, and later changes for optimizing the inter-node RPC calls + required just syntactic changes (e.g. ``rpc.call_...`` to + ``self.rpc.call_...``) Another issue is with the Twisted API stability - during the Ganeti 1.x lifetime, we had to to implement many times workarounds to changes @@ -401,9 +404,10 @@ we just reused that for inter-node communication. Granular locking ~~~~~~~~~~~~~~~~ -We want to make sure that multiple operations can run in parallel on a Ganeti -Cluster. In order for this to happen we need to make sure concurrently run -operations don't step on each other toes and break the cluster. +We want to make sure that multiple operations can run in parallel on a +Ganeti Cluster. In order for this to happen we need to make sure +concurrently run operations don't step on each other toes and break the +cluster. 
This design addresses how we are going to deal with locking so that: @@ -411,23 +415,25 @@ This design addresses how we are going to deal with locking so that: - we prevent deadlocks - we prevent job starvation -Reaching the maximum possible parallelism is a Non-Goal. We have identified a -set of operations that are currently bottlenecks and need to be parallelised -and have worked on those. In the future it will be possible to address other -needs, thus making the cluster more and more parallel one step at a time. +Reaching the maximum possible parallelism is a Non-Goal. We have +identified a set of operations that are currently bottlenecks and need +to be parallelised and have worked on those. In the future it will be +possible to address other needs, thus making the cluster more and more +parallel one step at a time. This section only talks about parallelising Ganeti level operations, aka -Logical Units, and the locking needed for that. Any other synchronization lock -needed internally by the code is outside its scope. +Logical Units, and the locking needed for that. Any other +synchronization lock needed internally by the code is outside its scope. Library details +++++++++++++++ The proposed library has these features: -- internally managing all the locks, making the implementation transparent - from their usage -- automatically grabbing multiple locks in the right order (avoid deadlock) +- internally managing all the locks, making the implementation + transparent from their usage +- automatically grabbing multiple locks in the right order (avoid + deadlock) - ability to transparently handle conversion to more granularity - support asynchronous operation (future goal) @@ -446,9 +452,9 @@ All the locks will be represented by objects (like ``lockings.SharedLock``), and the individual locks for each object will be created at initialisation time, from the config file. -The API will have a way to grab one or more than one locks at the same time. -Any attempt to grab a lock while already holding one in the wrong order will be -checked for, and fail. +The API will have a way to grab one or more than one locks at the same +time. Any attempt to grab a lock while already holding one in the wrong +order will be checked for, and fail. The Locks @@ -460,11 +466,11 @@ At the first stage we have decided to provide the following locks: - One lock per node in the cluster - One lock per instance in the cluster -All the instance locks will need to be taken before the node locks, and the -node locks before the config lock. Locks will need to be acquired at the same -time for multiple instances and nodes, and internal ordering will be dealt -within the locking library, which, for simplicity, will just use alphabetical -order. +All the instance locks will need to be taken before the node locks, and +the node locks before the config lock. Locks will need to be acquired at +the same time for multiple instances and nodes, and internal ordering +will be dealt within the locking library, which, for simplicity, will +just use alphabetical order. Each lock has the following three possible statuses: @@ -475,37 +481,39 @@ Each lock has the following three possible statuses: Handling conversion to more granularity +++++++++++++++++++++++++++++++++++++++ -In order to convert to a more granular approach transparently each time we -split a lock into more we'll create a "metalock", which will depend on those -sub-locks and live for the time necessary for all the code to convert (or -forever, in some conditions). 
When a metalock exists all converted code must -acquire it in shared mode, so it can run concurrently, but still be exclusive -with old code, which acquires it exclusively. +In order to convert to a more granular approach transparently each time +we split a lock into more we'll create a "metalock", which will depend +on those sub-locks and live for the time necessary for all the code to +convert (or forever, in some conditions). When a metalock exists all +converted code must acquire it in shared mode, so it can run +concurrently, but still be exclusive with old code, which acquires it +exclusively. -In the beginning the only such lock will be what replaces the current "command" -lock, and will acquire all the locks in the system, before proceeding. This -lock will be called the "Big Ganeti Lock" because holding that one will avoid -any other concurrent Ganeti operations. +In the beginning the only such lock will be what replaces the current +"command" lock, and will acquire all the locks in the system, before +proceeding. This lock will be called the "Big Ganeti Lock" because +holding that one will avoid any other concurrent Ganeti operations. -We might also want to devise more metalocks (eg. all nodes, all nodes+config) -in order to make it easier for some parts of the code to acquire what it needs -without specifying it explicitly. +We might also want to devise more metalocks (eg. all nodes, all +nodes+config) in order to make it easier for some parts of the code to +acquire what it needs without specifying it explicitly. -In the future things like the node locks could become metalocks, should we -decide to split them into an even more fine grained approach, but this will -probably be only after the first 2.0 version has been released. +In the future things like the node locks could become metalocks, should +we decide to split them into an even more fine grained approach, but +this will probably be only after the first 2.0 version has been +released. Adding/Removing locks +++++++++++++++++++++ -When a new instance or a new node is created an associated lock must be added -to the list. The relevant code will need to inform the locking library of such -a change. +When a new instance or a new node is created an associated lock must be +added to the list. The relevant code will need to inform the locking +library of such a change. -This needs to be compatible with every other lock in the system, especially -metalocks that guarantee to grab sets of resources without specifying them -explicitly. The implementation of this will be handled in the locking library -itself. +This needs to be compatible with every other lock in the system, +especially metalocks that guarantee to grab sets of resources without +specifying them explicitly. The implementation of this will be handled +in the locking library itself. When instances or nodes disappear from the cluster the relevant locks must be removed. This is easier than adding new elements, as the code @@ -517,36 +525,39 @@ Asynchronous operations +++++++++++++++++++++++ For the first version the locking library will only export synchronous -operations, which will block till the needed lock are held, and only fail if -the request is impossible or somehow erroneous. +operations, which will block till the needed lock are held, and only +fail if the request is impossible or somehow erroneous. 
In the future we may want to implement different types of asynchronous operations such as: - try to acquire this lock set and fail if not possible -- try to acquire one of these lock sets and return the first one you were - able to get (or after a timeout) (select/poll like) +- try to acquire one of these lock sets and return the first one you + were able to get (or after a timeout) (select/poll like) -These operations can be used to prioritize operations based on available locks, -rather than making them just blindly queue for acquiring them. The inherent -risk, though, is that any code using the first operation, or setting a timeout -for the second one, is susceptible to starvation and thus may never be able to -get the required locks and complete certain tasks. Considering this -providing/using these operations should not be among our first priorities. +These operations can be used to prioritize operations based on available +locks, rather than making them just blindly queue for acquiring them. +The inherent risk, though, is that any code using the first operation, +or setting a timeout for the second one, is susceptible to starvation +and thus may never be able to get the required locks and complete +certain tasks. Considering this providing/using these operations should +not be among our first priorities. Locking granularity +++++++++++++++++++ For the first version of this code we'll convert each Logical Unit to -acquire/release the locks it needs, so locking will be at the Logical Unit -level. In the future we may want to split logical units in independent -"tasklets" with their own locking requirements. A different design doc (or mini -design doc) will cover the move from Logical Units to tasklets. +acquire/release the locks it needs, so locking will be at the Logical +Unit level. In the future we may want to split logical units in +independent "tasklets" with their own locking requirements. A different +design doc (or mini design doc) will cover the move from Logical Units +to tasklets. Code examples +++++++++++++ -In general when acquiring locks we should use a code path equivalent to:: +In general when acquiring locks we should use a code path equivalent +to:: lock.acquire() try: @@ -561,10 +572,10 @@ structures in an unusable state. Note that with Python 2.5 a simpler syntax will be possible, but we want to keep compatibility with Python 2.4 so the new constructs should not be used. -In order to avoid this extra indentation and code changes everywhere in the -Logical Units code, we decided to allow LUs to declare locks, and then execute -their code with their locks acquired. In the new world LUs are called like -this:: +In order to avoid this extra indentation and code changes everywhere in +the Logical Units code, we decided to allow LUs to declare locks, and +then execute their code with their locks acquired. In the new world LUs +are called like this:: # user passed names are expanded to the internal lock/resource name, # then known needed locks are declared @@ -579,22 +590,23 @@ this:: lu.Exec() ... locks declared for removal are removed, all acquired locks released ... -The Processor and the LogicalUnit class will contain exact documentation on how -locks are supposed to be declared. +The Processor and the LogicalUnit class will contain exact documentation +on how locks are supposed to be declared. 
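For illustration, a minimal sketch of the LU side of this model (the attribute and method names here are assumptions made for the example, not something this design mandates)::

  class LUStartInstance(object):
    """Hypothetical LU that only needs its instance and primary node."""

    def ExpandNames(self):
      # Expand the user-supplied names into internal lock names and
      # declare them; the processor acquires these before calling Exec().
      self.needed_locks = {
        "instance": ["instance1.example.com"],
        "node": ["node1.example.com"],
      }

    def Exec(self):
      # Runs with all declared locks already held; no explicit
      # acquire/release is needed in the LU body.
      pass

The processor-side sequence shown above then takes care of acquiring and releasing whatever the LU declared.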
Caveats +++++++ This library will provide an easy upgrade path to bring all the code to granular locking without breaking everything, and it will also guarantee -against a lot of common errors. Code switching from the old "lock everything" -lock to the new system, though, needs to be carefully scrutinised to be sure it -is really acquiring all the necessary locks, and none has been overlooked or -forgotten. +against a lot of common errors. Code switching from the old "lock +everything" lock to the new system, though, needs to be carefully +scrutinised to be sure it is really acquiring all the necessary locks, +and none has been overlooked or forgotten. -The code can contain other locks outside of this library, to synchronise other -threaded code (eg for the job queue) but in general these should be leaf locks -or carefully structured non-leaf ones, to avoid deadlock race conditions. +The code can contain other locks outside of this library, to synchronise +other threaded code (eg for the job queue) but in general these should +be leaf locks or carefully structured non-leaf ones, to avoid deadlock +race conditions. Job Queue @@ -614,25 +626,26 @@ will generate N opcodes of type replace disks). Job execution—“Life of a Ganeti job” ++++++++++++++++++++++++++++++++++++ -#. Job gets submitted by the client. A new job identifier is generated and - assigned to the job. The job is then automatically replicated [#replic]_ - to all nodes in the cluster. The identifier is returned to the client. -#. A pool of worker threads waits for new jobs. If all are busy, the job has - to wait and the first worker finishing its work will grab it. Otherwise any - of the waiting threads will pick up the new job. -#. Client waits for job status updates by calling a waiting RPC function. - Log message may be shown to the user. Until the job is started, it can also - be canceled. -#. As soon as the job is finished, its final result and status can be retrieved - from the server. +#. Job gets submitted by the client. A new job identifier is generated + and assigned to the job. The job is then automatically replicated + [#replic]_ to all nodes in the cluster. The identifier is returned to + the client. +#. A pool of worker threads waits for new jobs. If all are busy, the job + has to wait and the first worker finishing its work will grab it. + Otherwise any of the waiting threads will pick up the new job. +#. Client waits for job status updates by calling a waiting RPC + function. Log message may be shown to the user. Until the job is + started, it can also be canceled. +#. As soon as the job is finished, its final result and status can be + retrieved from the server. #. If the client archives the job, it gets moved to a history directory. There will be a method to archive all jobs older than a a given age. -.. [#replic] We need replication in order to maintain the consistency across - all nodes in the system; the master node only differs in the fact that - now it is running the master daemon, but it if fails and we do a master - failover, the jobs are still visible on the new master (though marked as - failed). +.. [#replic] We need replication in order to maintain the consistency + across all nodes in the system; the master node only differs in the + fact that now it is running the master daemon, but it if fails and we + do a master failover, the jobs are still visible on the new master + (though marked as failed).
Failures to replicate a job to other nodes will be only flagged as errors in the master daemon log if more than half of the nodes failed, @@ -654,23 +667,24 @@ The choice of storing each job in its own file was made because: - a file can be atomically replaced - a file can easily be replicated to other nodes -- checking consistency across nodes can be implemented very easily, since - all job files should be (at a given moment in time) identical +- checking consistency across nodes can be implemented very easily, + since all job files should be (at a given moment in time) identical The other possible choices that were discussed and discounted were: -- single big file with all job data: not feasible due to difficult updates +- single big file with all job data: not feasible due to difficult + updates - in-process databases: hard to replicate the entire database to the - other nodes, and replicating individual operations does not mean wee keep - consistency + other nodes, and replicating individual operations does not mean wee + keep consistency Queue structure +++++++++++++++ -All file operations have to be done atomically by writing to a temporary file -and subsequent renaming. Except for log messages, every change in a job is -stored and replicated to other nodes. +All file operations have to be done atomically by writing to a temporary +file and subsequent renaming. Except for log messages, every change in a +job is stored and replicated to other nodes. :: @@ -688,9 +702,9 @@ stored and replicated to other nodes. Locking +++++++ -Locking in the job queue is a complicated topic. It is called from more than -one thread and must be thread-safe. For simplicity, a single lock is used for -the whole job queue. +Locking in the job queue is a complicated topic. It is called from more +than one thread and must be thread-safe. For simplicity, a single lock +is used for the whole job queue. A more detailed description can be found in doc/locking.rst. @@ -711,24 +725,25 @@ jobqueue_rename(old, new) Client RPC ++++++++++ -RPC between Ganeti clients and the Ganeti master daemon supports the following -operations: +RPC between Ganeti clients and the Ganeti master daemon supports the +following operations: SubmitJob(ops) - Submits a list of opcodes and returns the job identifier. The identifier is - guaranteed to be unique during the lifetime of a cluster. + Submits a list of opcodes and returns the job identifier. The + identifier is guaranteed to be unique during the lifetime of a + cluster. WaitForJobChange(job_id, fields, […], timeout) - This function waits until a job changes or a timeout expires. The condition - for when a job changed is defined by the fields passed and the last log - message received. + This function waits until a job changes or a timeout expires. The + condition for when a job changed is defined by the fields passed and + the last log message received. QueryJobs(job_ids, fields) Returns field values for the job identifiers passed. CancelJob(job_id) - Cancels the job specified by identifier. This operation may fail if the job - is already running, canceled or finished. + Cancels the job specified by identifier. This operation may fail if + the job is already running, canceled or finished. ArchiveJob(job_id) - Moves a job into the …/archive/ directory. This operation will fail if the - job has not been canceled or finished. + Moves a job into the …/archive/ directory. This operation will fail if + the job has not been canceled or finished.
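To illustrate how these operations are meant to be combined (a sketch only: the client object, the status values and the exact call signatures are assumptions made for the example, not part of this specification)::

  def run_job(client, opcodes):
    # Submit the opcodes as one job; the identifier is unique for the
    # lifetime of the cluster.
    job_id = client.SubmitJob(opcodes)

    # Poll the job, blocking on the server between status changes.
    while True:
      (status,) = client.QueryJobs([job_id], ["status"])[0]
      if status in ("success", "error", "canceled"):
        break
      client.WaitForJobChange(job_id, ["status"], timeout=60)

    # Optionally move the finished job into the archive directory.
    client.ArchiveJob(job_id)
    return status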
Job and opcode status +++++++++++++++++++++ @@ -749,8 +764,8 @@ Success Error The job/opcode was aborted with an error. -If the master is aborted while a job is running, the job will be set to the -Error status once the master started again. +If the master is aborted while a job is running, the job will be set to +the Error status once the master started again. History @@ -810,12 +825,13 @@ The following definitions for instance parameters will be used below: For example: memory, vcpus, auto_balance - All these parameters will be encoded into constants.py with the prefix "BE\_" - and the whole list of parameters will exist in the set "BES_PARAMETERS" + All these parameters will be encoded into constants.py with the prefix + "BE\_" and the whole list of parameters will exist in the set + "BES_PARAMETERS" :proper parameter: - a parameter whose value is unique to the instance (e.g. the name of a LV, - or the MAC of a NIC) + a parameter whose value is unique to the instance (e.g. the name of a + LV, or the MAC of a NIC) As a general rule, for all kind of parameters, “None” (or in JSON-speak, “nil”) will no longer be a valid value for a parameter. As @@ -932,10 +948,10 @@ object, via two new methods as follows: - ``Cluster.FillBE(instance, be_type="default")``, which returns the beparams dict, based on the instance and cluster beparams -The FillHV/BE transformations will be used, for example, in the RpcRunner -when sending an instance for activation/stop, and the sent instance -hvparams/beparams will have the final value (noded code doesn't know -about defaults). +The FillHV/BE transformations will be used, for example, in the +RpcRunner when sending an instance for activation/stop, and the sent +instance hvparams/beparams will have the final value (noded code doesn't +know about defaults). LU code will need to self-call the transformation, if needed. @@ -945,9 +961,9 @@ Opcode changes The parameter changes will have impact on the OpCodes, especially on the following ones: -- ``OpCreateInstance``, where the new hv and be parameters will be sent as - dictionaries; note that all hv and be parameters are now optional, as - the values can be instead taken from the cluster +- ``OpCreateInstance``, where the new hv and be parameters will be sent + as dictionaries; note that all hv and be parameters are now optional, + as the values can be instead taken from the cluster - ``OpQueryInstances``, where we have to be able to query these new parameters; the syntax for names will be ``hvparam/$NAME`` and ``beparam/$NAME`` for querying an individual parameter out of one @@ -1093,8 +1109,8 @@ The code is changed in the following ways: Caveats: - some operation semantics are less clear (e.g. what to do on instance - start with offline secondary?); for now, these will just fail as if the - flag is not set (but faster) + start with offline secondary?); for now, these will just fail as if + the flag is not set (but faster) - 2-node cluster with one node offline needs manual startup of the master with a special flag to skip voting (as the master can't get a quorum there) @@ -1133,7 +1149,8 @@ following situation: clean the above instance(s) In order to prevent this situation, and to be able to get nodes into -proper offline status easily, a new *drained* flag was added to the nodes. +proper offline status easily, a new *drained* flag was added to the +nodes.
This flag (which actually means "is being, or was drained, and is expected to go offline"), will prevent allocations on the node, but @@ -1173,32 +1190,33 @@ estimated usage patters. However, experience has later shown that some assumptions made initially are not true and that more flexibility is needed. -One main assumption made was that disk failures should be treated as 'rare' -events, and that each of them needs to be manually handled in order to ensure -data safety; however, both these assumptions are false: +One main assumption made was that disk failures should be treated as +'rare' events, and that each of them needs to be manually handled in +order to ensure data safety; however, both these assumptions are false: -- disk failures can be a common occurrence, based on usage patterns or cluster - size -- our disk setup is robust enough (referring to DRBD8 + LVM) that we could - automate more of the recovery +- disk failures can be a common occurrence, based on usage patterns or + cluster size +- our disk setup is robust enough (referring to DRBD8 + LVM) that we + could automate more of the recovery -Note that we still don't have fully-automated disk recovery as a goal, but our -goal is to reduce the manual work needed. +Note that we still don't have fully-automated disk recovery as a goal, +but our goal is to reduce the manual work needed. As such, we plan the following main changes: -- DRBD8 is much more flexible and stable than its previous version (0.7), - such that removing the support for the ``remote_raid1`` template and - focusing only on DRBD8 is easier +- DRBD8 is much more flexible and stable than its previous version + (0.7), such that removing the support for the ``remote_raid1`` + template and focusing only on DRBD8 is easier -- dynamic discovery of DRBD devices is not actually needed in a cluster that - where the DRBD namespace is controlled by Ganeti; switching to a static - assignment (done at either instance creation time or change secondary time) - will change the disk activation time from O(n) to O(1), which on big - clusters is a significant gain +- dynamic discovery of DRBD devices is not actually needed in a cluster + that where the DRBD namespace is controlled by Ganeti; switching to a + static assignment (done at either instance creation time or change + secondary time) will change the disk activation time from O(n) to + O(1), which on big clusters is a significant gain -- remove the hard dependency on LVM (currently all available storage types are - ultimately backed by LVM volumes) by introducing file-based storage +- remove the hard dependency on LVM (currently all available storage + types are ultimately backed by LVM volumes) by introducing file-based + storage Additionally, a number of smaller enhancements are also planned: - support variable number of disks @@ -1326,8 +1344,8 @@ With a modified disk activation sequence, we can implement the *failover to any* functionality, removing many of the layout restrictions of a cluster: -- the need to reserve memory on the current secondary: this gets reduced to - a must to reserve memory anywhere on the cluster +- the need to reserve memory on the current secondary: this gets reduced + to a must to reserve memory anywhere on the cluster - the need to first failover and then replace secondary for an instance: with failover-to-any, we can directly failover to @@ -1340,7 +1358,8 @@ is fixed to the node the user chooses, but the choice of S2 can be made between P1 and S1. 
This choice can be constrained, depending on which of P1 and S1 has failed. -- if P1 has failed, then S1 must become S2, and live migration is not possible +- if P1 has failed, then S1 must become S2, and live migration is not + possible - if S1 has failed, then P1 must become S2, and live migration could be possible (in theory, but this is not a design goal for 2.0) @@ -1349,13 +1368,13 @@ The algorithm for performing the failover is straightforward: - verify that S2 (the node the user has chosen to keep as secondary) has valid data (is consistent) -- tear down the current DRBD association and setup a DRBD pairing between - P2 (P2 is indicated by the user) and S2; since P2 has no data, it will - start re-syncing from S2 +- tear down the current DRBD association and setup a DRBD pairing + between P2 (P2 is indicated by the user) and S2; since P2 has no data, + it will start re-syncing from S2 -- as soon as P2 is in state SyncTarget (i.e. after the resync has started - but before it has finished), we can promote it to primary role (r/w) - and start the instance on P2 +- as soon as P2 is in state SyncTarget (i.e. after the resync has + started but before it has finished), we can promote it to primary role + (r/w) and start the instance on P2 - as soon as the P2?S2 sync has finished, we can remove the old data on the old node that has not been chosen for @@ -1426,10 +1445,10 @@ changes. OS interface ~~~~~~~~~~~~ -The current Ganeti OS interface, version 5, is tailored for Ganeti 1.2. The -interface is composed by a series of scripts which get called with certain -parameters to perform OS-dependent operations on the cluster. The current -scripts are: +The current Ganeti OS interface, version 5, is tailored for Ganeti 1.2. +The interface is composed by a series of scripts which get called with +certain parameters to perform OS-dependent operations on the cluster. +The current scripts are: create called when a new instance is added to the cluster @@ -1441,18 +1460,20 @@ rename called to perform the os-specific operations necessary for renaming an instance -Currently these scripts suffer from the limitations of Ganeti 1.2: for example -they accept exactly one block and one swap devices to operate on, rather than -any amount of generic block devices, they blindly assume that an instance will -have just one network interface to operate, they can not be configured to -optimise the instance for a particular hypervisor. +Currently these scripts suffer from the limitations of Ganeti 1.2: for +example they accept exactly one block and one swap devices to operate +on, rather than any amount of generic block devices, they blindly assume +that an instance will have just one network interface to operate, they +can not be configured to optimise the instance for a particular +hypervisor. -Since in Ganeti 2.0 we want to support multiple hypervisors, and a non-fixed -number of network and disks the OS interface need to change to transmit the -appropriate amount of information about an instance to its managing operating -system, when operating on it. Moreover since some old assumptions usually used -in OS scripts are no longer valid we need to re-establish a common knowledge on -what can be assumed and what cannot be regarding Ganeti environment. +Since in Ganeti 2.0 we want to support multiple hypervisors, and a +non-fixed number of network and disks the OS interface need to change to +transmit the appropriate amount of information about an instance to its +managing operating system, when operating on it. 
Moreover since some old +assumptions usually used in OS scripts are no longer valid we need to +re-establish a common knowledge on what can be assumed and what cannot +be regarding Ganeti environment. When designing the new OS API our priorities are: @@ -1461,64 +1482,66 @@ When designing the new OS API our priorities are: - ease of porting from the old API - modularity -As such we want to limit the number of scripts that must be written to support -an OS, and make it easy to share code between them by uniforming their input. -We also will leave the current script structure unchanged, as far as we can, -and make a few of the scripts (import, export and rename) optional. Most -information will be passed to the script through environment variables, for -ease of access and at the same time ease of using only the information a script -needs. +As such we want to limit the number of scripts that must be written to +support an OS, and make it easy to share code between them by uniforming +their input. We also will leave the current script structure unchanged, +as far as we can, and make a few of the scripts (import, export and +rename) optional. Most information will be passed to the script through +environment variables, for ease of access and at the same time ease of +using only the information a script needs. The Scripts +++++++++++ -As in Ganeti 1.2, every OS which wants to be installed in Ganeti needs to -support the following functionality, through scripts: +As in Ganeti 1.2, every OS which wants to be installed in Ganeti needs +to support the following functionality, through scripts: create: - used to create a new instance running that OS. This script should prepare the - block devices, and install them so that the new OS can boot under the - specified hypervisor. + used to create a new instance running that OS. This script should + prepare the block devices, and install them so that the new OS can + boot under the specified hypervisor. export (optional): - used to export an installed instance using the given OS to a format which can - be used to import it back into a new instance. + used to export an installed instance using the given OS to a format + which can be used to import it back into a new instance. import (optional): - used to import an exported instance into a new one. This script is similar to - create, but the new instance should have the content of the export, rather - than contain a pristine installation. + used to import an exported instance into a new one. This script is + similar to create, but the new instance should have the content of the + export, rather than contain a pristine installation. rename (optional): - used to perform the internal OS-specific operations needed to rename an - instance. + used to perform the internal OS-specific operations needed to rename + an instance. -If any optional script is not implemented Ganeti will refuse to perform the -given operation on instances using the non-implementing OS. Of course the -create script is mandatory, and it doesn't make sense to support the either the -export or the import operation but not both. +If any optional script is not implemented Ganeti will refuse to perform +the given operation on instances using the non-implementing OS. Of +course the create script is mandatory, and it doesn't make sense to +support the either the export or the import operation but not both. 
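As a sketch of the intended shape of such scripts (using only environment variables from the Input list below; the use of Python and anything not named in that list is an assumption of the example), a minimal create script could look like::

  #!/usr/bin/python
  # Hypothetical OS "create" script: all parameters come from the
  # environment rather than from command line flags.
  import os
  import sys

  def main():
    name = os.environ["INSTANCE_NAME"]
    disk_count = int(os.environ["DISK_COUNT"])
    debug = int(os.environ.get("DEBUG_LEVEL", "0"))

    for idx in range(disk_count):
      access = os.environ["DISK_%d_ACCESS" % idx]
      if access != "W":
        # Read-only disks are passed in but must not be touched.
        continue
      if debug:
        # User-targeted messages must go to stderr only.
        sys.stderr.write("preparing disk %d of %s\n" % (idx, name))
      # ... locate, format and populate the writable disk here ...

    return 0

  if __name__ == "__main__":
    sys.exit(main())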
Incompatibilities with 1.2 __________________________ -We expect the following incompatibilities between the OS scripts for 1.2 and -the ones for 2.0: +We expect the following incompatibilities between the OS scripts for 1.2 +and the ones for 2.0: -- Input parameters: in 1.2 those were passed on the command line, in 2.0 we'll - use environment variables, as there will be a lot more information and not - all OSes may care about all of it. -- Number of calls: export scripts will be called once for each device the - instance has, and import scripts once for every exported disk. Imported - instances will be forced to have a number of disks greater or equal to the - one of the export. -- Some scripts are not compulsory: if such a script is missing the relevant - operations will be forbidden for instances of that OS. This makes it easier - to distinguish between unsupported operations and no-op ones (if any). +- Input parameters: in 1.2 those were passed on the command line, in 2.0 + we'll use environment variables, as there will be a lot more + information and not all OSes may care about all of it. +- Number of calls: export scripts will be called once for each device + the instance has, and import scripts once for every exported disk. + Imported instances will be forced to have a number of disks greater or + equal to the one of the export. +- Some scripts are not compulsory: if such a script is missing the + relevant operations will be forbidden for instances of that OS. This + makes it easier to distinguish between unsupported operations and + no-op ones (if any). Input _____ -Rather than using command line flags, as they do now, scripts will accept -inputs from environment variables. We expect the following input values: +Rather than using command line flags, as they do now, scripts will +accept inputs from environment variables. We expect the following input +values: OS_API_VERSION The version of the OS API that the following parameters comply with; @@ -1528,7 +1551,8 @@ OS_API_VERSION INSTANCE_NAME Name of the instance acted on HYPERVISOR - The hypervisor the instance should run on (e.g. 'xen-pvm', 'xen-hvm', 'kvm') + The hypervisor the instance should run on (e.g. 'xen-pvm', 'xen-hvm', + 'kvm') DISK_COUNT The number of disks this instance will have NIC_COUNT @@ -1539,7 +1563,8 @@ DISK_<N>_ACCESS W if read/write, R if read only. OS scripts are not supposed to touch read-only disks, but will be passed them to know. DISK_<N>_FRONTEND_TYPE - Type of the disk as seen by the instance. Can be 'scsi', 'ide', 'virtio' + Type of the disk as seen by the instance. Can be 'scsi', 'ide', + 'virtio' DISK_<N>_BACKEND_TYPE Type of the disk as seen from the node. Can be 'block', 'file:loop' or 'file:blktap' @@ -1553,8 +1578,8 @@ NIC_<N>_FRONTEND_TYPE Type of the Nth NIC as seen by the instance. For example 'virtio', 'rtl8139', etc. DEBUG_LEVEL - Whether more out should be produced, for debugging purposes. Currently the - only valid values are 0 and 1. + Whether more out should be produced, for debugging purposes. Currently + the only valid values are 0 and 1. These are only the basic variables we are thinking of now, but more may come during the implementation and they will be documented in the @@ -1567,30 +1592,33 @@ per-script variables, such as for example: OLD_INSTANCE_NAME rename: the name the instance should be renamed from. EXPORT_DEVICE - export: device to be exported, a snapshot of the actual device. The data must be exported to stdout. 
+ export: device to be exported, a snapshot of the actual device. The + data must be exported to stdout. EXPORT_INDEX export: sequential number of the instance device targeted. IMPORT_DEVICE - import: device to send the data to, part of the new instance. The data must be imported from stdin. + import: device to send the data to, part of the new instance. The data + must be imported from stdin. IMPORT_INDEX import: sequential number of the instance device targeted. -(Rationale for INSTANCE_NAME as an environment variable: the instance name is -always needed and we could pass it on the command line. On the other hand, -though, this would force scripts to both access the environment and parse the -command line, so we'll move it for uniformity.) +(Rationale for INSTANCE_NAME as an environment variable: the instance +name is always needed and we could pass it on the command line. On the +other hand, though, this would force scripts to both access the +environment and parse the command line, so we'll move it for +uniformity.) Output/Behaviour ________________ -As discussed scripts should only send user-targeted information to stderr. The -create and import scripts are supposed to format/initialise the given block -devices and install the correct instance data. The export script is supposed to -export instance data to stdout in a format understandable by the the import -script. The data will be compressed by Ganeti, so no compression should be -done. The rename script should only modify the instance's knowledge of what -its name is. +As discussed scripts should only send user-targeted information to +stderr. The create and import scripts are supposed to format/initialise +the given block devices and install the correct instance data. The +export script is supposed to export instance data to stdout in a format +understandable by the the import script. The data will be compressed by +Ganeti, so no compression should be done. The rename script should only +modify the instance's knowledge of what its name is. Other declarative style features ++++++++++++++++++++++++++++++++ @@ -1604,22 +1632,23 @@ so an OS supporting both version 5 and version 20 will have a file containing two lines. This is different from Ganeti 1.2, which only supported one version number. -In addition to that an OS will be able to declare that it does support only a -subset of the Ganeti hypervisors, by declaring them in the 'hypervisors' file. +In addition to that an OS will be able to declare that it does support +only a subset of the Ganeti hypervisors, by declaring them in the +'hypervisors' file. Caveats/Notes +++++++++++++ -We might want to have a "default" import/export behaviour that just dumps all -disks and restores them. This can save work as most systems will just do this, -while allowing flexibility for different systems. +We might want to have a "default" import/export behaviour that just +dumps all disks and restores them. This can save work as most systems +will just do this, while allowing flexibility for different systems. -Environment variables are limited in size, but we expect that there will be -enough space to store the information we need. If we discover that this is not -the case we may want to go to a more complex API such as storing those -information on the filesystem and providing the OS script with the path to a -file where they are encoded in some format. +Environment variables are limited in size, but we expect that there will +be enough space to store the information we need. 
If we discover that +this is not the case we may want to go to a more complex API such as +storing those information on the filesystem and providing the OS script +with the path to a file where they are encoded in some format. diff --git a/doc/design-2.1.rst b/doc/design-2.1.rst index db52d43c859894c8a1a78b7aa34dcba950be45f4..67966e5c63edbdb4fed42a4a37236fbdf76d0075 100644 --- a/doc/design-2.1.rst +++ b/doc/design-2.1.rst @@ -5,9 +5,9 @@ Ganeti 2.1 design This document describes the major changes in Ganeti 2.1 compared to the 2.0 version. -The 2.1 version will be a relatively small release. Its main aim is to avoid -changing too much of the core code, while addressing issues and adding new -features and improvements over 2.0, in a timely fashion. +The 2.1 version will be a relatively small release. Its main aim is to +avoid changing too much of the core code, while addressing issues and +adding new features and improvements over 2.0, in a timely fashion. .. contents:: :depth: 4 @@ -15,8 +15,8 @@ Objective ========= Ganeti 2.1 will add features to help further automatization of cluster -operations, further improbe scalability to even bigger clusters, and make it -easier to debug the Ganeti core. +operations, further improbe scalability to even bigger clusters, and +make it easier to debug the Ganeti core. Background ========== @@ -29,8 +29,8 @@ Detailed design As for 2.0 we divide the 2.1 design into three areas: -- core changes, which affect the master daemon/job queue/locking or all/most - logical units +- core changes, which affect the master daemon/job queue/locking or + all/most logical units - logical unit/feature changes - external interface changes (eg. command line, os api, hooks, ...) @@ -60,7 +60,8 @@ will provide, like: - list of storage units of this type - check status of the storage unit -Additionally, there will be specific methods for each method, for example: +Additionally, there will be specific methods for each method, for +example: - enable/disable allocations on a specific PV - file storage directory creation/deletion @@ -88,22 +89,22 @@ Current State and shortcomings ++++++++++++++++++++++++++++++ The class ``LockSet`` (see ``lib/locking.py``) is a container for one or -many ``SharedLock`` instances. It provides an interface to add/remove locks -and to acquire and subsequently release any number of those locks contained -in it. +many ``SharedLock`` instances. It provides an interface to add/remove +locks and to acquire and subsequently release any number of those locks +contained in it. -Locks in a ``LockSet`` are always acquired in alphabetic order. Due to the -way we're using locks for nodes and instances (the single cluster lock isn't -affected by this issue) this can lead to long delays when acquiring locks if -another operation tries to acquire multiple locks but has to wait for yet -another operation. +Locks in a ``LockSet`` are always acquired in alphabetic order. Due to +the way we're using locks for nodes and instances (the single cluster +lock isn't affected by this issue) this can lead to long delays when +acquiring locks if another operation tries to acquire multiple locks but +has to wait for yet another operation. In the following demonstration we assume to have the instance locks ``inst1``, ``inst2``, ``inst3`` and ``inst4``. #. Operation A grabs lock for instance ``inst4``. -#. Operation B wants to acquire all instance locks in alphabetic order, but - it has to wait for ``inst4``. +#. 
Operation B wants to acquire all instance locks in alphabetic order, + but it has to wait for ``inst4``. #. Operation C tries to lock ``inst1``, but it has to wait until Operation B (which is trying to acquire all locks) releases the lock again. @@ -121,45 +122,47 @@ Proposed changes Non-blocking lock acquiring ^^^^^^^^^^^^^^^^^^^^^^^^^^^ -Acquiring locks for OpCode execution is always done in blocking mode. They -won't return until the lock has successfully been acquired (or an error -occurred, although we won't cover that case here). +Acquiring locks for OpCode execution is always done in blocking mode. +They won't return until the lock has successfully been acquired (or an +error occurred, although we won't cover that case here). -``SharedLock`` and ``LockSet`` must be able to be acquired in a non-blocking -way. They must support a timeout and abort trying to acquire the lock(s) -after the specified amount of time. +``SharedLock`` and ``LockSet`` must be able to be acquired in a +non-blocking way. They must support a timeout and abort trying to +acquire the lock(s) after the specified amount of time. Retry acquiring locks ^^^^^^^^^^^^^^^^^^^^^ -To prevent other operations from waiting for a long time, such as described -in the demonstration before, ``LockSet`` must not keep locks for a prolonged -period of time when trying to acquire two or more locks. Instead it should, -with an increasing timeout for acquiring all locks, release all locks again -and sleep some time if it fails to acquire all requested locks. +To prevent other operations from waiting for a long time, such as +described in the demonstration before, ``LockSet`` must not keep locks +for a prolonged period of time when trying to acquire two or more locks. +Instead it should, with an increasing timeout for acquiring all locks, +release all locks again and sleep some time if it fails to acquire all +requested locks. -A good timeout value needs to be determined. In any case should ``LockSet`` -proceed to acquire locks in blocking mode after a few (unsuccessful) -attempts to acquire all requested locks. +A good timeout value needs to be determined. In any case should +``LockSet`` proceed to acquire locks in blocking mode after a few +(unsuccessful) attempts to acquire all requested locks. -One proposal for the timeout is to use ``2**tries`` seconds, where ``tries`` -is the number of unsuccessful tries. +One proposal for the timeout is to use ``2**tries`` seconds, where +``tries`` is the number of unsuccessful tries. -In the demonstration before this would allow Operation C to continue after -Operation B unsuccessfully tried to acquire all locks and released all -acquired locks (``inst1``, ``inst2`` and ``inst3``) again. +In the demonstration before this would allow Operation C to continue +after Operation B unsuccessfully tried to acquire all locks and released +all acquired locks (``inst1``, ``inst2`` and ``inst3``) again. Other solutions discussed +++++++++++++++++++++++++ -There was also some discussion on going one step further and extend the job -queue (see ``lib/jqueue.py``) to select the next task for a worker depending -on whether it can acquire the necessary locks. While this may reduce the -number of necessary worker threads and/or increase throughput on large -clusters with many jobs, it also brings many potential problems, such as -contention and increased memory usage, with it. 
As this would be an -extension of the changes proposed before it could be implemented at a later -point in time, but we decided to stay with the simpler solution for now. +There was also some discussion on going one step further and extend the +job queue (see ``lib/jqueue.py``) to select the next task for a worker +depending on whether it can acquire the necessary locks. While this may +reduce the number of necessary worker threads and/or increase throughput +on large clusters with many jobs, it also brings many potential +problems, such as contention and increased memory usage, with it. As +this would be an extension of the changes proposed before it could be +implemented at a later point in time, but we decided to stay with the +simpler solution for now. Implementation details ++++++++++++++++++++++ @@ -169,64 +172,68 @@ Implementation details The current design of ``SharedLock`` is not good for supporting timeouts when acquiring a lock and there are also minor fairness issues in it. We -plan to address both with a redesign. A proof of concept implementation was -written and resulted in significantly simpler code. - -Currently ``SharedLock`` uses two separate queues for shared and exclusive -acquires and waiters get to run in turns. This means if an exclusive acquire -is released, the lock will allow shared waiters to run and vice versa. -Although it's still fair in the end there is a slight bias towards shared -waiters in the current implementation. The same implementation with two -shared queues can not support timeouts without adding a lot of complexity. - -Our proposed redesign changes ``SharedLock`` to have only one single queue. -There will be one condition (see Condition_ for a note about performance) in -the queue per exclusive acquire and two for all shared acquires (see below for -an explanation). The maximum queue length will always be ``2 + (number of -exclusive acquires waiting)``. The number of queue entries for shared acquires -can vary from 0 to 2. - -The two conditions for shared acquires are a bit special. They will be used -in turn. When the lock is instantiated, no conditions are in the queue. As -soon as the first shared acquire arrives (and there are holder(s) or waiting -acquires; see Acquire_), the active condition is added to the queue. Until -it becomes the topmost condition in the queue and has been notified, any -shared acquire is added to this active condition. When the active condition -is notified, the conditions are swapped and further shared acquires are -added to the previously inactive condition (which has now become the active -condition). After all waiters on the previously active (now inactive) and -now notified condition received the notification, it is removed from the -queue of pending acquires. - -This means shared acquires will skip any exclusive acquire in the queue. We -believe it's better to improve parallelization on operations only asking for -shared (or read-only) locks. Exclusive operations holding the same lock can -not be parallelized. +plan to address both with a redesign. A proof of concept implementation +was written and resulted in significantly simpler code. + +Currently ``SharedLock`` uses two separate queues for shared and +exclusive acquires and waiters get to run in turns. This means if an +exclusive acquire is released, the lock will allow shared waiters to run +and vice versa. Although it's still fair in the end there is a slight +bias towards shared waiters in the current implementation. 
The same +implementation with two shared queues can not support timeouts without +adding a lot of complexity. + +Our proposed redesign changes ``SharedLock`` to have only one single +queue. There will be one condition (see Condition_ for a note about +performance) in the queue per exclusive acquire and two for all shared +acquires (see below for an explanation). The maximum queue length will +always be ``2 + (number of exclusive acquires waiting)``. The number of +queue entries for shared acquires can vary from 0 to 2. + +The two conditions for shared acquires are a bit special. They will be +used in turn. When the lock is instantiated, no conditions are in the +queue. As soon as the first shared acquire arrives (and there are +holder(s) or waiting acquires; see Acquire_), the active condition is +added to the queue. Until it becomes the topmost condition in the queue +and has been notified, any shared acquire is added to this active +condition. When the active condition is notified, the conditions are +swapped and further shared acquires are added to the previously inactive +condition (which has now become the active condition). After all waiters +on the previously active (now inactive) and now notified condition +received the notification, it is removed from the queue of pending +acquires. + +This means shared acquires will skip any exclusive acquire in the queue. +We believe it's better to improve parallelization on operations only +asking for shared (or read-only) locks. Exclusive operations holding the +same lock can not be parallelized. Acquire ******* -For exclusive acquires a new condition is created and appended to the queue. -Shared acquires are added to the active condition for shared acquires and if -the condition is not yet on the queue, it's appended. +For exclusive acquires a new condition is created and appended to the +queue. Shared acquires are added to the active condition for shared +acquires and if the condition is not yet on the queue, it's appended. -The next step is to wait for our condition to be on the top of the queue (to -guarantee fairness). If the timeout expired, we return to the caller without -acquiring the lock. On every notification we check whether the lock has been -deleted, in which case an error is returned to the caller. +The next step is to wait for our condition to be on the top of the queue +(to guarantee fairness). If the timeout expired, we return to the caller +without acquiring the lock. On every notification we check whether the +lock has been deleted, in which case an error is returned to the caller. -The lock can be acquired if we're on top of the queue (there is no one else -ahead of us). For an exclusive acquire, there must not be other exclusive or -shared holders. For a shared acquire, there must not be an exclusive holder. -If these conditions are all true, the lock is acquired and we return to the -caller. In any other case we wait again on the condition. +The lock can be acquired if we're on top of the queue (there is no one +else ahead of us). For an exclusive acquire, there must not be other +exclusive or shared holders. For a shared acquire, there must not be an +exclusive holder. If these conditions are all true, the lock is +acquired and we return to the caller. In any other case we wait again on +the condition. -If it was the last waiter on a condition, the condition is removed from the -queue. +If it was the last waiter on a condition, the condition is removed from +the queue. 
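+
+The decision rule of the acquire step described above can be condensed
+into a few lines (a sketch only; the real lock tracks this state
+internally rather than receiving it as arguments)::
+
+  def can_grant(shared, exc_holder, shd_holders, at_top_of_queue):
+    """Tell whether a waiting acquire may proceed."""
+    if not at_top_of_queue:
+      # Fairness: someone queued before us goes first
+      return False
+    if shared:
+      # Shared acquire: only an exclusive holder blocks us
+      return exc_holder is None
+    # Exclusive acquire: no holders of any kind allowed
+    return exc_holder is None and not shd_holders
+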
Optimization: There's no need to touch the queue if there are no pending -acquires and no current holders. The caller can have the lock immediately. +acquires and no current holders. The caller can have the lock +immediately. .. image:: design-2.1-lock-acquire.png @@ -234,12 +241,14 @@ acquires and no current holders. The caller can have the lock immediately. Release ******* -First the lock removes the caller from the internal owner list. If there are -pending acquires in the queue, the first (the oldest) condition is notified. +First the lock removes the caller from the internal owner list. If there +are pending acquires in the queue, the first (the oldest) condition is +notified. If the first condition was the active condition for shared acquires, the -inactive condition will be made active. This ensures fairness with exclusive -locks by forcing consecutive shared acquires to wait in the queue. +inactive condition will be made active. This ensures fairness with +exclusive locks by forcing consecutive shared acquires to wait in the +queue. .. image:: design-2.1-lock-release.png @@ -247,40 +256,40 @@ locks by forcing consecutive shared acquires to wait in the queue. Delete ****** -The caller must either hold the lock in exclusive mode already or the lock -must be acquired in exclusive mode. Trying to delete a lock while it's held -in shared mode must fail. +The caller must either hold the lock in exclusive mode already or the +lock must be acquired in exclusive mode. Trying to delete a lock while +it's held in shared mode must fail. -After ensuring the lock is held in exclusive mode, the lock will mark itself -as deleted and continue to notify all pending acquires. They will wake up, -notice the deleted lock and return an error to the caller. +After ensuring the lock is held in exclusive mode, the lock will mark +itself as deleted and continue to notify all pending acquires. They will +wake up, notice the deleted lock and return an error to the caller. Condition ^^^^^^^^^ -Note: This is not necessary for the locking changes above, but it may be a -good optimization (pending performance tests). +Note: This is not necessary for the locking changes above, but it may be +a good optimization (pending performance tests). The existing locking code in Ganeti 2.0 uses Python's built-in ``threading.Condition`` class. Unfortunately ``Condition`` implements -timeouts by sleeping 1ms to 20ms between tries to acquire the condition lock -in non-blocking mode. This requires unnecessary context switches and -contention on the CPython GIL (Global Interpreter Lock). +timeouts by sleeping 1ms to 20ms between tries to acquire the condition +lock in non-blocking mode. This requires unnecessary context switches +and contention on the CPython GIL (Global Interpreter Lock). By using POSIX pipes (see ``pipe(2)``) we can use the operating system's support for timeouts on file descriptors (see ``select(2)``). A custom condition class will have to be written for this. On instantiation the class creates a pipe. After each notification the -previous pipe is abandoned and re-created (technically the old pipe needs to -stay around until all notifications have been delivered). +previous pipe is abandoned and re-created (technically the old pipe +needs to stay around until all notifications have been delivered). All waiting clients of the condition use ``select(2)`` or ``poll(2)`` to -wait for notifications, optionally with a timeout. A notification will be -signalled to the waiting clients by closing the pipe. 
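+
+The mechanism can be illustrated with a few lines of Python (a sketch
+of the idea only, not the proposed class; error handling and the
+re-creation of the pipe after a notification are left out)::
+
+  import os
+  import select
+
+  class PipeNotifier(object):
+    """One-shot notification channel built on a POSIX pipe."""
+
+    def __init__(self):
+      self._read_fd, self._write_fd = os.pipe()
+
+    def wait(self, timeout=None):
+      """Return True if notified, False if the timeout expired."""
+      rlist, _, _ = select.select([self._read_fd], [], [], timeout)
+      return bool(rlist)
+
+    def notify_all(self):
+      """Wake up every waiter by closing the write side."""
+      os.close(self._write_fd)
+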
If the pipe wasn't -closed during the timeout, the waiting function returns to its caller -nonetheless. +wait for notifications, optionally with a timeout. A notification will +be signalled to the waiting clients by closing the pipe. If the pipe +wasn't closed during the timeout, the waiting function returns to its +caller nonetheless. Feature changes @@ -291,50 +300,53 @@ Ganeti Confd Current State and shortcomings ++++++++++++++++++++++++++++++ -In Ganeti 2.0 all nodes are equal, but some are more equal than others. In -particular they are divided between "master", "master candidates" and "normal". -(Moreover they can be offline or drained, but this is not important for the -current discussion). In general the whole configuration is only replicated to -master candidates, and some partial information is spread to all nodes via -ssconf. - -This change was done so that the most frequent Ganeti operations didn't need to -contact all nodes, and so clusters could become bigger. If we want more -information to be available on all nodes, we need to add more ssconf values, -which is counter-balancing the change, or to talk with the master node, which -is not designed to happen now, and requires its availability. - -Information such as the instance->primary_node mapping will be needed on all -nodes, and we also want to make sure services external to the cluster can query -this information as well. This information must be available at all times, so -we can't query it through RAPI, which would be a single point of failure, as -it's only available on the master. + +In Ganeti 2.0 all nodes are equal, but some are more equal than others. +In particular they are divided between "master", "master candidates" and +"normal". (Moreover they can be offline or drained, but this is not +important for the current discussion). In general the whole +configuration is only replicated to master candidates, and some partial +information is spread to all nodes via ssconf. + +This change was done so that the most frequent Ganeti operations didn't +need to contact all nodes, and so clusters could become bigger. If we +want more information to be available on all nodes, we need to add more +ssconf values, which is counter-balancing the change, or to talk with +the master node, which is not designed to happen now, and requires its +availability. + +Information such as the instance->primary_node mapping will be needed on +all nodes, and we also want to make sure services external to the +cluster can query this information as well. This information must be +available at all times, so we can't query it through RAPI, which would +be a single point of failure, as it's only available on the master. Proposed changes ++++++++++++++++ In order to allow fast and highly available access read-only to some -configuration values, we'll create a new ganeti-confd daemon, which will run on -master candidates. This daemon will talk via UDP, and authenticate messages -using HMAC with a cluster-wide shared key. This key will be generated at -cluster init time, and stored on the clusters alongside the ganeti SSL keys, -and readable only by root. - -An interested client can query a value by making a request to a subset of the -cluster master candidates. It will then wait to get a few responses, and use -the one with the highest configuration serial number. 
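+
+A rough sketch of the client side of this scheme follows; the use of
+JSON, SHA-1 for the HMAC, the ``serial`` field name and the helper
+names are assumptions made for the example (``key`` is the shared
+cluster key as bytes), the actual wire format is defined below::
+
+  import hmac
+  import json
+  import time
+  from hashlib import sha1
+
+  def sign(key, msg, salt):
+    """Return the HMAC of salt+msg with the cluster key."""
+    return hmac.new(key, (salt + msg).encode("utf-8"),
+                    sha1).hexdigest()
+
+  def make_query(key, payload):
+    """Wrap a query; the salt is the current unix timestamp."""
+    msg = json.dumps(payload)
+    salt = str(int(time.time()))
+    return json.dumps({"msg": msg, "salt": salt,
+                       "hmac": sign(key, msg, salt)})
+
+  def check_packet(key, raw):
+    """Verify a packet's signature and return its fields."""
+    packet = json.loads(raw)
+    expected = sign(key, packet["msg"], packet["salt"])
+    if not hmac.compare_digest(expected, packet["hmac"]):
+      raise ValueError("bad signature")
+    return packet
+
+  def best_answer(answers):
+    """Of several verified answers, keep the most recent config."""
+    return max(answers, key=lambda a: a["serial"])
+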
Since the configuration -serial number is increased each time the ganeti config is updated, and the -serial number is included in all answers, this can be used to make sure to use -the most recent answer, in case some master candidates are stale or in the -middle of a configuration update. +configuration values, we'll create a new ganeti-confd daemon, which will +run on master candidates. This daemon will talk via UDP, and +authenticate messages using HMAC with a cluster-wide shared key. This +key will be generated at cluster init time, and stored on the clusters +alongside the ganeti SSL keys, and readable only by root. + +An interested client can query a value by making a request to a subset +of the cluster master candidates. It will then wait to get a few +responses, and use the one with the highest configuration serial number. +Since the configuration serial number is increased each time the ganeti +config is updated, and the serial number is included in all answers, +this can be used to make sure to use the most recent answer, in case +some master candidates are stale or in the middle of a configuration +update. In order to prevent replay attacks queries will contain the current unix timestamp according to the client, and the server will verify that its -timestamp is in the same 5 minutes range (this requires synchronized clocks, -which is a good idea anyway). Queries will also contain a "salt" which they -expect the answers to be sent with, and clients are supposed to accept only -answers which contain salt generated by them. +timestamp is in the same 5 minutes range (this requires synchronized +clocks, which is a good idea anyway). Queries will also contain a "salt" +which they expect the answers to be sent with, and clients are supposed +to accept only answers which contain salt generated by them. The configuration daemon will be able to answer simple queries such as: @@ -364,20 +376,21 @@ Detailed explanation of the various fields: - 'protocol', integer, is the confd protocol version (initially just constants.CONFD_PROTOCOL_VERSION, with a value of 1) - - 'type', integer, is the query type. For example "node role by name" or - "node primary ip by instance ip". Constants will be provided for the actual - available query types. - - 'query', string, is the search key. For example an ip, or a node name. - - 'rsalt', string, is the required response salt. The client must use it to - recognize which answer it's getting. - -- 'salt' must be the current unix timestamp, according to the client. Servers - can refuse messages which have a wrong timing, according to their - configuration and clock. + - 'type', integer, is the query type. For example "node role by name" + or "node primary ip by instance ip". Constants will be provided for + the actual available query types. + - 'query', string, is the search key. For example an ip, or a node + name. + - 'rsalt', string, is the required response salt. The client must use + it to recognize which answer it's getting. + +- 'salt' must be the current unix timestamp, according to the client. + Servers can refuse messages which have a wrong timing, according to + their configuration and clock. 
- 'hmac' is an hmac signature of salt+msg, with the cluster hmac key -If an answer comes back (which is optional, since confd works over UDP) it will -be in this format:: +If an answer comes back (which is optional, since confd works over UDP) +it will be in this format:: { "msg": "{\"status\": 0, @@ -394,18 +407,18 @@ Where: - 'protocol', integer, is the confd protocol version (initially just constants.CONFD_PROTOCOL_VERSION, with a value of 1) - - 'status', integer, is the error code. Initially just 0 for 'ok' or '1' for - 'error' (in which case answer contains an error detail, rather than an - answer), but in the future it may be expanded to have more meanings (eg: 2, - the answer is compressed) - - 'answer', is the actual answer. Its type and meaning is query specific. For - example for "node primary ip by instance ip" queries it will be a string - containing an IP address, for "node role by name" queries it will be an - integer which encodes the role (master, candidate, drained, offline) - according to constants. - -- 'salt' is the requested salt from the query. A client can use it to recognize - what query the answer is answering. + - 'status', integer, is the error code. Initially just 0 for 'ok' or + '1' for 'error' (in which case answer contains an error detail, + rather than an answer), but in the future it may be expanded to have + more meanings (eg: 2, the answer is compressed) + - 'answer', is the actual answer. Its type and meaning is query + specific. For example for "node primary ip by instance ip" queries + it will be a string containing an IP address, for "node role by + name" queries it will be an integer which encodes the role (master, + candidate, drained, offline) according to constants. + +- 'salt' is the requested salt from the query. A client can use it to + recognize what query the answer is answering. - 'hmac' is an hmac signature of salt+msg, with the cluster hmac key @@ -414,40 +427,44 @@ Redistribute Config Current State and shortcomings ++++++++++++++++++++++++++++++ -Currently LURedistributeConfig triggers a copy of the updated configuration -file to all master candidates and of the ssconf files to all nodes. There are -other files which are maintained manually but which are important to keep in -sync. These are: + +Currently LURedistributeConfig triggers a copy of the updated +configuration file to all master candidates and of the ssconf files to +all nodes. There are other files which are maintained manually but which +are important to keep in sync. These are: - rapi SSL key certificate file (rapi.pem) (on master candidates) - rapi user/password file rapi_users (on master candidates) -Furthermore there are some files which are hypervisor specific but we may want -to keep in sync: +Furthermore there are some files which are hypervisor specific but we +may want to keep in sync: -- the xen-hvm hypervisor uses one shared file for all vnc passwords, and copies - the file once, during node add. This design is subject to revision to be able - to have different passwords for different groups of instances via the use of - hypervisor parameters, and to allow xen-hvm and kvm to use an equal system to - provide password-protected vnc sessions. In general, though, it would be - useful if the vnc password files were copied as well, to avoid unwanted vnc - password changes on instance failover/migrate. +- the xen-hvm hypervisor uses one shared file for all vnc passwords, and + copies the file once, during node add. 
This design is subject to + revision to be able to have different passwords for different groups + of instances via the use of hypervisor parameters, and to allow + xen-hvm and kvm to use an equal system to provide password-protected + vnc sessions. In general, though, it would be useful if the vnc + password files were copied as well, to avoid unwanted vnc password + changes on instance failover/migrate. -Optionally the admin may want to also ship files such as the global xend.conf -file, and the network scripts to all nodes. +Optionally the admin may want to also ship files such as the global +xend.conf file, and the network scripts to all nodes. Proposed changes ++++++++++++++++ -RedistributeConfig will be changed to copy also the rapi files, and to call -every enabled hypervisor asking for a list of additional files to copy. Users -will have the possibility to populate a file containing a list of files to be -distributed; this file will be propagated as well. Such solution is really -simple to implement and it's easily usable by scripts. +RedistributeConfig will be changed to copy also the rapi files, and to +call every enabled hypervisor asking for a list of additional files to +copy. Users will have the possibility to populate a file containing a +list of files to be distributed; this file will be propagated as well. +Such solution is really simple to implement and it's easily usable by +scripts. -This code will be also shared (via tasklets or by other means, if tasklets are -not ready for 2.1) with the AddNode and SetNodeParams LUs (so that the relevant -files will be automatically shipped to new master candidates as they are set). +This code will be also shared (via tasklets or by other means, if +tasklets are not ready for 2.1) with the AddNode and SetNodeParams LUs +(so that the relevant files will be automatically shipped to new master +candidates as they are set). VNC Console Password ~~~~~~~~~~~~~~~~~~~~ @@ -455,28 +472,31 @@ VNC Console Password Current State and shortcomings ++++++++++++++++++++++++++++++ -Currently just the xen-hvm hypervisor supports setting a password to connect -the the instances' VNC console, and has one common password stored in a file. +Currently just the xen-hvm hypervisor supports setting a password to +connect the the instances' VNC console, and has one common password +stored in a file. This doesn't allow different passwords for different instances/groups of -instances, and makes it necessary to remember to copy the file around the -cluster when the password changes. +instances, and makes it necessary to remember to copy the file around +the cluster when the password changes. Proposed changes ++++++++++++++++ -We'll change the VNC password file to a vnc_password_file hypervisor parameter. -This way it can have a cluster default, but also a different value for each -instance. The VNC enabled hypervisors (xen and kvm) will publish all the -password files in use through the cluster so that a redistribute-config will -ship them to all nodes (see the Redistribute Config proposed changes above). +We'll change the VNC password file to a vnc_password_file hypervisor +parameter. This way it can have a cluster default, but also a different +value for each instance. The VNC enabled hypervisors (xen and kvm) will +publish all the password files in use through the cluster so that a +redistribute-config will ship them to all nodes (see the Redistribute +Config proposed changes above). 
-The current VNC_PASSWORD_FILE constant will be removed, but its value will be -used as the default HV_VNC_PASSWORD_FILE value, thus retaining backwards -compatibility with 2.0. +The current VNC_PASSWORD_FILE constant will be removed, but its value +will be used as the default HV_VNC_PASSWORD_FILE value, thus retaining +backwards compatibility with 2.0. -The code to export the list of VNC password files from the hypervisors to -RedistributeConfig will be shared between the KVM and xen-hvm hypervisors. +The code to export the list of VNC password files from the hypervisors +to RedistributeConfig will be shared between the KVM and xen-hvm +hypervisors. Disk/Net parameters ~~~~~~~~~~~~~~~~~~~ @@ -484,25 +504,27 @@ Disk/Net parameters Current State and shortcomings ++++++++++++++++++++++++++++++ -Currently disks and network interfaces have a few tweakable options and all the -rest is left to a default we chose. We're finding that we need more and more to -tweak some of these parameters, for example to disable barriers for DRBD -devices, or allow striping for the LVM volumes. +Currently disks and network interfaces have a few tweakable options and +all the rest is left to a default we chose. We're finding that we need +more and more to tweak some of these parameters, for example to disable +barriers for DRBD devices, or allow striping for the LVM volumes. -Moreover for many of these parameters it will be nice to have cluster-wide -defaults, and then be able to change them per disk/interface. +Moreover for many of these parameters it will be nice to have +cluster-wide defaults, and then be able to change them per +disk/interface. Proposed changes ++++++++++++++++ -We will add new cluster level diskparams and netparams, which will contain all -the tweakable parameters. All values which have a sensible cluster-wide default -will go into this new structure while parameters which have unique values will not. +We will add new cluster level diskparams and netparams, which will +contain all the tweakable parameters. All values which have a sensible +cluster-wide default will go into this new structure while parameters +which have unique values will not. Example of network parameters: - mode: bridge/route - - link: for mode "bridge" the bridge to connect to, for mode route it can - contain the routing table, or the destination interface + - link: for mode "bridge" the bridge to connect to, for mode route it + can contain the routing table, or the destination interface Example of disk parameters: - stripe: lvm stripes @@ -510,16 +532,17 @@ Example of disk parameters: - meta_flushes: drbd, enable/disable metadata "barriers" - data_flushes: drbd, enable/disable data "barriers" -Some parameters are bound to be disk-type specific (drbd, vs lvm, vs files) or -hypervisor specific (nic models for example), but for now they will all live in -the same structure. Each component is supposed to validate only the parameters -it knows about, and ganeti itself will make sure that no "globally unknown" -parameters are added, and that no parameters have overridden meanings for -different components. +Some parameters are bound to be disk-type specific (drbd, vs lvm, vs +files) or hypervisor specific (nic models for example), but for now they +will all live in the same structure. Each component is supposed to +validate only the parameters it knows about, and ganeti itself will make +sure that no "globally unknown" parameters are added, and that no +parameters have overridden meanings for different components. 
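+
+Purely as an illustration (the key names are taken from the examples
+in this section, not from a finished schema), the new structures and
+the defaulting behaviour could look like::
+
+  # Cluster-wide defaults
+  netparams = {
+    "mode": "bridge",   # or "route"
+    "link": "xen-br0",  # bridge, or routing table in "route" mode
+  }
+
+  diskparams = {
+    "stripe": 1,           # lvm stripes
+    "meta_flushes": True,  # drbd metadata "barriers"
+    "data_flushes": True,  # drbd data "barriers"
+  }
+
+  def fill_params(defaults, overrides):
+    """Overlay per-disk/per-nic values on the cluster defaults."""
+    params = defaults.copy()
+    params.update(overrides or {})
+    return params
+
+  # A single DRBD disk that only disables the data barriers
+  one_disk = fill_params(diskparams, {"data_flushes": False})
+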
-The parameters will be kept, as for the BEPARAMS into a "default" category, -which will allow us to expand on by creating instance "classes" in the future. -Instance classes is not a feature we plan implementing in 2.1, though. +The parameters will be kept, as for the BEPARAMS into a "default" +category, which will allow us to expand on by creating instance +"classes" in the future. Instance classes is not a feature we plan +implementing in 2.1, though. Non bridged instances support ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -527,33 +550,34 @@ Non bridged instances support Current State and shortcomings ++++++++++++++++++++++++++++++ -Currently each instance NIC must be connected to a bridge, and if the bridge is -not specified the default cluster one is used. This makes it impossible to use -the vif-route xen network scripts, or other alternative mechanisms that don't -need a bridge to work. +Currently each instance NIC must be connected to a bridge, and if the +bridge is not specified the default cluster one is used. This makes it +impossible to use the vif-route xen network scripts, or other +alternative mechanisms that don't need a bridge to work. Proposed changes ++++++++++++++++ -The new "mode" network parameter will distinguish between bridged interfaces -and routed ones. +The new "mode" network parameter will distinguish between bridged +interfaces and routed ones. -When mode is "bridge" the "link" parameter will contain the bridge the instance -should be connected to, effectively making things as today. The value has been -migrated from a nic field to a parameter to allow for an easier manipulation of -the cluster default. +When mode is "bridge" the "link" parameter will contain the bridge the +instance should be connected to, effectively making things as today. The +value has been migrated from a nic field to a parameter to allow for an +easier manipulation of the cluster default. -When mode is "route" the ip field of the interface will become mandatory, to -allow for a route to be set. In the future we may want also to accept multiple -IPs or IP/mask values for this purpose. We will evaluate possible meanings of -the link parameter to signify a routing table to be used, which would allow for -insulation between instance groups (as today happens for different bridges). +When mode is "route" the ip field of the interface will become +mandatory, to allow for a route to be set. In the future we may want +also to accept multiple IPs or IP/mask values for this purpose. We will +evaluate possible meanings of the link parameter to signify a routing +table to be used, which would allow for insulation between instance +groups (as today happens for different bridges). -For now we won't add a parameter to specify which network script gets called -for which instance, so in a mixed cluster the network script must be able to -handle both cases. The default kvm vif script will be changed to do so. (Xen -doesn't have a ganeti provided script, so nothing will be done for that -hypervisor) +For now we won't add a parameter to specify which network script gets +called for which instance, so in a mixed cluster the network script must +be able to handle both cases. The default kvm vif script will be changed +to do so. (Xen doesn't have a ganeti provided script, so nothing will be +done for that hypervisor) Introducing persistent UUIDs ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -612,59 +636,59 @@ require a complete lock of all instances. 
Automated disk repairs infrastructure ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Replacing defective disks in an automated fashion is quite difficult with the -current version of Ganeti. These changes will introduce additional -functionality and interfaces to simplify automating disk replacements on a -Ganeti node. +Replacing defective disks in an automated fashion is quite difficult +with the current version of Ganeti. These changes will introduce +additional functionality and interfaces to simplify automating disk +replacements on a Ganeti node. Fix node volume group +++++++++++++++++++++ -This is the most difficult addition, as it can lead to dataloss if it's not -properly safeguarded. +This is the most difficult addition, as it can lead to dataloss if it's +not properly safeguarded. -The operation must be done only when all the other nodes that have instances in -common with the target node are fine, i.e. this is the only node with problems, -and also we have to double-check that all instances on this node have at least -a good copy of the data. +The operation must be done only when all the other nodes that have +instances in common with the target node are fine, i.e. this is the only +node with problems, and also we have to double-check that all instances +on this node have at least a good copy of the data. This might mean that we have to enhance the GetMirrorStatus calls, and -introduce and a smarter version that can tell us more about the status of an -instance. +introduce and a smarter version that can tell us more about the status +of an instance. Stop allocation on a given PV +++++++++++++++++++++++++++++ -This is somewhat simple. First we need a "list PVs" opcode (and its associated -logical unit) and then a set PV status opcode/LU. These in combination should -allow both checking and changing the disk/PV status. +This is somewhat simple. First we need a "list PVs" opcode (and its +associated logical unit) and then a set PV status opcode/LU. These in +combination should allow both checking and changing the disk/PV status. Instance disk status ++++++++++++++++++++ -This new opcode or opcode change must list the instance-disk-index and node -combinations of the instance together with their status. This will allow -determining what part of the instance is broken (if any). +This new opcode or opcode change must list the instance-disk-index and +node combinations of the instance together with their status. This will +allow determining what part of the instance is broken (if any). Repair instance +++++++++++++++ -This new opcode/LU/RAPI call will run ``replace-disks -p`` as needed, in order -to fix the instance status. It only affects primary instances; secondaries can -just be moved away. +This new opcode/LU/RAPI call will run ``replace-disks -p`` as needed, in +order to fix the instance status. It only affects primary instances; +secondaries can just be moved away. Migrate node ++++++++++++ -This new opcode/LU/RAPI call will take over the current ``gnt-node migrate`` -code and run migrate for all instances on the node. +This new opcode/LU/RAPI call will take over the current ``gnt-node +migrate`` code and run migrate for all instances on the node. Evacuate node ++++++++++++++ -This new opcode/LU/RAPI call will take over the current ``gnt-node evacuate`` -code and run replace-secondary with an iallocator script for all instances on -the node. 
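+
+The way these pieces are meant to work together can be sketched as a
+small external repair driver. Every method called on ``client`` below
+is hypothetical and merely stands in for the proposed opcodes/RAPI
+calls of this section::
+
+  def repair_node(client, node):
+    """Drive the proposed repair steps for one broken node."""
+    # 1. Stop allocation on the defective physical volume(s)
+    for pv in client.list_pvs(node):              # hypothetical
+      if not pv["healthy"]:
+        client.set_pv_allocatable(node, pv["name"], False)
+
+    # 2. Find out which instance disks are degraded
+    for disk in client.instance_disk_status(node):  # hypothetical
+      if not disk["degraded"]:
+        continue
+      if disk["primary_node"] == node:
+        # replace-disks -p for the affected (primary) instance
+        client.repair_instance(disk["instance"])  # hypothetical
+      else:
+        # secondaries can simply be moved away, node-wide
+        client.evacuate_node(node)                # hypothetical
+        break
+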
+This new opcode/LU/RAPI call will take over the current ``gnt-node +evacuate`` code and run replace-secondary with an iallocator script for +all instances on the node. External interface changes @@ -673,90 +697,92 @@ External interface changes OS API ~~~~~~ -The OS API of Ganeti 2.0 has been built with extensibility in mind. Since we -pass everything as environment variables it's a lot easier to send new -information to the OSes without breaking retrocompatibility. This section of -the design outlines the proposed extensions to the API and their -implementation. +The OS API of Ganeti 2.0 has been built with extensibility in mind. +Since we pass everything as environment variables it's a lot easier to +send new information to the OSes without breaking retrocompatibility. +This section of the design outlines the proposed extensions to the API +and their implementation. API Version Compatibility Handling ++++++++++++++++++++++++++++++++++ -In 2.1 there will be a new OS API version (eg. 15), which should be mostly -compatible with api 10, except for some new added variables. Since it's easy -not to pass some variables we'll be able to handle Ganeti 2.0 OSes by just -filtering out the newly added piece of information. We will still encourage -OSes to declare support for the new API after checking that the new variables -don't provide any conflict for them, and we will drop api 10 support after -ganeti 2.1 has released. +In 2.1 there will be a new OS API version (eg. 15), which should be +mostly compatible with api 10, except for some new added variables. +Since it's easy not to pass some variables we'll be able to handle +Ganeti 2.0 OSes by just filtering out the newly added piece of +information. We will still encourage OSes to declare support for the new +API after checking that the new variables don't provide any conflict for +them, and we will drop api 10 support after ganeti 2.1 has released. New Environment variables +++++++++++++++++++++++++ -Some variables have never been added to the OS api but would definitely be -useful for the OSes. We plan to add an INSTANCE_HYPERVISOR variable to allow -the OS to make changes relevant to the virtualization the instance is going to -use. Since this field is immutable for each instance, the os can tight the -install without caring of making sure the instance can run under any -virtualization technology. - -We also want the OS to know the particular hypervisor parameters, to be able to -customize the install even more. Since the parameters can change, though, we -will pass them only as an "FYI": if an OS ties some instance functionality to -the value of a particular hypervisor parameter manual changes or a reinstall -may be needed to adapt the instance to the new environment. This is not a -regression as of today, because even if the OSes are left blind about this -information, sometimes they still need to make compromises and cannot satisfy -all possible parameter values. +Some variables have never been added to the OS api but would definitely +be useful for the OSes. We plan to add an INSTANCE_HYPERVISOR variable +to allow the OS to make changes relevant to the virtualization the +instance is going to use. Since this field is immutable for each +instance, the os can tight the install without caring of making sure the +instance can run under any virtualization technology. + +We also want the OS to know the particular hypervisor parameters, to be +able to customize the install even more. 
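+
+A tiny sketch of what an OS install script written in Python could do
+with this variable (the decision taken here is only an example)::
+
+  import os
+
+  hv = os.environ.get("INSTANCE_HYPERVISOR", "")
+
+  # Tighten the install to the virtualization actually in use: a
+  # Xen PVM instance typically boots a kernel supplied by the node,
+  # while xen-hvm/kvm instances boot from their own disk.
+  needs_bootloader = hv not in ("xen-pvm",)
+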
Since the parameters can +change, though, we will pass them only as an "FYI": if an OS ties some +instance functionality to the value of a particular hypervisor parameter +manual changes or a reinstall may be needed to adapt the instance to the +new environment. This is not a regression as of today, because even if +the OSes are left blind about this information, sometimes they still +need to make compromises and cannot satisfy all possible parameter +values. OS Variants +++++++++++ -Currently we are assisting to some degree of "os proliferation" just to change -a simple installation behavior. This means that the same OS gets installed on -the cluster multiple times, with different names, to customize just one -installation behavior. Usually such OSes try to share as much as possible -through symlinks, but this still causes complications on the user side, -especially when multiple parameters must be cross-matched. - -For example today if you want to install debian etch, lenny or squeeze you -probably need to install the debootstrap OS multiple times, changing its -configuration file, and calling it debootstrap-etch, debootstrap-lenny or -debootstrap-squeeze. Furthermore if you have for example a "server" and a -"development" environment which installs different packages/configuration files -and must be available for all installs you'll probably end up with -deboostrap-etch-server, debootstrap-etch-dev, debootrap-lenny-server, -debootstrap-lenny-dev, etc. Crossing more than two parameters quickly becomes -not manageable. - -In order to avoid this we plan to make OSes more customizable, by allowing each -OS to declare a list of variants which can be used to customize it. The -variants list is mandatory and must be written, one variant per line, in the -new "variants.list" file inside the main os dir. At least one supported variant -must be supported. When choosing the OS exactly one variant will have to be -specified, and will be encoded in the os name as <OS-name>+<variant>. As for -today it will be possible to change an instance's OS at creation or install -time. +Currently we are assisting to some degree of "os proliferation" just to +change a simple installation behavior. This means that the same OS gets +installed on the cluster multiple times, with different names, to +customize just one installation behavior. Usually such OSes try to share +as much as possible through symlinks, but this still causes +complications on the user side, especially when multiple parameters must +be cross-matched. + +For example today if you want to install debian etch, lenny or squeeze +you probably need to install the debootstrap OS multiple times, changing +its configuration file, and calling it debootstrap-etch, +debootstrap-lenny or debootstrap-squeeze. Furthermore if you have for +example a "server" and a "development" environment which installs +different packages/configuration files and must be available for all +installs you'll probably end up with deboostrap-etch-server, +debootstrap-etch-dev, debootrap-lenny-server, debootstrap-lenny-dev, +etc. Crossing more than two parameters quickly becomes not manageable. + +In order to avoid this we plan to make OSes more customizable, by +allowing each OS to declare a list of variants which can be used to +customize it. The variants list is mandatory and must be written, one +variant per line, in the new "variants.list" file inside the main os +dir. At least one supported variant must be supported. 
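+
+A sketch of how such a name could be handled by the OS-related code
+(the paths and helper names are only for illustration)::
+
+  import os.path
+
+  def split_os_name(name):
+    """Split 'debootstrap+lenny' into ('debootstrap', 'lenny')."""
+    if "+" in name:
+      osname, variant = name.split("+", 1)
+      return osname, variant
+    return name, None
+
+  def read_variants(os_dir):
+    """Return the variants declared by an OS definition."""
+    with open(os.path.join(os_dir, "variants.list")) as fd:
+      return [line.strip() for line in fd if line.strip()]
+
+  osname, variant = split_os_name("debootstrap+lenny")
+  if variant not in read_variants("/srv/ganeti/os/" + osname):
+    raise ValueError("unsupported variant: %s" % variant)
+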
When choosing the +OS exactly one variant will have to be specified, and will be encoded in +the os name as <OS-name>+<variant>. As for today it will be possible to +change an instance's OS at creation or install time. The 2.1 OS list will be the combination of each OS, plus its supported -variants. This will cause the name name proliferation to remain, but at least -the internal OS code will be simplified to just parsing the passed variant, -without the need for symlinks or code duplication. - -Also we expect the OSes to declare only "interesting" variants, but to accept -some non-declared ones which a user will be able to pass in by overriding the -checks ganeti does. This will be useful for allowing some variations to be used -without polluting the OS list (per-OS documentation should list all supported -variants). If a variant which is not internally supported is forced through, -the OS scripts should abort. - -In the future (post 2.1) we may want to move to full fledged parameters all -orthogonal to each other (for example "architecture" (i386, amd64), "suite" -(lenny, squeeze, ...), etc). (As opposed to the variant, which is a single -parameter, and you need a different variant for all the set of combinations you -want to support). In this case we envision the variants to be moved inside of -Ganeti and be associated with lists parameter->values associations, which will -then be passed to the OS. +variants. This will cause the name name proliferation to remain, but at +least the internal OS code will be simplified to just parsing the passed +variant, without the need for symlinks or code duplication. + +Also we expect the OSes to declare only "interesting" variants, but to +accept some non-declared ones which a user will be able to pass in by +overriding the checks ganeti does. This will be useful for allowing some +variations to be used without polluting the OS list (per-OS +documentation should list all supported variants). If a variant which is +not internally supported is forced through, the OS scripts should abort. + +In the future (post 2.1) we may want to move to full fledged parameters +all orthogonal to each other (for example "architecture" (i386, amd64), +"suite" (lenny, squeeze, ...), etc). (As opposed to the variant, which +is a single parameter, and you need a different variant for all the set +of combinations you want to support). In this case we envision the +variants to be moved inside of Ganeti and be associated with lists +parameter->values associations, which will then be passed to the OS. IAllocator changes @@ -825,7 +851,8 @@ In this mode, called ``capacity``, given an instance specification and the current cluster state (similar to the ``allocate`` mode), the plugin needs to return: -- how many instances can be allocated on the cluster with that specification +- how many instances can be allocated on the cluster with that + specification - on which nodes these will be allocated (in order) .. vim: set textwidth=72 : diff --git a/doc/glossary.rst b/doc/glossary.rst index 25a01ead19a2a0923c644ace9ddc75140c9c0b26..c771bff8e73e259e5f6c3101de7d43a887d0393d 100644 --- a/doc/glossary.rst +++ b/doc/glossary.rst @@ -16,8 +16,8 @@ Glossary the startup of an instance. OpCode - A data structure encapsulating a basic cluster operation; for example, - start instance, add instance, etc. + A data structure encapsulating a basic cluster operation; for + example, start instance, add instance, etc. 
PVM Para-virtualization mode, where the virtual machine knows it's being diff --git a/doc/hooks.rst b/doc/hooks.rst index 03de7b9dae87979171822988d73aa0d2924f6045..3c5451723d0d229f2d7218613d0441f77fd8ed4a 100644 --- a/doc/hooks.rst +++ b/doc/hooks.rst @@ -128,8 +128,9 @@ Adds a node to the cluster. OP_REMOVE_NODE ++++++++++++++ -Removes a node from the cluster. On the removed node the hooks are called -during the execution of the operation and not after its completion. +Removes a node from the cluster. On the removed node the hooks are +called during the execution of the operation and not after its +completion. :directory: node-remove :env. vars: NODE_NAME @@ -350,7 +351,8 @@ Cluster operations OP_POST_INIT_CLUSTER ++++++++++++++++++++ -This hook is called via a special "empty" LU right after cluster initialization. +This hook is called via a special "empty" LU right after cluster +initialization. :directory: cluster-init :env. vars: none @@ -360,8 +362,8 @@ This hook is called via a special "empty" LU right after cluster initialization. OP_DESTROY_CLUSTER ++++++++++++++++++ -The post phase of this hook is called during the execution of destroy operation -and not after its completion. +The post phase of this hook is called during the execution of destroy +operation and not after its completion. :directory: cluster-destroy :env. vars: none diff --git a/doc/iallocator.rst b/doc/iallocator.rst index 2e52068fe85dd59296ffd07383c1d395eebbd0d8..67bee744ffac0b43d4f98fb8e1bf6836596f2f58 100644 --- a/doc/iallocator.rst +++ b/doc/iallocator.rst @@ -225,9 +225,10 @@ nodes or ``offline`` flags set. More details about these of node status flags is available in the manpage :manpage:`ganeti(7)`. -.. [*] Note that no run-time data is present for offline or drained nodes; - this means the tags total_memory, reserved_memory, free_memory, total_disk, - free_disk, total_cpus, i_pri_memory and i_pri_up memory will be absent +.. [*] Note that no run-time data is present for offline or drained + nodes; this means the tags total_memory, reserved_memory, + free_memory, total_disk, free_disk, total_cpus, i_pri_memory and + i_pri_up memory will be absent Response message diff --git a/doc/install.rst b/doc/install.rst index fbf227701faeee6160c2b1d37fda3c58f8f1501a..b64b19942596ee3131476101ab3906e097cdb856 100644 --- a/doc/install.rst +++ b/doc/install.rst @@ -108,20 +108,21 @@ and not just *node1*. .. admonition:: Why a fully qualified host name - Although most distributions use only the short name in the /etc/hostname - file, we still think Ganeti nodes should use the full name. The reason for - this is that calling 'hostname --fqdn' requires the resolver library to work - and is a 'guess' via heuristics at what is your domain name. Since Ganeti - can be used among other things to host DNS servers, we don't want to depend - on them as much as possible, and we'd rather have the uname() syscall return - the full node name. - - We haven't ever found any breakage in using a full hostname on a Linux - system, and anyway we recommend to have only a minimal installation on - Ganeti nodes, and to use instances (or other dedicated machines) to run the - rest of your network services. By doing this you can change the - /etc/hostname file to contain an FQDN without the fear of breaking anything - unrelated. + Although most distributions use only the short name in the + /etc/hostname file, we still think Ganeti nodes should use the full + name. 
The reason for this is that calling 'hostname --fqdn' requires + the resolver library to work and is a 'guess' via heuristics at what + is your domain name. Since Ganeti can be used among other things to + host DNS servers, we don't want to depend on them as much as + possible, and we'd rather have the uname() syscall return the full + node name. + + We haven't ever found any breakage in using a full hostname on a + Linux system, and anyway we recommend to have only a minimal + installation on Ganeti nodes, and to use instances (or other + dedicated machines) to run the rest of your network services. By + doing this you can change the /etc/hostname file to contain an FQDN + without the fear of breaking anything unrelated. Installing The Hypervisor @@ -130,9 +131,9 @@ Installing The Hypervisor **Mandatory** on all nodes. While Ganeti is developed with the ability to modularly run on different -virtualization environments in mind the only two currently useable on a live -system are Xen and KVM. Supported Xen versions are: 3.0.3, 3.0.4 and 3.1. -Supported KVM version are 72 and above. +virtualization environments in mind the only two currently useable on a +live system are Xen and KVM. Supported Xen versions are: 3.0.3, 3.0.4 +and 3.1. Supported KVM version are 72 and above. Please follow your distribution's recommended way to install and set up Xen, or install Xen from the upstream source, if you wish, @@ -140,9 +141,9 @@ following their manual. For KVM, make sure you have a KVM-enabled kernel and the KVM tools. After installing Xen, you need to reboot into your new system. On some -distributions this might involve configuring GRUB appropriately, whereas others -will configure it automatically when you install the respective kernels. For -KVM no reboot should be necessary. +distributions this might involve configuring GRUB appropriately, whereas +others will configure it automatically when you install the respective +kernels. For KVM no reboot should be necessary. .. admonition:: Xen on Debian @@ -315,8 +316,8 @@ them will already be installed on a standard machine. You can use this command line to install all needed packages:: # apt-get install lvm2 ssh bridge-utils iproute iputils-arping \ - python python-pyopenssl openssl python-pyparsing python-simplejson \ - python-pyinotify + python python-pyopenssl openssl python-pyparsing \ + python-simplejson python-pyinotify Setting up the environment for Ganeti ------------------------------------- @@ -326,34 +327,38 @@ Configuring the network **Mandatory** on all nodes. -You can run Ganeti either in "bridge mode" or in "routed mode". In bridge -mode, the default, the instances network interfaces will be attached to a -software bridge running in dom0. Xen by default creates such a bridge at -startup, but your distribution might have a different way to do things, and -you'll definitely need to manually set it up under KVM. +You can run Ganeti either in "bridge mode" or in "routed mode". In +bridge mode, the default, the instances network interfaces will be +attached to a software bridge running in dom0. Xen by default creates +such a bridge at startup, but your distribution might have a different +way to do things, and you'll definitely need to manually set it up under +KVM. Beware that the default name Ganeti uses is ``xen-br0`` (which was used in Xen 2.0) while Xen 3.0 uses ``xenbr0`` by default. The default bridge your Ganeti cluster will use for new instances can be specified at cluster initialization time. 
-If you want to run in "routing mode" you need to specify that at cluster init -time (using the --nicparam option), and then no bridge will be needed. In -this mode instance traffic will be routed by dom0, instead of bridged. +If you want to run in "routing mode" you need to specify that at cluster +init time (using the --nicparam option), and then no bridge will be +needed. In this mode instance traffic will be routed by dom0, instead of +bridged. -In order to use "routing mode" under Xen, you'll need to change the relevant -parameters in the Xen config file. Under KVM instead, no config change is -necessary, but you still need to set up your network interfaces correctly. +In order to use "routing mode" under Xen, you'll need to change the +relevant parameters in the Xen config file. Under KVM instead, no config +change is necessary, but you still need to set up your network +interfaces correctly. By default, under KVM, the "link" parameter you specify per-nic will -represent, if non-empty, a different routing table name or number to use for -your instances. This allows insulation between different instance groups, -and different routing policies between node traffic and instance traffic. +represent, if non-empty, a different routing table name or number to use +for your instances. This allows insulation between different instance +groups, and different routing policies between node traffic and instance +traffic. -You will need to configure your routing table basic routes and rules outside -of ganeti. The vif scripts will only add /32 routes to your instances, -through their interface, in the table you specified (under KVM, and in the -main table under Xen). +You will need to configure your routing table basic routes and rules +outside of ganeti. The vif scripts will only add /32 routes to your +instances, through their interface, in the table you specified (under +KVM, and in the main table under Xen). .. admonition:: Bridging under Debian @@ -512,8 +517,8 @@ that the hostname used for this must resolve to an IP address reserved **exclusively** for this purpose, and cannot be the name of the first (master) node. -If you want to use a bridge which is not ``xen-br0``, or no bridge at all, use -the --nicparams +If you want to use a bridge which is not ``xen-br0``, or no bridge at +all, use ``--nicparams``. If the bridge name you are using is not ``xen-br0``, use the *-b <BRIDGENAME>* option to specify the bridge name. In this case, you diff --git a/doc/locking.rst b/doc/locking.rst index 358db91bf25159b9e0a235cc73a2a943d65499c3..d484bef06af306244a089d634c16cb51e3987c01 100644 --- a/doc/locking.rst +++ b/doc/locking.rst @@ -11,61 +11,66 @@ It is divided by functional sections Opcode Execution Locking ------------------------ -These locks are declared by Logical Units (LUs) (in cmdlib.py) and acquired by -the Processor (in mcpu.py) with the aid of the Ganeti Locking Library -(locking.py). They are acquired in the following order: - - * BGL: this is the Big Ganeti Lock, it exists for retrocompatibility. New LUs - acquire it in a shared fashion, and are able to execute all toghether - (baring other lock waits) while old LUs acquire it exclusively and can only - execute one at a time, and not at the same time with new LUs. - * Instance locks: can be declared in ExpandNames() or DeclareLocks() by an LU, - and have the same name as the instance itself. They are acquired as a set. - Internally the locking library acquired them in alphabetical order. 
- * Node locks: can be declared in ExpandNames() or DeclareLocks() by an LU, and - have the same name as the node itself. They are acquired as a set. - Internally the locking library acquired them in alphabetical order. Given - this order it's possible to safely acquire a set of instances, and then the - nodes they reside on. - -The ConfigWriter (in config.py) is also protected by a SharedLock, which is -shared by functions that read the config and acquired exclusively by functions -that modify it. Since the ConfigWriter calls rpc.call_upload_file to all nodes -to distribute the config without holding the node locks, this call must be able -to execute on the nodes in parallel with other operations (but not necessarily -concurrently with itself on the same file, as inside the ConfigWriter this is -called with the internal config lock held. +These locks are declared by Logical Units (LUs) (in cmdlib.py) and +acquired by the Processor (in mcpu.py) with the aid of the Ganeti +Locking Library (locking.py). They are acquired in the following order: + + * BGL: this is the Big Ganeti Lock, it exists for retrocompatibility. + New LUs acquire it in a shared fashion, and are able to execute all + toghether (baring other lock waits) while old LUs acquire it + exclusively and can only execute one at a time, and not at the same + time with new LUs. + * Instance locks: can be declared in ExpandNames() or DeclareLocks() + by an LU, and have the same name as the instance itself. They are + acquired as a set. Internally the locking library acquired them in + alphabetical order. + * Node locks: can be declared in ExpandNames() or DeclareLocks() by an + LU, and have the same name as the node itself. They are acquired as + a set. Internally the locking library acquired them in alphabetical + order. Given this order it's possible to safely acquire a set of + instances, and then the nodes they reside on. + +The ConfigWriter (in config.py) is also protected by a SharedLock, which +is shared by functions that read the config and acquired exclusively by +functions that modify it. Since the ConfigWriter calls +rpc.call_upload_file to all nodes to distribute the config without +holding the node locks, this call must be able to execute on the nodes +in parallel with other operations (but not necessarily concurrently with +itself on the same file, as inside the ConfigWriter this is called with +the internal config lock held. Job Queue Locking ----------------- The job queue is designed to be thread-safe. This means that its public -functions can be called from any thread. The job queue can be called from -functions called by the queue itself (e.g. logical units), but special -attention must be paid not to create deadlocks or an invalid state. +functions can be called from any thread. The job queue can be called +from functions called by the queue itself (e.g. logical units), but +special attention must be paid not to create deadlocks or an invalid +state. -The single queue lock is used from all classes involved in the queue handling. -During development we tried to split locks, but deemed it to be too dangerous -and difficult at the time. Job queue functions acquiring the lock can be safely -called from all the rest of the code, as the lock is released before leaving -the job queue again. Unlocked functions should only be called from job queue -related classes (e.g. in jqueue.py) and the lock must be acquired beforehand. +The single queue lock is used from all classes involved in the queue +handling. 
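+
+Going back to the opcode locking order above, a new-style LU declares
+its locks roughly along these lines (a condensed, schematic example,
+not a complete logical unit)::
+
+  class LUDoSomething(LogicalUnit):
+    def ExpandNames(self):
+      # Instances first, then the nodes they reside on; both are
+      # acquired as sets by the locking library
+      self.needed_locks = {
+        locking.LEVEL_INSTANCE: ["inst1.example.com"],
+        locking.LEVEL_NODE: ["node1.example.com",
+                             "node2.example.com"],
+      }
+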
During development we tried to split locks, but deemed it to +be too dangerous and difficult at the time. Job queue functions +acquiring the lock can be safely called from all the rest of the code, +as the lock is released before leaving the job queue again. Unlocked +functions should only be called from job queue related classes (e.g. in +jqueue.py) and the lock must be acquired beforehand. -In the job queue worker (``_JobQueueWorker``), the lock must be released before -calling the LU processor. Otherwise a deadlock can occur when log messages are -added to opcode results. +In the job queue worker (``_JobQueueWorker``), the lock must be released +before calling the LU processor. Otherwise a deadlock can occur when log +messages are added to opcode results. Node Daemon Locking ------------------- -The node daemon contains a lock for the job queue. In order to avoid conflicts -and/or corruption when an eventual master daemon or another node daemon is -running, it must be held for all job queue operations +The node daemon contains a lock for the job queue. In order to avoid +conflicts and/or corruption when an eventual master daemon or another +node daemon is running, it must be held for all job queue operations -There's one special case for the node daemon running on the master node. If -grabbing the lock in exclusive fails on startup, the code assumes all checks -have been done by the process keeping the lock. +There's one special case for the node daemon running on the master node. +If grabbing the lock in exclusive fails on startup, the code assumes all +checks have been done by the process keeping the lock. .. vim: set textwidth=72 : diff --git a/doc/rapi.rst b/doc/rapi.rst index 91d12bde5e80308c0dc6a6c04c73badc33ae9244..ee64dec60eab32a8dad025afa0c3d53c9610e3d1 100644 --- a/doc/rapi.rst +++ b/doc/rapi.rst @@ -28,7 +28,8 @@ principle. Generic parameters ------------------ -A few parameter mean the same thing across all resources which implement it. +A few parameter mean the same thing across all resources which implement +it. ``bulk`` ++++++++ @@ -307,8 +308,8 @@ It supports the following commands: ``GET``. Requests detailed information about the instance. An optional parameter, ``static`` (bool), can be set to return only static information from the -configuration without querying the instance's nodes. The result will be a job -id. +configuration without querying the instance's nodes. The result will be +a job id. ``/2/instances/[instance_name]/reboot`` @@ -385,9 +386,9 @@ It supports the following commands: ``POST``. ~~~~~~~~ Takes the parameters ``mode`` (one of ``replace_on_primary``, -``replace_on_secondary``, ``replace_new_secondary`` or ``replace_auto``), -``disks`` (comma separated list of disk indexes), ``remote_node`` and -``iallocator``. +``replace_on_secondary``, ``replace_new_secondary`` or +``replace_auto``), ``disks`` (comma separated list of disk indexes), +``remote_node`` and ``iallocator``. ``/2/instances/[instance_name]/tags`` @@ -586,8 +587,8 @@ Example:: Change the node role. -The request is a string which should be PUT to this URI. The result will be a -job id. +The request is a string which should be PUT to this URI. The result will +be a job id. It supports the ``force`` argument. @@ -601,8 +602,8 @@ Manages storage units on the node. Requests a list of storage units on a node. Requires the parameters ``storage_type`` (one of ``file``, ``lvm-pv`` or ``lvm-vg``) and -``output_fields``. The result will be a job id, using which the result can be -retrieved. 
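+
+From a client's perspective such a storage request could look like the
+following (the third-party ``requests`` package, host name, port,
+credentials and field list are all just placeholders for the
+example)::
+
+  import requests
+
+  RAPI = "https://cluster.example.com:5080"
+
+  # List the LVM physical volumes of a node; the reply is a job id
+  resp = requests.get(
+    RAPI + "/2/nodes/node1.example.com/storage",
+    params={"storage_type": "lvm-pv",
+            "output_fields": "name,size,free,allocatable"},
+    auth=("rapiuser", "secret"),
+    verify="/path/to/rapi.pem")
+  job_id = resp.json()
+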
+``output_fields``. The result will be a job id, with which the result
+can be retrieved.
 
 ``/2/nodes/[node_name]/storage/modify``
 +++++++++++++++++++++++++++++++++++++++
@@ -612,10 +613,11 @@ Modifies storage units on the node.
 ``PUT``
 ~~~~~~~
 
-Modifies parameters of storage units on the node. Requires the parameters
-``storage_type`` (one of ``file``, ``lvm-pv`` or ``lvm-vg``) and ``name`` (name
-of the storage unit). Parameters can be passed additionally. Currently only
-``allocatable`` (bool) is supported. The result will be a job id.
+Modifies parameters of storage units on the node. Requires the
+parameters ``storage_type`` (one of ``file``, ``lvm-pv`` or ``lvm-vg``)
+and ``name`` (name of the storage unit). Parameters can be passed
+additionally. Currently only ``allocatable`` (bool) is supported. The
+result will be a job id.
 
 ``/2/nodes/[node_name]/storage/repair``
 +++++++++++++++++++++++++++++++++++++++
@@ -625,9 +627,9 @@ Repairs a storage unit on the node.
 ``PUT``
 ~~~~~~~
 
-Repairs a storage unit on the node. Requires the parameters ``storage_type``
-(currently only ``lvm-vg`` can be repaired) and ``name`` (name of the storage
-unit). The result will be a job id.
+Repairs a storage unit on the node. Requires the parameters
+``storage_type`` (currently only ``lvm-vg`` can be repaired) and
+``name`` (name of the storage unit). The result will be a job id.
 
 ``/2/nodes/[node_name]/tags``
 +++++++++++++++++++++++++++++
diff --git a/doc/security.rst b/doc/security.rst
index b17eee3aad7d4d325d43d701aadee67b33744e8a..5e1574692c877dc6c3c55158334328eb1fa81500 100644
--- a/doc/security.rst
+++ b/doc/security.rst
@@ -12,8 +12,8 @@ you need to be root to run the cluster commands.
 Host issues
 -----------
 
-For a host on which the Ganeti software has been installed, but not joined to a
-cluster, there are no changes to the system.
+For a host on which the Ganeti software has been installed, but not
+joined to a cluster, there are no changes to the system.
 
 For a host that has been joined to the cluster, there are very important
 changes:
@@ -65,11 +65,11 @@ nodes:
 The SSH traffic is protected (after the initial login to a new node) by
 the cluster-wide shared SSH key.
 
-RPC communication between the master and nodes is protected using SSL/TLS
-encryption. Both the client and the server must have the cluster-wide
-shared SSL/TLS certificate and verify it when establishing the connection
-by comparing fingerprints. We decided not to use a CA to simplify the
-key handling.
+RPC communication between the master and nodes is protected using
+SSL/TLS encryption. Both the client and the server must have the
+cluster-wide shared SSL/TLS certificate and verify it when establishing
+the connection by comparing fingerprints. We decided not to use a CA to
+simplify the key handling.
 
 The DRBD traffic is not protected by encryption, as DRBD does not
 support this. It's therefore recommended to implement host-level
@@ -83,20 +83,20 @@ nodes when configuring the device.
 Master daemon
 -------------
 
-The command-line tools to master daemon communication is done via an UNIX
-socket, whose permissions are reset to ``0600`` after listening but before
-serving requests. This permission-based protection is documented and works on
-Linux, but is not-portable; however, Ganeti doesn't work on non-Linux system at
-the moment.
+Communication between the command-line tools and the master daemon is
+done via a UNIX socket, whose permissions are reset to ``0600`` after
+listening but before serving requests. This permission-based protection
+is documented and works on Linux, but is not portable; however, Ganeti
+doesn't work on non-Linux systems at the moment.
 
 Remote API
 ----------
 
-Starting with Ganeti 2.0, Remote API traffic is encrypted using SSL/TLS by
-default. It supports Basic authentication as per RFC2617.
+Starting with Ganeti 2.0, Remote API traffic is encrypted using SSL/TLS
+by default. It supports Basic authentication as per RFC2617.
 
-Paths for certificate, private key and CA files required for SSL/TLS will
-be set at source configure time. Symlinks or command line parameters may
-be used to use different files.
+Paths for certificate, private key and CA files required for SSL/TLS
+will be set at source configure time. Symlinks or command line
+parameters may be used to select different files.
 
 .. vim: set textwidth=72 :
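
The lock ordering described in the locking notes above (instance locks
before node locks, with each set taken in alphabetical order) can be
illustrated with a small standalone Python sketch. The ``LockSet``
class, the ``run_lu`` helper and the lock names below are illustrative
assumptions and not Ganeti's actual locking.py API::

  import threading

  class LockSet:
      """A named set of locks, always acquired in alphabetical order."""

      def __init__(self, names):
          self._locks = {name: threading.Lock() for name in names}

      def acquire(self, names):
          # Sorting is what prevents deadlocks: every caller takes the
          # members of one set in the same (alphabetical) order.
          for name in sorted(names):
              self._locks[name].acquire()

      def release(self, names):
          for name in sorted(names, reverse=True):
              self._locks[name].release()

  # Level order mirrors the text above: instance locks, then node
  # locks (the BGL level is left out for brevity).
  INSTANCE_LOCKS = LockSet(["instance1", "instance2"])
  NODE_LOCKS = LockSet(["node1", "node2"])

  def run_lu(instances, nodes, fn):
      """Acquire locks in the documented order, run fn, then release."""
      INSTANCE_LOCKS.acquire(instances)
      try:
          NODE_LOCKS.acquire(nodes)
          try:
              return fn()
          finally:
              NODE_LOCKS.release(nodes)
      finally:
          INSTANCE_LOCKS.release(instances)

  if __name__ == "__main__":
      print(run_lu(["instance2", "instance1"], ["node1"],
                   lambda: "worked on two instances and one node"))

With this ordering, one job asking for ``instance1`` and ``instance2``
and another asking for the same two instances in the opposite order
cannot end up each holding one lock while waiting for the other.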
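
The storage endpoints documented in the rapi.rst hunks above can be
exercised with any HTTPS client that supports Basic authentication.
The sketch below uses the third-party ``requests`` library; the host,
port, credentials, storage unit name, the boolean encoding and the
choice of sending the documented parameters as query arguments are
assumptions made for the example, not details taken from the manual::

  import requests  # third-party HTTP client, assumed to be available

  RAPI = "https://cluster.example.com:5080"  # placeholder host and port

  # Ask the cluster to mark an LVM physical volume on a node as
  # non-allocatable; per the documentation the call returns a job id.
  resp = requests.put(
      RAPI + "/2/nodes/node1.example.com/storage/modify",
      params={
          "storage_type": "lvm-pv",  # documented: file, lvm-pv or lvm-vg
          "name": "/dev/sdb1",       # placeholder storage unit name
          "allocatable": 0,          # documented as a bool; 0/1 used here
      },
      auth=("rapi-user", "secret"),   # Basic authentication (RFC2617)
      verify="/path/to/rapi-ca.pem",  # placeholder certificate path
  )
  resp.raise_for_status()
  print("job id:", resp.json())

The returned job id can then be polled for the operation's result.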