- 16 Jul, 2015 17 commits
-
-
Klaus Aehlig authored
Due to the still existing configuration lock, modifications to the configuration can temporarily be impossible. Therefore, most configuration-modifying functions return a Boolean indicating whether the change was carried out. Add a utility function to retry such a change until it succeeds.

Signed-off-by: Klaus Aehlig <aehlig@google.com>
Reviewed-by: Petr Pudlak <pudlak@google.com>
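The retry pattern described above can be sketched as follows; the function name, attempt count, and delay are illustrative assumptions, not Ganeti's actual API:

```python
import time

def retry_until_done(modification, attempts=10, delay=0.1):
    """Retry a config modification that reports success as a Boolean.

    `modification` is any zero-argument callable returning True once
    the change was carried out (e.g. because the configuration lock
    was free).  Returns True on success, False if all attempts failed.
    """
    for _ in range(attempts):
        if modification():
            return True
        time.sleep(delay)  # give the lock holder a chance to finish
    return False
```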
-
Klaus Aehlig authored
In this way, the maintenance daemon can update the jobs part of its state, while complying with the requirement that all its state be stored in the configuration (and hence, also sufficiently replicated).

Signed-off-by: Klaus Aehlig <aehlig@google.com>
Reviewed-by: Petr Pudlak <pudlak@google.com>
-
Klaus Aehlig authored
...so that the maintenance daemon can always access the authoritative version of that list.

Signed-off-by: Klaus Aehlig <aehlig@google.com>
Reviewed-by: Petr Pudlak <pudlak@google.com>
-
Klaus Aehlig authored
Also, back off if a round is bad. This is usually the case if communication with some essential daemon failed; in that case, we do not want to put additional load on the system.

Signed-off-by: Klaus Aehlig <aehlig@google.com>
Reviewed-by: Petr Pudlak <pudlak@google.com>
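The back-off behaviour described above can be sketched as follows; the doubling factor, cap, and minimum are illustrative constants, not the daemon's actual values:

```python
def next_delay(current, maximum=300.0, factor=2.0):
    """Exponential back-off for the delay between maintenance rounds.

    After a bad round (e.g. a failed RPC to an essential daemon) the
    delay grows by `factor`, capped at `maximum`.
    """
    return min(current * factor, maximum)

def delay_after_round(round_ok, current, minimum=1.0):
    """A good round resets the delay to the minimum; a bad one backs off."""
    return minimum if round_ok else next_delay(current)
```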
-
Klaus Aehlig authored
...so that the maintenance daemon can query for the interval at which to run.

Signed-off-by: Klaus Aehlig <aehlig@google.com>
Reviewed-by: Petr Pudlak <pudlak@google.com>
-
Klaus Aehlig authored
Signed-off-by: Klaus Aehlig <aehlig@google.com>
Reviewed-by: Petr Pudlak <pudlak@google.com>
-
Klaus Aehlig authored
...thus providing a convenient way to control the interval at which the maintenance daemon does its repairs.

Signed-off-by: Klaus Aehlig <aehlig@google.com>
Reviewed-by: Petr Pudlak <pudlak@google.com>
-
Klaus Aehlig authored
...that will be used to set the minimal delay between rounds of the maintenance daemon.

Signed-off-by: Klaus Aehlig <aehlig@google.com>
Reviewed-by: Petr Pudlak <pudlak@google.com>
-
Klaus Aehlig authored
Signed-off-by: Klaus Aehlig <aehlig@google.com>
Reviewed-by: Petr Pudlak <pudlak@google.com>
-
Klaus Aehlig authored
While not technically part of the cluster configuration, OpClusterSetParams is the best place to set and modify the maintenance interval of the maintenance daemon. Most likely, there won't be enough tunables to justify a separate Ganeti command.

Signed-off-by: Klaus Aehlig <aehlig@google.com>
Reviewed-by: Petr Pudlak <pudlak@google.com>
-
Klaus Aehlig authored
...so that the user can modify it. As usual, we do so with only a temporary lock acquired by WConfD.

Signed-off-by: Klaus Aehlig <aehlig@google.com>
Reviewed-by: Petr Pudlak <pudlak@google.com>
-
Klaus Aehlig authored
As per our design, the maintenance daemon stores its state in the configuration; also, several of its aspects are configurable. So add a corresponding configuration object.

Signed-off-by: Klaus Aehlig <aehlig@google.com>
Reviewed-by: Petr Pudlak <pudlak@google.com>
-
Klaus Aehlig authored
As per our design, the maintenance daemon operates in rounds. To avoid putting too much load on the cluster, the daemon waits a minimal amount of time between those rounds. This time will be configurable, but there is a default value.

Signed-off-by: Klaus Aehlig <aehlig@google.com>
Reviewed-by: Petr Pudlak <pudlak@google.com>
-
Klaus Aehlig authored
...thus avoiding the magic constant 1000000 (the number of microseconds in a second) all over the place.

Signed-off-by: Klaus Aehlig <aehlig@google.com>
Reviewed-by: Petr Pudlak <pudlak@google.com>
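The idea is simply to give the conversion factor a name; a minimal sketch (the constant and helper names are assumptions, not Ganeti's identifiers):

```python
MICROSECONDS_PER_SECOND = 1000000  # named once, instead of a magic constant

def seconds_to_microseconds(seconds):
    """Convert a (possibly fractional) second count to whole microseconds."""
    return int(seconds * MICROSECONDS_PER_SECOND)

def microseconds_to_seconds(micros):
    """Convert microseconds back to seconds as a float."""
    return micros / MICROSECONDS_PER_SECOND
```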
-
Klaus Aehlig authored
...that takes care of creating and closing the client properly.

Signed-off-by: Klaus Aehlig <aehlig@google.com>
Reviewed-by: Petr Pudlak <pudlak@google.com>
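The create-and-close pattern behind such a helper can be sketched as a Python context manager; the client type and factory here are hypothetical stand-ins, not the actual Ganeti client:

```python
from contextlib import contextmanager

@contextmanager
def with_client(factory):
    """Create a client, hand it to the caller, and always close it.

    `factory` builds the client object; the client is closed even if
    the body raises, mirroring a Haskell-style `bracket`.
    """
    client = factory()
    try:
        yield client
    finally:
        client.close()
```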
-
Klaus Aehlig authored
...thus avoiding too frequent polling, as suggested by the TODO entry.

Signed-off-by: Klaus Aehlig <aehlig@google.com>
Reviewed-by: Petr Pudlak <pudlak@google.com>
-
Klaus Aehlig authored
A lot of Ganeti functions return a type IO (GenericResult e a) with various failure types e. It is often necessary to combine all those results in a generic ResultT String IO a.

Signed-off-by: Klaus Aehlig <aehlig@google.com>
Reviewed-by: Petr Pudlak <pudlak@google.com>
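The underlying idea, unifying results that carry different error types by rendering every error as a string, can be sketched in Python (the Haskell original works on IO (GenericResult e a) and ResultT String IO a; the classes below are illustrative stand-ins, not Ganeti code):

```python
class Ok:
    """Successful result carrying a value."""
    def __init__(self, value):
        self.value = value

class Bad:
    """Failed result carrying an error of arbitrary type."""
    def __init__(self, error):
        self.error = error

def to_string_result(result):
    """Map a result with an arbitrary error type to one whose error
    is a plain string, so results from different sources combine."""
    if isinstance(result, Bad):
        return Bad(str(result.error))
    return result

def sequence(results):
    """Combine results, failing on the first (stringified) error."""
    values = []
    for r in map(to_string_result, results):
        if isinstance(r, Bad):
            return r
        values.append(r.value)
    return Ok(values)
```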
-
- 10 Jul, 2015 14 commits
-
-
Oleg Ponomarev authored
During the sequence of moves computed for cluster balancing, the situation on the cluster may change (e.g., because a new instance is added, or because instance or node parameters change), and the desired moves can become unprofitable. Partly prevent this effect by introducing the new hbal option *--avoid-disk-moves=FACTOR*, which admits only sufficiently profitable disk moves.

Signed-off-by: Oleg Ponomarev <onponomarev@gmail.com>
Signed-off-by: Klaus Aehlig <aehlig@google.com>
Reviewed-by: Klaus Aehlig <aehlig@google.com>
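The effect of such a factor can be sketched as a filter over candidate moves. The move representation and the comparison rule below are illustrative assumptions, not hbal's actual metric:

```python
def filter_moves(moves, factor):
    """Admit only sufficiently profitable disk moves.

    `moves` is a list of (is_disk_move, gain) pairs.  Non-disk moves
    are always kept; a disk move is kept only when its cluster-score
    gain exceeds `factor` times the best non-disk gain, so expensive
    moves that may become unprofitable while earlier moves run are
    skipped.
    """
    best_plain = max((g for disk, g in moves if not disk), default=0.0)
    return [(disk, g) for disk, g in moves
            if not disk or g > factor * best_plain]
```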
-
Klaus Aehlig authored
For the moment, we wait a fixed amount of time between the runs of the main loop. Later, once maintd's state has been added to the configuration, we will wait for all submitted jobs, but at least a configurable amount of time.

Signed-off-by: Klaus Aehlig <aehlig@google.com>
Reviewed-by: Petr Pudlak <pudlak@google.com>
-
Klaus Aehlig authored
As per our design, this task will get a restriction on the nodes on which it may operate, and will return an updated list of unaffected nodes and the list of job ids it submitted.

Signed-off-by: Klaus Aehlig <aehlig@google.com>
Reviewed-by: Petr Pudlak <pudlak@google.com>
-
Klaus Aehlig authored
The maintenance daemon will submit jobs. As per design, they will be indicated as originating from the maintenance daemon in the reason trail. Add a utility function to do this annotation.

Signed-off-by: Klaus Aehlig <aehlig@google.com>
Reviewed-by: Petr Pudlak <pudlak@google.com>
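Ganeti reason trails are lists of (source, reason, timestamp) entries; the annotation can be sketched as below. The source constant and the dict-based op representation are assumptions for illustration, not the daemon's actual code:

```python
import time

MAINTD_SOURCE = "gnt:daemon:maintd"  # assumed source tag for maintd

def annotate_reason(ops, reason):
    """Append a maintenance-daemon entry to each op's reason trail,
    so submitted jobs are traceable back to the daemon."""
    timestamp = int(time.time() * 1e9)  # reason trails use nanoseconds
    for op in ops:
        trail = op.setdefault("reason", [])
        trail.append((MAINTD_SOURCE, reason, timestamp))
    return ops
```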
-
Klaus Aehlig authored
In this way, we can later report which jobs we executed, as, e.g., the maintenance daemon will have to do.

Signed-off-by: Klaus Aehlig <aehlig@google.com>
Reviewed-by: Petr Pudlak <pudlak@google.com>
-
Klaus Aehlig authored
In this way, they can be reused by the maintenance daemon. Note that the parts of harep living in IO are very specific to the stand-alone-tool approach, where showing messages on stderr and dying on the first error are OK.

Signed-off-by: Klaus Aehlig <aehlig@google.com>
Reviewed-by: Petr Pudlak <pudlak@google.com>
-
Klaus Aehlig authored
Add a man page for the newly added maintenance daemon. It describes the general purpose of this daemon and the command-line options.

Signed-off-by: Klaus Aehlig <aehlig@google.com>
Reviewed-by: Petr Pudlak <pudlak@google.com>
-
Klaus Aehlig authored
This daemon will take over cluster maintenance as per our design document. As it will heavily depend on the monitoring daemon, it will only be enabled (at configure time) if the monitoring daemon is enabled as well. It will also run as the same user and group. In this commit, only the plain daemon is added, with the only supported request being the query for the supported protocol versions.

Signed-off-by: Klaus Aehlig <aehlig@google.com>
Reviewed-by: Petr Pudlak <pudlak@google.com>
-
Klaus Aehlig authored
A common operation, also for our HTTP-speaking daemons, is to return a piece of JSON. So factor out a method for this purpose, so that it can be reused later.

Signed-off-by: Klaus Aehlig <aehlig@google.com>
Reviewed-by: Petr Pudlak <pudlak@google.com>
-
Klaus Aehlig authored
As the new maintenance daemon will also provide an HTTP server, move the generic infrastructure to a utils module, so that it can be shared between the two servers.

Signed-off-by: Klaus Aehlig <aehlig@google.com>
Reviewed-by: Petr Pudlak <pudlak@google.com>
-
Klaus Aehlig authored
...by bumping the minor version and resetting the configuration downgrade function.

Signed-off-by: Klaus Aehlig <aehlig@google.com>
Reviewed-by: Petr Pudlak <pudlak@google.com>
-
Klaus Aehlig authored
Support restarting of failed repair events, by allowing unconditional forgetting of a failed event. Also, rename it to maintenance daemon to emphasize that it does more than just coordinating repairs.

Signed-off-by: Andrew King <ahking@google.com>
Signed-off-by: Klaus Aehlig <aehlig@google.com>
Reviewed-by: Petr Pudlak <pudlak@google.com>
-
Lisa Velden authored
Determine the job file path with qa_utils.MakeNodePath, so that we get the correct path, even for vcluster.

Signed-off-by: Lisa Velden <velden@google.com>
Reviewed-by: Klaus Aehlig <aehlig@google.com>
-
Klaus Aehlig authored
Currently, iterateAlloc makes one guess at the remaining capacity and falls back to small steps if that guess turns out to be too optimistic. In the typical case, where the allocation is bound by memory, that initial guess works quite well; in some cases, however, other requirements limit the number of instances allocatable on a cluster. Instead of immediately giving up in this case, try smaller guess-and-verify steps to avoid having to check for global N+1 redundancy too often.

Signed-off-by: Klaus Aehlig <aehlig@google.com>
Reviewed-by: Petr Pudlak <pudlak@google.com>
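The shrinking guess-and-verify idea can be sketched abstractly; `can_allocate(n)` stands in for the expensive check (including global N+1 verification) that `n` instances still fit on the cluster:

```python
def allocate_count(can_allocate, initial_guess):
    """Find how many instances fit, using shrinking guesses.

    Try the full guess first; whenever a guess fails, halve the step
    instead of dropping straight to one-instance increments, so the
    expensive verification runs O(log n) times rather than O(n).
    """
    total, step = 0, initial_guess
    while step >= 1:
        if can_allocate(total + step):
            total += step
        else:
            step //= 2
    return total
```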
-
- 09 Jul, 2015 8 commits
-
-
Petr Pudlak authored
* stable-2.15
  (no changes)
* stable-2.14
  Move _ValidateConfig to the verify.py submodule
  Fix building of shell command in export
  Add test showing a bug in location score calculation
  Bugfix for cluster location score calculation
* stable-2.13
  Properly get rid of all watcher jobs
  Move stdout_of to qa_utils
  Describe --no-verify-disks option in watcher man page
  Make disk verification optional
* stable-2.12
  Tell git to ignore tools/ssl-update
  Use 'exclude_daemons' option for master only
  Disable superfluous restarting of daemons
  Add tests exercising the "crashed" state handling
  Add proper handling of the "crashed" Xen state
  Handle SSL setup when downgrading
  Write SSH ports to ssconf files
  Noded: Consider certificate chain in callback
  Cluster-keys-replacement: update documentation
  Backend: Use timestamp as serial no for server cert
  UPGRADE: add note about 2.12.5
  NEWS: Mention issue 1094
  man: mention changes in renew-crypto
  Verify: warn about self-signed client certs
  Bootstrap: validate SSL setup before starting noded
  Clean up configuration of curl request
  Renew-crypto: remove superflous copying of node certs
  Renew-crypto: propagate verbose and debug option
  Noded: log the certificate and digest on noded startup
  QA: reload rapi cert after renew crypto
  Prepare-node-join: use common functions
  Renew-crypto: remove dead code
  Init: add master client certificate to configuration
  Renew-crypto: rebuild digest map of all nodes
  Noded: make "bootstrap" a constant
  node-daemon-setup: generate client certificate
  tools: Move (Re)GenerateClientCert to common
  Renew cluster and client certificates together
  Init: create the master's client cert in bootstrap
  Renew client certs using ssl_update tool
  Run functions while (some) daemons are stopped
  Back up old client.pem files
  Introduce ssl_update tool
  x509 function for creating signed certs
  Add tools/common.py from 2.13
  Consider ECDSA in SSH setup
  Update documentation of watcher and RAPI daemon
  Watcher: add option for setting RAPI IP
  When connecting to Metad fails, log the full stack trace
  Set up the Metad client with allow_non_master
  Set up the configuration client properly on non-masters
  Add the 'allow_non_master' option to the WConfd RPC client
  Add the option to disable master checks to the RPC client
  Add 'allow_non_master' to the Luxi test transport class too
  Add 'allow_non_master' to FdTransport for compatibility
  Properly document all constructor arguments of Transport
  Allow the Transport class to be used for non-master nodes
  Don't define the set of all daemons twice
* stable-2.11
  Fix capitalization of TestCase
  Trigger renew-crypto on downgrade to 2.11

Conflicts:
  Makefile.am
  lib/ssconf.py
  src/Ganeti/Constants.hs
  src/Ganeti/Ssconf.hs
  test/hs/shelltests/htools-hbal.test

Resolutions:
  Makefile.am: keep all the Haskell test data files
  lib/ssconf.py: keep the auto-generated list of valid keys from master
  src/Ganeti/Constants.hs: merge the ssconf entry for ssh ports to the list of valid keys
  src/Ganeti/Ssconf.hs: keep the generated list of constructors from master
  test/hs/shelltests/htools-hbal.test: keep all tests

Signed-off-by: Petr Pudlak <pudlak@google.com>
Reviewed-by: Klaus Aehlig <aehlig@google.com>
-
Petr Pudlak authored
* stable-2.14
  Move _ValidateConfig to the verify.py submodule
  Fix building of shell command in export
  Add test showing a bug in location score calculation
  Bugfix for cluster location score calculation
* stable-2.13
  Properly get rid of all watcher jobs
  Move stdout_of to qa_utils
  Describe --no-verify-disks option in watcher man page
  Make disk verification optional
* stable-2.12
  Tell git to ignore tools/ssl-update
  Use 'exclude_daemons' option for master only
  Disable superfluous restarting of daemons
  Add tests exercising the "crashed" state handling
  Add proper handling of the "crashed" Xen state
  Handle SSL setup when downgrading
  Write SSH ports to ssconf files
  Noded: Consider certificate chain in callback
  Cluster-keys-replacement: update documentation
  Backend: Use timestamp as serial no for server cert
  UPGRADE: add note about 2.12.5
  NEWS: Mention issue 1094
  man: mention changes in renew-crypto
  Verify: warn about self-signed client certs
  Bootstrap: validate SSL setup before starting noded
  Clean up configuration of curl request
  Renew-crypto: remove superflous copying of node certs
  Renew-crypto: propagate verbose and debug option
  Noded: log the certificate and digest on noded startup
  QA: reload rapi cert after renew crypto
  Prepare-node-join: use common functions
  Renew-crypto: remove dead code
  Init: add master client certificate to configuration
  Renew-crypto: rebuild digest map of all nodes
  Noded: make "bootstrap" a constant
  node-daemon-setup: generate client certificate
  tools: Move (Re)GenerateClientCert to common
  Renew cluster and client certificates together
  Init: create the master's client cert in bootstrap
  Renew client certs using ssl_update tool
  Run functions while (some) daemons are stopped
  Back up old client.pem files
  Introduce ssl_update tool
  x509 function for creating signed certs
  Add tools/common.py from 2.13
  Consider ECDSA in SSH setup
  Update documentation of watcher and RAPI daemon
  Watcher: add option for setting RAPI IP
  When connecting to Metad fails, log the full stack trace
  Set up the Metad client with allow_non_master
  Set up the configuration client properly on non-masters
  Add the 'allow_non_master' option to the WConfd RPC client
  Add the option to disable master checks to the RPC client
  Add 'allow_non_master' to the Luxi test transport class too
  Add 'allow_non_master' to FdTransport for compatibility
  Properly document all constructor arguments of Transport
  Allow the Transport class to be used for non-master nodes
  Don't define the set of all daemons twice
* stable-2.11
  Fix capitalization of TestCase
  Trigger renew-crypto on downgrade to 2.11

Conflicts:
  lib/backend.py
  src/Ganeti/HTools/Cluster.hs
  test/hs/shelltests/htools-hbal.test

Resolutions:
  lib/backend.py: keep the improved 2.15 communication mechanism with Metad
  src/Ganeti/HTools/Cluster.hs: propagate changes from [fb0c774b] to .../Cluster/Moves.hs
  test/hs/shelltests/htools-hbal.test: keep tests from both versions

Signed-off-by: Petr Pudlak <pudlak@google.com>
Reviewed-by: Klaus Aehlig <aehlig@google.com>
-
Oleg Ponomarev authored
Add a description of the second common-failure component of location tags to the hbal manpage.

Signed-off-by: Oleg Ponomarev <onponomarev@gmail.com>
Signed-off-by: Klaus Aehlig <aehlig@google.com>
Reviewed-by: Klaus Aehlig <aehlig@google.com>
-
Oleg Ponomarev authored
The initial configuration contains a situation in which two DNS providers are located on nodes sharing the same power source. Hbal should optimize this placement by a simple failover.

Signed-off-by: Oleg Ponomarev <onponomarev@gmail.com>
Signed-off-by: Klaus Aehlig <aehlig@google.com>
Reviewed-by: Klaus Aehlig <aehlig@google.com>
-
Oleg Ponomarev authored
According to the design-location document (improving location awareness), the cluster metric is extended by a new component: the number of pairs of an exclusion tag and a common-failure tag for which there exist at least two instances with the given exclusion tag whose primary node has the given common-failure tag. This patch also fixes the Statistics.hs test, which was broken by the changes in Statistics.hs.

Signed-off-by: Oleg Ponomarev <onponomarev@gmail.com>
Signed-off-by: Klaus Aehlig <aehlig@google.com>
Reviewed-by: Klaus Aehlig <aehlig@google.com>
-
Klaus Aehlig authored
* stable-2.13
  Properly get rid of all watcher jobs
  Move stdout_of to qa_utils
* stable-2.12
  Tell git to ignore tools/ssl-update
  Use 'exclude_daemons' option for master only
  Disable superfluous restarting of daemons
  Add tests exercising the "crashed" state handling
  Add proper handling of the "crashed" Xen state
* stable-2.11
  Fix capitalization of TestCase
  Trigger renew-crypto on downgrade to 2.11

Signed-off-by: Klaus Aehlig <aehlig@google.com>
Reviewed-by: Petr Pudlak <pudlak@google.com>
-
Petr Pudlak authored
...in order to get the size of config/__init__ under 3600 lines again.

Signed-off-by: Petr Pudlak <pudlak@google.com>
Reviewed-by: Klaus Aehlig <aehlig@google.com>
-
Petr Pudlak authored
* stable-2.13
  Describe --no-verify-disks option in watcher man page
  Make disk verification optional
* stable-2.12
  Handle SSL setup when downgrading
  Write SSH ports to ssconf files
  Noded: Consider certificate chain in callback
  Cluster-keys-replacement: update documentation
  Backend: Use timestamp as serial no for server cert
  UPGRADE: add note about 2.12.5
  NEWS: Mention issue 1094
  man: mention changes in renew-crypto
  Verify: warn about self-signed client certs
  Bootstrap: validate SSL setup before starting noded
  Clean up configuration of curl request
  Renew-crypto: remove superflous copying of node certs
  Renew-crypto: propagate verbose and debug option
  Noded: log the certificate and digest on noded startup
  QA: reload rapi cert after renew crypto
  Prepare-node-join: use common functions
  Renew-crypto: remove dead code
  Init: add master client certificate to configuration
  Renew-crypto: rebuild digest map of all nodes
  Noded: make "bootstrap" a constant
  node-daemon-setup: generate client certificate
  tools: Move (Re)GenerateClientCert to common
  Renew cluster and client certificates together
  Init: create the master's client cert in bootstrap
  Renew client certs using ssl_update tool
  Run functions while (some) daemons are stopped
  Back up old client.pem files
  Introduce ssl_update tool
  x509 function for creating signed certs
  Add tools/common.py from 2.13
  Consider ECDSA in SSH setup
  Update documentation of watcher and RAPI daemon
  Watcher: add option for setting RAPI IP
  When connecting to Metad fails, log the full stack trace
  Set up the Metad client with allow_non_master
  Set up the configuration client properly on non-masters
  Add the 'allow_non_master' option to the WConfd RPC client
  Add the option to disable master checks to the RPC client
  Add 'allow_non_master' to the Luxi test transport class too
  Add 'allow_non_master' to FdTransport for compatibility
  Properly document all constructor arguments of Transport
  Allow the Transport class to be used for non-master nodes
  Don't define the set of all daemons twice

Conflicts:
  Makefile.am
  lib/cmdlib/cluster/verify.py
  lib/config/__init__.py
  tools/cfgupgrade

Resolutions:
  Makefile.am: keep newly added files from both branches
  lib/cmdlib/cluster/verify.py: propagate relevant changes from lib/cmdlib/cluster.py to lib/cmdlib/cluster/__init__.py
  lib/config/__init__.py: include methods added in stable-2.13; temporarily disable the warning for too many lines
  tools/cfgupgrade: propagate changes to lib/tools/cfgupgrade.py

Signed-off-by: Petr Pudlak <pudlak@google.com>
Reviewed-by: Helga Velroyen <helgav@google.com>
-
- 08 Jul, 2015 1 commit
-
-
Klaus Aehlig authored
* stable-2.12
  Tell git to ignore tools/ssl-update
  Use 'exclude_daemons' option for master only
  Disable superfluous restarting of daemons
  Add tests exercising the "crashed" state handling
  Add proper handling of the "crashed" Xen state
* stable-2.11
  Fix capitalization of TestCase
  Trigger renew-crypto on downgrade to 2.11

Conflicts:
  .gitignore: use all additions

Signed-off-by: Klaus Aehlig <aehlig@google.com>
Reviewed-by: Petr Pudlak <pudlak@google.com>
-