- Sep 02, 2010
-
-
Iustin Pop authored
Since the RAPI certificate is not necessarily self-signed, and we currently don't have any configuration variable for the real CA file, we disable for now the CA checks. This fixes the 'restart RAPI every 5 minutes' problem with non-self-signed certs. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
-
- Aug 18, 2010
-
-
Manuel Franceschini authored
This patch enables IPv6 name resolution by using socket.getaddrinfo instead of socket.gethostbyname_ex. It renames the HostInfo class to Hostname and unifies its use throughout the code. This is achieved by using static calls where no object is needed and removes some obsolete code. For now, we just resolve to IPv4 addresses, but this will change once it is needed. Signed-off-by:
Manuel Franceschini <livewire@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
- Jul 26, 2010
-
-
Iustin Pop authored
This patch implements a few changes to the instance handling. First, old instances which no longer exist on the cluster are removed from the state file, to keep things clean. Second, the instance restart counters are reset every 8 hours, since some error cases might be transient (e.g. networking issues, or machine temporarily down), and if the problem takes more than 5 restarts but is not permanent, watcher will not restart the instance. The value of 8 hours is, I think, both conservative (as not to hammer the cluster too often with restarts) and fast enough to clear semi-transient problems. And last, if an instance is not restarted due to exhausted retries, this should be warned, otherwise it's hard to understand why watcher doesn't want to restart an ERROR_down instance. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
René Nussbaumer <rn@google.com>
-
- Jul 09, 2010
-
-
Manuel Franceschini authored
This patch moves network utility functions to a dedicated module. Signed-off-by:
Manuel Franceschini <livewire@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
- Jul 01, 2010
-
-
Michael Hanselmann authored
Currently the RAPI client uses the urllib2 and httplib modules from Python's standard library. They're used with pyOpenSSL in a very fragile way, and there are known issues when receiving large responses from a RAPI server. By switching to PycURL we leverage the power and stability of the widely-used curl library (libcurl). This brings us much more flexibility than before, and timeouts were easily implemented (something that would have involved a lot of work with the built-in modules). There's one small drawback: Programs using libcurl have to call curl_global_init(3) (available as pycurl.global_init) while exactly one thread is running (e.g. before other threads) and are supposed to call curl_global_cleanup(3) (available as pycurl.global_cleanup) upon exiting. See the manpages for details. A decorator is provided to simplify this. Unittests for the new code are provided, increasing the test coverage of the RAPI client from 74% to 89%. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Guido Trotter <ultrotter@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
- Jun 30, 2010
-
-
Manuel Franceschini authored
Signed-off-by:
Manuel Franceschini <livewire@google.com> Reviewed-by:
Guido Trotter <ultrotter@google.com>
-
- Jun 03, 2010
-
-
Tom Limoncelli authored
Update ganeti-watcher so that it tests the master's RAPI port with a simple test (in this case GetVersion). If it fails, make one attempt at restarting ganeti-rapi and retest. - daemons/ganeti-watcher: Test rapi and make one attempt at restarting it. - lib/utils.py: add StopDaemon() function. Signed-off-by:
Tom Limoncelli <tlim@google.com> Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
-
- Apr 09, 2010
-
-
Iustin Pop authored
Since the actions are potentially destructive, we should try to get a consistent view of the cluster, so it's better to get the most coverage possible. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Guido Trotter <ultrotter@google.com>
-
- Apr 08, 2010
-
-
Iustin Pop authored
This patch changes the watcher so that it maintains (on all nodes) the list of instances and DRBD devices by shutting down ones that confd daemons indicate should not be running on this node. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Guido Trotter <ultrotter@google.com>
-
- Mar 23, 2010
-
-
Iustin Pop authored
Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
René Nussbaumer <rn@google.com>
-
Iustin Pop authored
If the hooks dir does not exist, do not warn needlessly. This is similar to commit a9b7e346 (for backend.py). Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
René Nussbaumer <rn@google.com>
-
- Mar 08, 2010
-
-
Iustin Pop authored
This passes a full burnin with lots of instances, and should be safe as we mostly to join a known root (various constants) to a run-time variable. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
-
- Feb 26, 2010
-
-
Michael Hanselmann authored
By opening the lock file early, other programs can lock the state file to prevent ganeti-watcher from restarting daemons. Using the pause feature is inherently prone to race conditions. Before a traceback was logged when the lock file couldn't be acquired. Now it'll be a more friendly message. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Guido Trotter <ultrotter@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
Guido Trotter authored
This is going to be used from the nbma repository, to ensure that the nld daemon is running. Signed-off-by:
Guido Trotter <ultrotter@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
Guido Trotter authored
These hooks are run on all nodes, after the "base" daemons are started. Signed-off-by:
Guido Trotter <ultrotter@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
-
Guido Trotter authored
We're using a separate function for this, as we're going to add some functionality to this feature. Signed-off-by:
Guido Trotter <ultrotter@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
Guido Trotter authored
Signed-off-by:
Guido Trotter <ultrotter@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
- Feb 23, 2010
-
-
Michael Hanselmann authored
If activating disks fails for some reason, the watcher didn't catch the exception. With this patch it's caught and logged. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Guido Trotter <ultrotter@google.com>
-
- Jan 28, 2010
-
-
Guido Trotter authored
Ganeti-confd should be running on all 2.1 nodes. Signed-off-by:
Guido Trotter <ultrotter@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
- Jan 04, 2010
-
-
Iustin Pop authored
In some cases pylint doesn't parse the import correctly, so we add silences; but there are also many cases of unused imports, which we simply remove. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Olivier Tharan <olive@google.com>
-
Iustin Pop authored
Of all daemons, only rapi did abort when given argument. None of our daemons use any arguments, but they accepted them blindly. This is a very bad experience for the user. This patch adds checking and exiting in all daemons, in a uniform way. One other option would have been to add a flag to GenericMain (noargs=True). Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Olivier Tharan <olive@google.com>
-
Iustin Pop authored
This removes unused variables in the rest of the code (outside lib/). Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Olivier Tharan <olive@google.com>
-
Iustin Pop authored
This patch should have only: - pylint disables - docstring changes - whitespace changes Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Olivier Tharan <olive@google.com>
-
Iustin Pop authored
The logging functions expand the arguments themselves, thus it's safer to let them do it rather than manual string formatting. Also re-wraps one comment. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Olivier Tharan <olive@google.com>
-
- Nov 25, 2009
-
-
Iustin Pop authored
This patch removes the quotes from CommaJoin and converts most of the callers (that I could find) to it. Since CommaJoin does str(i) for i in param, we can remove these, thus simplifying slightly a few calls. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
-
- Nov 05, 2009
-
-
Michael Hanselmann authored
Until now, Ganeti started and stopped its own daemons using custom functions. To start, the daemon was just executed and then sent the appropriate signals to stop it again. Init scripts would have to pay attention to the PID file and other things. With this patch, a new script is added (“daemon-util”, installed in $prefix/lib/ganeti/), centralizing the starting and stopping of daemons. The provided example init script is adjusted to use this new script. Ganeti's code no longer calls its own init script. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Guido Trotter <ultrotter@google.com>
-
- Sep 18, 2009
-
-
Iustin Pop authored
Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
-
- Aug 26, 2009
-
-
Michael Hanselmann authored
Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
- Jul 24, 2009
-
-
Guido Trotter authored
The <DAEMON>_PID constants were created to reference a daemon pid file, but actually contain a daemon's name, because the various functions that work with pidfiles abstract the filename from the daemon name themselves. Removing the constants and using the actual daemon name constants in their place. Signed-off-by:
Guido Trotter <ultrotter@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
-
- May 25, 2009
-
-
Iustin Pop authored
This patch makes the watcher automatically restart the node and rapi daemons, if they are not running (as per the PID file). This is not an exhaustive test; a better one would be TCP connect to the port, and an even better one a simple protocol ping (e.g. get / for rapi and a rpc_call_alive for noded), but since we don't know how they've been started we can't implement it today. rapi would need to write the SSL/port to a file, and noded something similar, so that we know how to connect. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
-
Iustin Pop authored
Currently the watcher is broken when the queue is full, thus not fulfilling its job as a queue cleaner. It also doesn't handle nicely the queue drained status. This patch does a few changes: - first archive jobs, and only after submit jobs; this fixes the case where the queue is already full and there are jobs suited for archiving (but not the case where the jobs all too young to be archived) - handle nicely the job queue full and drained cases—instead of tracebacks, log such cases nicely - reverse the initial value and special cases for update_file; we now whitelist instead of blacklist cases, since we have much more blacklist cases than vice versa, and we set the flag to True only after the run is successful The last change, especially, is a significant one: now errors during the watcher run will not update the status file, and thus they won't be lost again in the logs. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
-
- May 20, 2009
-
-
Iustin Pop authored
This patch modifies the watcher to keep on-disk a file with the instance status; this can be used from outside of ganeti to react to instances being down (when the watcher cannot restart them). Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Guido Trotter <ultrotter@google.com>
-
- May 19, 2009
-
-
Iustin Pop authored
Bugs in either our code or in associated libraries can bring the master daemon down, and this (due to the 2.0 architecture) stops all work on the cluster. Since the watcher already does periodic checks on the cluster, we modify it to try to start the master automatically in case of failures to connect. This will be tried only once per cycle. Also, in this case, we modify the code so that the watcher status file is not updated - its timestamp will reflect thus the time of last successful connection to the master. Side note: the except errors.ConfigurationError part could be cleaned up, since in 2.0 we don't usually get that directly, and if we do it's an error and we shouldn't touch the file anyway; but that is not a rc5 change. Signed-off-by:
Iustin Pop <iustin@google.com>
-
- Apr 06, 2009
-
-
Iustin Pop authored
Currently the watcher spews errors message on non-master nodes. This cleans it up. Reviewed-by: imsnah
-
Iustin Pop authored
As per the mailing list discussion, this patch changes the watcher to use a single job (two opcodes) for getting the cluster state (node list and instance list); it will then compute the needed actions based on this data. The patch also archives this job and the verify-disks job. Reviewed-by: imsnah
-
- Mar 09, 2009
-
-
Iustin Pop authored
Currently, the watcher startup sequence does: - open a luxi client - get the instance list - get the node boot ids - open and lock the status file, and: - archive jobs - restart the down instances - check disks This, of course, can lead to problems when a node is (genuinely or not) locked for more than (watcher interval * maximum query clients) time. At that time, the master is completely unresponsive until the node is unlocked and all the watchers exit with error due to the state file being locked by the first instance. This patch reworks the startup sequence to first open/lock the status file, and only then open a luxi client. This should prevent the above case. Reviewed-by: ultrotter
-
- Feb 24, 2009
-
-
Iustin Pop authored
This patch removes the extra_args parameter and instead switches the instance to the HV_KERNEL_ARGS hypervisor option. This is a big change, but it's a needed cleanup, this extra parameter on all RPC calls is not generic and we also need to have a persistent value here. Reviewed-by: imsnah
-
- Feb 16, 2009
-
-
Iustin Pop authored
The recent change (commit 2151) to the watcher to make it handle offline nodes also saves the offline attribute to the state file, but this is not needed and also breaks the checking of the boot ID. This patch simply removes it, restoring the correct behaviour. Reviewed-by: imsnah
-
Iustin Pop authored
This patch adds auto-archiving of jobs older than 6 hours to the watcher. Reviewed-by: imsnah
-
- Feb 04, 2009
-
-
Iustin Pop authored
This patch adds the framework for, and enables lockless OpQueryInstances. This means that instances will be shown in ERROR_up or ERROR_down state, even though this is not an error (but just an in-progress job). The framework is implemented as follows: - the OpQueryInstances, OpQueryNodes and OpQueryExports opcodes take an additional “use_locking” flag which will denote whether to lock or not; this patch only implements this for LUQueryInstances - the luxi query functions take an additional argument use_locking which is passed to the master daemon, and then passed to the above opcodes - cli.py export a new SYNC_OPT command line options which implement setting this flag to true - except for gnt-instance list, which uses this option, and for name-only queries (e.g. QueryNodes(fields=["names"])), all other callers are setting this flag to True - RAPI also sets the flag to True The patch was tested with a continuous (0.2s sleep in-between) gnt-instance list during a burnin, and no problems were observed. Reviewed-by: ultrotter
-