- Jan 06, 2011
-
-
Michael Hanselmann authored
When the source cluster takes too long to create a snapshot, the destination would time out. Unfortunately no good error message was written unless debug logging was enabled, not even to the log file. This will be improved with this patch. Another patch to be backported from master will hopefully avoid this situation completely. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
- Dec 16, 2010
-
-
Michael Hanselmann authored
The “ensure-dirs” script as included in Ganeti 2.3 is very slow when working with big queues requiring a change of permissions on many or all files. $ find /var/lib/ganeti/queue/ | wc -l 52354 Before this change: $ time /usr/local/lib/ganeti/ensure-dirs -f real 16m4.739s While not adressed in this patch, I'd like to record the overall ineffiency of the “ensure-dirs” script, even after this change: $ time /usr/local/lib/ganeti/ensure-dirs -f real 5m57.362s […] $ strace -e clone,execve -f -c /usr/local/lib/ganeti/ensure-dirs -f % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 50.08 5.147090 49 104774 clone 49.92 5.131094 49 104739 execve More changes will be needed. Just for comparision, a small Python snippet changing permissions on all files (“ensure-dirs” changes the owner too): $ time python -c 'import os; from ganeti import utils; [os.chmod(i, 0644) for i in utils.ListVisibleFiles("/var/lib/ganeti/queue/archive/big")]' real 0m0.605s […] Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
- Nov 29, 2010
-
-
Michael Hanselmann authored
This reduces the number of notifications in “ganeti-rapi”. Until now it was notified for every change in …/lib/ganeti and had to check whether the users file was affected. A symlink is always created in cfgupgrade to not break tools referring to the old name. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
- Nov 19, 2010
-
-
René Nussbaumer authored
Further investigations have to be done for merging some of these bits together with import-export daemon which uses similiar logic. Signed-off-by:
René Nussbaumer <rn@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
-
- Nov 16, 2010
-
-
René Nussbaumer authored
As we need this functionality in other places than just locking it makes sense to move it to utils rather than keeping it in locking Signed-off-by:
René Nussbaumer <rn@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
-
- Oct 29, 2010
-
-
Michael Hanselmann authored
Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
René Nussbaumer <rn@google.com>
-
Michael Hanselmann authored
Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
René Nussbaumer <rn@google.com>
-
Michael Hanselmann authored
Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
René Nussbaumer <rn@google.com>
-
Michael Hanselmann authored
Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
René Nussbaumer <rn@google.com>
-
Michael Hanselmann authored
Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
René Nussbaumer <rn@google.com>
-
Michael Hanselmann authored
Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
René Nussbaumer <rn@google.com>
-
- Oct 28, 2010
-
-
Michael Hanselmann authored
A new constant, LUXI_VERSION, is used to verify the peer's version. The version is optional, so old(er) clients and servers talking to peers not supporting it won't break. Example with mismatching library: $ gnt-instance list Unhandled Ganeti error: LUXI version mismatch, server 2020000, request 1010000 Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
This will show a warning if, for example, one side of a DRBD disk becomes unavailable. The data is collected separately from the other verification data. Example output: * Verifying instance status - ERROR: instance inst1: disk/0 on node2 is faulty Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
- Oct 26, 2010
-
-
René Nussbaumer authored
Signed-off-by:
René Nussbaumer <rn@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
- Oct 14, 2010
-
-
Iustin Pop authored
During cluster maintenance, when the watcher is disabled, it's useful to run it just once. This is incovenient to do currently, as the watcher needs to be unpaused, then run, then paused again. This patch adds an option “--ignore-pause” that can be used to ignore the cluster-level setting. Also the man page is updated as it was missing the options available. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
-
- Oct 13, 2010
-
-
Michael Hanselmann authored
I didn't know why the code previously used “pyinotify.EventsCodes.ALL_FLAGS” instead of using the flags from “pyinotify.EventsCodes” directly. Turns out that Pyinotify 0.8 has them in “pyinotify”, not “pyinotify.EventsCodes”. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
We noticed several issues when just watching the file, among them race conditions upon replacing the file using rename(2) (the new watcher would be created too soon). By just watching the directory for events on the rapi_users file, this can be avoided. A nice side-effect is that now the users file is also reloaded if it didn't exist upon ganeti-rapi's start (see the documentation update). Since ganeti-rapi now becomes active for virtually every change in the configuration directory (…/lib/ganeti), moving the rapi_users file to a separate directory will be considered. It doesn't have to happen in or before this patch, though. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
Reading the file before this function allows for better error reporting. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
Iustin Pop authored
Currently, masterd startup with old software versions is very confusing for users: we present two tracebacks, with a message in the middle about "version mismatch". This can lead to users believing that all that needs to be done is to fix the config file. This patch attempts to improve this by handling this case in masterd itself (not in the child), and showing a more friendly message for this case. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
-
- Oct 07, 2010
-
-
Iustin Pop authored
This makes almost all of the daemons show error messages, and not return until they finished listening on the appropriate sockets. Masterd is the only one "special", as it doesn't do enough initialization in the server creation, only later. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
-
Iustin Pop authored
Currently, GenericMain does a two-staged workflow: - Check, before forking - then Exec, after forking This means we don't have any possibility to treat preparation work (before the daemon is ready for work) different from the actual work. The patch adds another PreExec function that is run just before Exec, and which should ensure that the daemon is ready for serving client before it returns. Its result is then sent as the third argument to Exec. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Guido Trotter <ultrotter@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
-
- Sep 24, 2010
-
-
Michael Hanselmann authored
As already noted in the design document, an opcode's priority is increased when the lock(s) can't be acquired within a certain amount of time, except at the highest priority, where in such a case a blocking acquire is used. A unittest is provided. Priorities are not yet used for acquiring the lock(s)—this will need further changes on mcpu. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
René Nussbaumer <rn@google.com>
-
- Sep 13, 2010
-
-
Michael Hanselmann authored
This patch moves the code watching the users file into a a separate class to not mix it with HTTP serving. The users file is now driven from outside the HTTP server class. Also the documentation is updated to mention the automatic reloading. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
René Nussbaumer <rn@google.com>
-
- Sep 10, 2010
-
-
René Nussbaumer authored
Please note: This only works if the file existed upon startup. If the file was created later, ganeti-rapi has to be restarted. Signed-off-by:
René Nussbaumer <rn@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
-
- Sep 07, 2010
-
-
René Nussbaumer authored
Signed-off-by:
René Nussbaumer <rn@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
-
René Nussbaumer authored
Signed-off-by:
René Nussbaumer <rn@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
-
René Nussbaumer authored
This partially reverts commit 8b72b05c. Basically it removes the user involved changes Signed-off-by:
René Nussbaumer <rn@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
-
- Sep 06, 2010
-
-
René Nussbaumer authored
The startup of the daemons would take a lot of time otherwise, also it's not needed to set the permissions of those file over and over again, because if the daemons are once migrated to the user they will keep creating the file for that user. The full run is intended as initial upgrade Signed-off-by:
René Nussbaumer <rn@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
-
René Nussbaumer authored
Please note that this can and will be improved over time. There are discussions about automated file generation of ensure-dirs so we can _really_ keep all the permissions and file ownerships in one place. Because right now they are all in this file _and_ on every WriteFile call. Signed-off-by:
René Nussbaumer <rn@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
-
- Sep 02, 2010
-
-
Iustin Pop authored
Since the RAPI certificate is not necessarily self-signed, and we currently don't have any configuration variable for the real CA file, we disable for now the CA checks. This fixes the 'restart RAPI every 5 minutes' problem with non-self-signed certs. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
-
- Aug 24, 2010
-
-
Michael Hanselmann authored
This patch adds an initial implementation of a lock monitor, accessible for the user through “gnt-debug locks”. It currently shows all resource locks: BGL, nodes and instances. Config and job queue locks could be shown too, but wouldn't be of much help. The current owner(s) and mode are also shown. Showing pending acquires will require further changes on the SharedLock internals and is not yet implemented. Example output: $ gnt-debug locks -o name,mode,owner Name Mode Owner BGL/BGL shared JobQueue19/Job147 instances/inst1 exclusive JobQueue19/Job147 instances/inst2 - - instances/inst3 - - instances/inst4 - - nodes/node1 exclusive JobQueue19/Job147 nodes/node2 exclusive JobQueue19/Job147 Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Guido Trotter <ultrotter@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
- Aug 23, 2010
-
-
René Nussbaumer authored
Signed-off-by:
René Nussbaumer <rn@google.com> Reviewed-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
-
- Aug 19, 2010
-
-
René Nussbaumer authored
Signed-off-by:
René Nussbaumer <rn@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
- Aug 18, 2010
-
-
Manuel Franceschini authored
This patch enables IPv6 name resolution by using socket.getaddrinfo instead of socket.gethostbyname_ex. It renames the HostInfo class to Hostname and unifies its use throughout the code. This is achieved by using static calls where no object is needed and removes some obsolete code. For now, we just resolve to IPv4 addresses, but this will change once it is needed. Signed-off-by:
Manuel Franceschini <livewire@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
Manuel Franceschini authored
This patch unifies the netutils functions dealing with IP addresses to three classes: - IPAddress: Common IP address functionality - IPv4Address: IPv4 specific functionality - IPv6address: IPv6-specific functionality Furthermore it adds methods to check whether an address is a loopback address, replacing the .startswith("127") for IPv4 and adding IPv6 support. It also provides the basis for future IPv6 address handling. Methods to convert IP strings to their corresponding interger values will allow to canonicalize IPv6 addresses. Signed-off-by:
Manuel Franceschini <livewire@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
- Jul 29, 2010
-
-
Michael Hanselmann authored
By changing it to a normal parameter, which must be a sequence, we can start using keyword parameters. Before this patch all arguments to “AddTask(self, *args)” were passed as arguments to the worker's “RunTask” method. Priorities, which should be optional and will be implemented in a future patch, must be passed as a keyword parameter. This means “*args” can no longer be used as one can't combine *args and keyword parameters in a clean way: >>> def f(name=None, *args): ... print "%r, %r" % (args, name) ... >>> f("p1", "p2", "p3", name="thename") Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: f() got multiple values for keyword argument 'name' Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
- Jul 26, 2010
-
-
Iustin Pop authored
Currently, the master IP activation is done in the Exec function. Since the original masterd process returns after forking, and Exec is run in the (grand)child process, this means that after 'ganeti-masterd' has returned there are still initialization tasks running. Normally this is not a problem, but in cases where one does quick master failovers, this creates a race condition which hits the QA scripts especially hard. To solve this, and make the startup process cleaner (the system is in steady state after the command has returned, even though masterd startup could still fail), we move the IP activation to Check(). This also allows error messages about the IP activation to be seen on the console. With this patch enabled, I can no longer reproduce the double-failover errors, which were occuring before in 4/5 cases. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
René Nussbaumer <rn@google.com>
-
Iustin Pop authored
This is needed because not just the cli scripts need this decorator, but the master daemon too (and it already duplicated the code once). In cli.py we just leave a stub, so that we don't have to modify all the scripts to import rpc.py. We then change the master daemon code to reuse this decorator, instead of duplicating it. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
René Nussbaumer <rn@google.com>
-
Iustin Pop authored
This patch implements a few changes to the instance handling. First, old instances which no longer exist on the cluster are removed from the state file, to keep things clean. Second, the instance restart counters are reset every 8 hours, since some error cases might be transient (e.g. networking issues, or machine temporarily down), and if the problem takes more than 5 restarts but is not permanent, watcher will not restart the instance. The value of 8 hours is, I think, both conservative (as not to hammer the cluster too often with restarts) and fast enough to clear semi-transient problems. And last, if an instance is not restarted due to exhausted retries, this should be warned, otherwise it's hard to understand why watcher doesn't want to restart an ERROR_down instance. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
René Nussbaumer <rn@google.com>
-