Commits · 53197381d71bae82e8b89fc85aff6fc86574d16a · itminedu / snf-ganeti

Jan 06, 2011

import-export: Improve timeout error reporting · bd275a93

Michael Hanselmann authored 14 years ago


When the source cluster takes too long to create a snapshot, the
destination would time out. Unfortunately, no good error message was
written, not even to the log file, unless debug logging was enabled.
This patch improves the error reporting.

Another patch to be backported from master will hopefully avoid this
situation completely.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


Dec 16, 2010

ensure-dirs: Speed up when using big queues · 196d70fa

Michael Hanselmann authored 14 years ago


The “ensure-dirs” script as included in Ganeti 2.3 is very slow when
working with big queues requiring a change of permissions on many or all
files.

$ find /var/lib/ganeti/queue/ | wc -l
52354

Before this change:
$ time /usr/local/lib/ganeti/ensure-dirs -f
real    16m4.739s

While not addressed in this patch, I'd like to record the overall
inefficiency of the “ensure-dirs” script, even after this change:

$ time /usr/local/lib/ganeti/ensure-dirs -f
real    5m57.362s
[…]
$ strace -e clone,execve -f -c /usr/local/lib/ganeti/ensure-dirs -f
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 50.08    5.147090          49    104774           clone
 49.92    5.131094          49    104739           execve

More changes will be needed. Just for comparison, a small Python
snippet changing permissions on all files (“ensure-dirs” changes the
owner too):

$ time python -c 'import os; from ganeti import utils;
[os.chmod(i, 0644) for i in
utils.ListVisibleFiles("/var/lib/ganeti/queue/archive/big")]'
real    0m0.605s
[…]
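The strace numbers above suggest that most of the time goes into spawning child processes. As a rough illustration only (this is not the “ensure-dirs” implementation; the function name and the default mode are assumptions), an in-process walk that touches a file only when its mode differs could look like this:

```python
import os
import stat


def fix_permissions(root, file_mode=0o644):
    """Chmod regular files below root in-process; return how many changed.

    Hypothetical helper: no child process is spawned, and files that
    already have the wanted mode are left alone.
    """
    changed = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if not stat.S_ISREG(st.st_mode):
                continue  # skip symlinks and special files
            if stat.S_IMODE(st.st_mode) != file_mode:
                os.chmod(path, file_mode)
                changed += 1
    return changed
```

A second run over the same tree changes nothing, which is the property a fast repeated startup check would need.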

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


Nov 29, 2010

Move “rapi_users” file into separate directory · fdd9ac5b

Michael Hanselmann authored 14 years ago


This reduces the number of notifications in “ganeti-rapi”. Until now it
was notified for every change in …/lib/ganeti and had to check whether
the users file was affected. A symlink is always created in cfgupgrade
to not break tools referring to the old name.
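A minimal sketch of the compatibility symlink idea, assuming the upgrade step only needs to make the old name point at the new location. The helper name and its behaviour when the old path still exists are assumptions, not the cfgupgrade code:

```python
import os


def ensure_compat_symlink(old_path, new_path):
    """Make old_path a symlink to new_path unless old_path already exists.

    Hypothetical helper; returns True when a symlink was created.
    """
    if os.path.islink(old_path) or os.path.exists(old_path):
        return False  # leave whatever is already there untouched
    os.symlink(new_path, old_path)
    return True
```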

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


impexpd: Implement support for IPv6 · 58bb385c

Michael Hanselmann authored 14 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


Nov 19, 2010

Support timeouts in RunCmd · c74cda62

René Nussbaumer authored 14 years ago


Further investigation is needed to merge some of these bits with the
import-export daemon, which uses similar logic.
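For readers unfamiliar with the idea, a command runner with a timeout can be sketched with the standard library alone. This is not the RunCmd implementation from the patch; the function name and the return tuple are assumptions:

```python
import subprocess


def run_cmd(argv, timeout=None):
    """Run argv and return (exit_code, output, timed_out).

    Hypothetical sketch: when the timeout expires, subprocess.run kills
    the child and raises TimeoutExpired, which is reported as
    timed_out=True with no exit code.
    """
    try:
        proc = subprocess.run(argv, stdout=subprocess.PIPE,
                              stderr=subprocess.STDOUT, timeout=timeout)
    except subprocess.TimeoutExpired as err:
        return (None, err.output or b"", True)
    return (proc.returncode, proc.stdout, False)
```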

Signed-off-by: René Nussbaumer <rn@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>


Nov 16, 2010

Move locking.RunningTimeout to utils · 557838c1

René Nussbaumer authored 14 years ago


As we need this functionality in places other than locking, it makes
sense to move it to utils rather than keeping it in locking.
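A "running timeout" of this kind tracks how much of an overall time budget is left across several calls. The sketch below illustrates the concept only; the class name, method name and injectable clock are assumptions and need not match the moved code:

```python
import time


class RemainingTime:
    """Track the time left from an overall timeout (hypothetical sketch)."""

    def __init__(self, timeout, time_fn=time.monotonic):
        self._timeout = timeout      # None means "no timeout"
        self._time_fn = time_fn      # injectable clock, handy for tests
        self._start = None

    def remaining(self):
        """Return seconds left (never negative), or None without a timeout."""
        if self._timeout is None:
            return None
        now = self._time_fn()
        if self._start is None:
            self._start = now        # the clock starts on first use
        return max(0.0, self._timeout - (now - self._start))
```

The value returned by `remaining()` can be passed as the timeout of each successive blocking call, so the total wait stays within the original budget.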

Signed-off-by: René Nussbaumer <rn@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>


Oct 29, 2010

Make *.in non-executable · 98028e5d

Michael Hanselmann authored 14 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>


Move ganeti-rapi to ganeti.server.rapi · d9c82a4e

Michael Hanselmann authored 14 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>


Move ganeti-noded to ganeti.server.noded · 5119f2ec

Michael Hanselmann authored 14 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>


Move ganeti-confd to ganeti.server.confd · 5c9c0e0e

Michael Hanselmann authored 14 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>


Move ganeti-masterd to ganeti.server.masterd · 29d91329

Michael Hanselmann authored 14 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>


Move ganeti-watcher to ganeti.watcher · 9f4bb951

Michael Hanselmann authored 14 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>


Oct 28, 2010

Add support and checks for version in LUXI · e986f20c

Michael Hanselmann authored 14 years ago


A new constant, LUXI_VERSION, is used to verify the peer's version. The
version is optional, so old(er) clients and servers talking to peers not
supporting it won't break. Example with mismatching library:

$ gnt-instance list
Unhandled Ganeti error: LUXI version mismatch, server 2020000, request
1010000
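The "optional version" behaviour can be sketched as follows. This assumes a JSON-encoded request with hypothetical field names ("version", "method", "args") and a hypothetical exception class; only the version number is taken from the example output above:

```python
import json

PROTOCOL_VERSION = 2020000  # value borrowed from the example output


class VersionMismatch(Exception):
    """Raised when the peer announces a different protocol version."""


def parse_request(raw, expected=PROTOCOL_VERSION):
    """Decode a request, checking the version only if the peer sent one."""
    msg = json.loads(raw)
    version = msg.get("version")
    # Peers that do not send a version are accepted as before.
    if version is not None and version != expected:
        raise VersionMismatch("version mismatch, server %s, request %s"
                              % (expected, version))
    return msg["method"], msg.get("args")
```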

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


LUClusterVerify: Complain if disk is marked faulty · b8d26c6e

Michael Hanselmann authored 14 years ago


This will show a warning if, for example, one side of a DRBD
disk becomes unavailable. The data is collected separately
from the other verification data.

Example output:

* Verifying instance status
 - ERROR: instance inst1: disk/0 on node2 is faulty

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


Oct 26, 2010

Adding RPC call for blockdev_wipe · 271b7cf9

René Nussbaumer authored 14 years ago


Signed-off-by: René Nussbaumer <rn@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


Oct 14, 2010

Add a new watcher option --ignore-pause · 46c8a6ab

Iustin Pop authored 14 years ago


During cluster maintenance, when the watcher is disabled, it's useful to
run it just once. This is currently inconvenient, as the watcher
needs to be unpaused, then run, then paused again.

This patch adds an option “--ignore-pause” that can be used to ignore
the cluster-level setting. Also the man page is updated as it was
missing the options available.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>


Oct 13, 2010

Fix compatibility with Pyinotify 0.8 · ac96953d

Michael Hanselmann authored 14 years ago


I didn't know why the code previously used
“pyinotify.EventsCodes.ALL_FLAGS” instead of using the flags from
“pyinotify.EventsCodes” directly. Turns out that Pyinotify 0.8 has them
in “pyinotify”, not “pyinotify.EventsCodes”.
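One way to cope with both layouts is to look the flag up at module level first and fall back to the “EventsCodes” class. This is only an illustration of the difference described above, not necessarily what the patch does; the helper name is an assumption, and the module is passed in so the sketch does not depend on a particular Pyinotify version:

```python
def get_event_flag(pyinotify_module, name):
    """Return the inotify flag called name from either known location.

    Hypothetical helper: tries the module-level attribute (Pyinotify 0.8
    layout) and falls back to the EventsCodes class (older layout).
    """
    try:
        return getattr(pyinotify_module, name)
    except AttributeError:
        return getattr(pyinotify_module.EventsCodes, name)
```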

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


ganeti-rapi: Watch directory, not file for user file changes · 073c31a5

Michael Hanselmann authored 14 years ago


We noticed several issues when just watching the file, among them race
conditions upon replacing the file using rename(2) (the new watcher
would be created too soon). By just watching the directory for events on
the rapi_users file, this can be avoided.

A nice side-effect is that now the users file is also reloaded if it
didn't exist upon ganeti-rapi's start (see the documentation update).

Since ganeti-rapi now becomes active for virtually every change in the
configuration directory (…/lib/ganeti), moving the rapi_users file to a
separate directory will be considered. It doesn't have to happen in or
before this patch, though.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


http.auth.ReadPasswordFile: Don't read file directly · 2287b920

Michael Hanselmann authored 14 years ago


Reading the file before this function allows for better error
reporting.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


"Fix" handling of old software versions on startup · 4b63dc7a

Iustin Pop authored 14 years ago


Currently, masterd startup with old software versions is very confusing
for users: we present two tracebacks, with a message in the middle about
"version mismatch". This can lead to users believing that all that needs
to be done is to fix the config file.

This patch attempts to improve this by handling this case in masterd
itself (not in the child), and showing a more friendly message for this
case.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>


Oct 07, 2010

Convert ganeti daemons to the three-stage startup · 3ee53f1f

Iustin Pop authored 14 years ago


This makes almost all of the daemons show error messages, and not return
until they are listening on the appropriate sockets.

Masterd is the only "special" one, as it doesn't do enough
initialization in the server creation, only later.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>


Change utils.GenericMain protocol · b42ea9ed

Iustin Pop authored 14 years ago


Currently, GenericMain does a two-staged workflow:

- Check, before forking
- then Exec, after forking

This means we cannot treat preparation work (before the daemon is ready
for work) differently from the actual work.

The patch adds another PreExec function that is run just before Exec,
and which should ensure that the daemon is ready for serving clients
before it returns. Its result is then sent as the third argument to
Exec.
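The resulting three-stage flow can be sketched like this. The function and parameter names are hypothetical, and forking is replaced by a callable placeholder so the ordering is easy to see:

```python
def generic_main(check_fn, prepare_fn, exec_fn, options, args, daemonize):
    """Run a daemon's startup in three stages (hypothetical sketch).

    check_fn runs before forking, prepare_fn runs after forking and must
    only return once the daemon is ready, and exec_fn does the real work
    with prepare_fn's result as its third argument.
    """
    check_fn(options, args)                     # stage 1: before forking
    daemonize()                                 # placeholder for the fork
    prep_data = prepare_fn(options, args)       # stage 2: get ready
    return exec_fn(options, args, prep_data)    # stage 3: serve
```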

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>


Sep 24, 2010

jqueue: Use timeout when acquiring locks · 26d3fd2f

Michael Hanselmann authored 14 years ago


As already noted in the design document, an opcode's priority is
increased when the lock(s) can't be acquired within a certain amount of
time, except at the highest priority, where in such a case a blocking
acquire is used.

A unittest is provided. Priorities are not yet used for acquiring the
lock(s)—this will need further changes on mcpu.
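The escalation scheme can be illustrated with a plain threading lock. This is a sketch under stated assumptions, not the jqueue code: the constants are made up, and a lower number is assumed to mean a higher priority:

```python
import threading

HIGHEST_PRIORITY = 0   # assumption: lower number means higher priority
DEFAULT_PRIORITY = 3   # hypothetical starting priority


def acquire_with_escalation(lock, priority=DEFAULT_PRIORITY, timeout=1.0):
    """Acquire lock, raising the priority after every timed-out attempt.

    Returns the priority at which the lock was finally acquired. At the
    highest priority a blocking acquire is used instead of a timeout.
    """
    while True:
        if priority <= HIGHEST_PRIORITY:
            lock.acquire()               # blocking, no timeout
            return priority
        if lock.acquire(timeout=timeout):
            return priority
        priority -= 1                    # timed out: escalate and retry
```

In this sketch the priority does not influence which waiter gets the lock; it only records how far the request has been escalated.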

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>


Sep 13, 2010

RAPI server: Move user file watching out, update documentation · e4ef4343

Michael Hanselmann authored 14 years ago


This patch moves the code watching the users file into
a separate class to not mix it with HTTP serving. The users
file is now driven from outside the HTTP server class.

Also the documentation is updated to mention the automatic
reloading.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>


Sep 10, 2010

Update the authentication mapping in RAPI if users file has been updated · a2e60f14

René Nussbaumer authored 14 years ago


Please note: This only works if the file existed upon startup. If the file was
created later, ganeti-rapi has to be restarted.

Signed-off-by: René Nussbaumer <rn@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>


Sep 07, 2010

Modify daemon-util to support launching daemons under different user/groups · cbccd9ca

René Nussbaumer authored 14 years ago


Signed-off-by: René Nussbaumer <rn@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Remove utils.EnsureDir as this is done by ensure-dirs.in now · fd346851

René Nussbaumer authored 14 years ago


Signed-off-by: René Nussbaumer <rn@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Partial Revert "Let ganeti-rapi run under a different user/group" · 69d89cb5

René Nussbaumer authored 14 years ago


This partially reverts commit 8b72b05c.

Basically, it removes the user-related changes.

Signed-off-by: René Nussbaumer <rn@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>


Sep 06, 2010

Allow ensure-dirs to run partially and skip big file chunks · b370482d

René Nussbaumer authored 14 years ago


The startup of the daemons would take a lot of time otherwise.
It is also not necessary to set the permissions of those files over
and over again: once the daemons have been migrated to the
user, they will keep creating the files for that user.

The full run is intended for the initial upgrade.

Signed-off-by: René Nussbaumer <rn@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>


Adapt ensure-dirs to accommodate the additional permissions and files · 5224330e

René Nussbaumer authored 14 years ago

Please note that this can and will be improved over time. There are discussions
about automated file generation of ensure-dirs so we can _really_ keep all the
permissions and file ownerships in one place. Because right now they are all
in this file _and_ on every WriteFile call.

Signed-off-by: René Nussbaumer <rn@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>


Sep 02, 2010

Disable the RAPI CA checks in watcher · 34f06005

Iustin Pop authored 14 years ago


Since the RAPI certificate is not necessarily self-signed, and we
currently don't have any configuration variable for the real CA file, we
disable for now the CA checks. This fixes the 'restart RAPI every 5
minutes' problem with non-self-signed certs.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>


Aug 24, 2010

Add simple lock monitor · 19b9ba9a

Michael Hanselmann authored 14 years ago


This patch adds an initial implementation of a lock monitor, accessible
for the user through “gnt-debug locks”. It currently shows all resource
locks: BGL, nodes and instances. Config and job queue locks could be
shown too, but wouldn't be of much help.  The current owner(s) and mode
are also shown.

Showing pending acquires will require further changes on the SharedLock
internals and is not yet implemented.

Example output:
$ gnt-debug locks -o name,mode,owner
Name            Mode      Owner
BGL/BGL         shared    JobQueue19/Job147
instances/inst1 exclusive JobQueue19/Job147
instances/inst2 -         -
instances/inst3 -         -
instances/inst4 -         -
nodes/node1     exclusive JobQueue19/Job147
nodes/node2     exclusive JobQueue19/Job147

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


Aug 23, 2010

Add RPC calls to update /etc/hosts · 19ddc57a

René Nussbaumer authored 14 years ago


Signed-off-by: René Nussbaumer <rn@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>


Aug 19, 2010

Removing all ssh setup code from the core · e8d61457

René Nussbaumer authored 14 years ago


Signed-off-by: René Nussbaumer <rn@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


Aug 18, 2010

Support for resolving hostnames to IPv6 addresses · b705c7a6

Manuel Franceschini authored 14 years ago


This patch enables IPv6 name resolution by using socket.getaddrinfo
instead of socket.gethostbyname_ex.

It renames the HostInfo class to Hostname and unifies its use throughout
the code. This is achieved by using static calls where no object is
needed. The patch also removes some obsolete code.

For now, we just resolve to IPv4 addresses, but this will change once it
is needed.
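A minimal sketch of IPv4-only resolution through socket.getaddrinfo, as described above. The helper name is hypothetical; switching the family argument is what would later allow IPv6 results:

```python
import socket


def resolve_ipv4(name):
    """Return the IPv4 addresses for name using getaddrinfo.

    Hypothetical helper: restricting the family to AF_INET keeps the
    result IPv4-only, and SOCK_STREAM avoids one entry per socket type.
    """
    infos = socket.getaddrinfo(name, None, socket.AF_INET, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); for
    # AF_INET the sockaddr is an (address, port) pair.
    return [sockaddr[0] for (_family, _type, _proto, _canon, sockaddr) in infos]
```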

Signed-off-by: Manuel Franceschini <livewire@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


Introduce new IPAddress classes · 8b312c1d

Manuel Franceschini authored 14 years ago


This patch unifies the netutils functions dealing with IP addresses to
three classes:
- IPAddress: Common IP address functionality
- IPv4Address: IPv4-specific functionality
- IPv6Address: IPv6-specific functionality

Furthermore it adds methods to check whether an address is a loopback
address, replacing the .startswith("127") for IPv4 and adding IPv6
support.

It also provides the basis for future IPv6 address handling. Methods to
convert IP strings to their corresponding integer values will make it
possible to canonicalize IPv6 addresses.
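To illustrate the three ideas (loopback check, integer conversion, canonical form), here is a sketch that uses Python's standard “ipaddress” module in place of the classes introduced by this patch. The helper names are hypothetical:

```python
import ipaddress


def is_loopback(address):
    """True for IPv4 and IPv6 loopback addresses; parses the address
    instead of comparing string prefixes."""
    return ipaddress.ip_address(address).is_loopback


def to_int(address):
    """Convert an IPv4 or IPv6 address string to its integer value."""
    return int(ipaddress.ip_address(address))


def canonicalize(address):
    """Return the canonical (compressed) textual form of an address."""
    return str(ipaddress.ip_address(address))
```

Two spellings of the same IPv6 address map to the same integer and the same canonical string, which is what makes comparisons reliable.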

Signed-off-by: Manuel Franceschini <livewire@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


Jul 29, 2010

workerpool: Change signature of AddTask function to not use *args · b2e8a4d9

Michael Hanselmann authored 14 years ago


By changing it to a normal parameter, which must be a sequence, we can
start using keyword parameters.

Before this patch all arguments to “AddTask(self, *args)” were passed as
arguments to the worker's “RunTask” method. Priorities, which should be
optional and will be implemented in a future patch, must be passed as a keyword
parameter. This means “*args” can no longer be used as one can't combine *args
and keyword parameters in a clean way:

>>> def f(name=None, *args):
...   print "%r, %r" % (args, name)
...
>>> f("p1", "p2", "p3", name="thename")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: f() got multiple values for keyword argument 'name'
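With the task arguments passed as one sequence, optional keyword parameters no longer collide with them. The sketch below is a hypothetical queue, not the workerpool code; the priority handling (lower number runs first) is an assumption, since priorities are only planned at this point:

```python
import heapq
import itertools


class TaskQueue:
    """Tiny task queue taking arguments as a sequence (hypothetical)."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()   # keeps FIFO order per priority

    def add_task(self, args, priority=0):
        """Queue a task; args must be a list or tuple of task arguments."""
        if not isinstance(args, (list, tuple)):
            raise TypeError("args must be a sequence")
        heapq.heappush(self._heap,
                       (priority, next(self._counter), tuple(args)))

    def run_next(self, run_task):
        """Pop the next task and call run_task with its arguments."""
        _priority, _order, args = heapq.heappop(self._heap)
        return run_task(*args)
```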

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


Jul 26, 2010

masterd: move the IP activation from Exec to Check · 340f4757

Iustin Pop authored 14 years ago


Currently, the master IP activation is done in the Exec function. Since
the original masterd process returns after forking, and Exec is run in
the (grand)child process, this means that after 'ganeti-masterd' has
returned there are still initialization tasks running.

Normally this is not a problem, but in cases where one does quick master
failovers, this creates a race condition which hits the QA scripts
especially hard.

To solve this, and make the startup process cleaner (the system is in
steady state after the command has returned, even though masterd startup
could still fail), we move the IP activation to Check(). This also
allows error messages about the IP activation to be seen on the console.

With this patch enabled, I can no longer reproduce the double-failover
errors, which were occurring before in 4/5 cases.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>


Move the UsesRPC decorator from cli to rpc · e0e916fe

Iustin Pop authored 14 years ago


This is needed because not just the cli scripts need this decorator, but
the master daemon too (and it already duplicated the code once).

In cli.py we just leave a stub, so that we don't have to modify all the
scripts to import rpc.py.

We then change the master daemon code to reuse this decorator, instead
of duplicating it.
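The shape of such a decorator can be sketched as follows. The names and the explicit init/shutdown parameters are assumptions made to keep the example self-contained; the real decorator's setup and teardown calls are not shown in this log:

```python
import functools


def uses_rpc(init_fn, shutdown_fn):
    """Build a decorator that wraps a function in setup and teardown.

    Hypothetical sketch: init_fn runs before the wrapped function and
    shutdown_fn always runs afterwards, even if the function raises.
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            init_fn()
            try:
                return fn(*args, **kwargs)
            finally:
                shutdown_fn()
        return wrapper
    return decorator
```

A stub in another module can then simply re-export the decorator under the old name, so existing callers keep working without changing their imports.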

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>


watcher: smarter handling of instance records · f5116c87

Iustin Pop authored 14 years ago

This patch implements a few changes to the instance handling. First, old
instances which no longer exist on the cluster are removed from the
state file, to keep things clean.

Second, the instance restart counters are reset every 8 hours, since
some error cases might be transient (e.g. networking issues, or machine
temporarily down), and if the problem takes more than 5 restarts but is
not permanent, watcher will not restart the instance. The value of 8
hours is, I think, both conservative (as not to hammer the cluster too
often with restarts) and fast enough to clear semi-transient problems.

And last, if an instance is not restarted due to exhausted retries, a
warning should be issued; otherwise it's hard to understand why watcher
doesn't want to restart an ERROR_down instance.
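The record handling described above can be sketched as two small functions. The record layout, the names and the choice to measure the 8 hours from the last restart attempt are assumptions; only the limits (5 restarts, 8 hours) come from the text:

```python
MAX_RESTARTS = 5               # restart attempts before giving up
RESET_INTERVAL = 8 * 3600      # seconds after which the counter resets


def prune_records(records, existing_instances):
    """Drop state for instances that no longer exist on the cluster."""
    return {name: rec for name, rec in records.items()
            if name in existing_instances}


def should_restart(record, now):
    """Decide whether to restart, updating the record in place.

    record is a dict with "count" and "last_attempt" (hypothetical
    layout). A False result is where a warning would be logged.
    """
    if record["count"] and now - record["last_attempt"] >= RESET_INTERVAL:
        record["count"] = 0    # old failures may have been transient
    if record["count"] >= MAX_RESTARTS:
        return False
    record["count"] += 1
    record["last_attempt"] = now
    return True
```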

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
