Commits · e014f1d0e6f5188b2827c549a40b4bddacbc50cd · itminedu / snf-ganeti

Oct 22, 2009

KVMHypervisor: configure v6 parameters on nic · e014f1d0

Guido Trotter authored 15 years ago


In routing mode we are tweaking a few parameters on the interface. With
this patch we'll tweak both the v4 and v6 ones.

Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

e014f1d0

KVMHypervisor: implement instance policy routing · 2c5afffb

Guido Trotter authored 15 years ago

Until now we relied on traffic from instances being policy routed via a
rule based on the instance network. With this change we can enforce it
on the instance interfaces. Since the ip rules survive the interface
disappearing and reappearing, when creating the interface we first need
to remove leftover rules and then apply the new one.

Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

2c5afffb

Adding '--no-ssh-init' option to 'gnt-cluster init'. · b989b9d9

Ken Wehr authored 15 years ago


Allows the initialization of a cluster without the creation or distribution
of SSH key pairs. Includes changes for LeaveCluster and RPC.

Signed-off-by: Ken Wehr <ksw@google.com>
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

b989b9d9

confd: query the pnode of multiple instances at once · 95b487bb

Flavio Silvestrow authored 15 years ago


Signed-off-by: Flavio Silvestrow <flaviops@google.com>
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

95b487bb

Try to reduce wrong errors in InstanceShutdown · 3782acd7

Iustin Pop authored 15 years ago


In backend.InstanceShutdown(), there is a race condition between
checking that the instance exists and trying to shut it down which
sometimes translates into error messages like:

Tue Oct 20 20:08:30 2009 - WARNING: Could not shutdown instance: Failed
to force stop instance instance9: Failed to stop instance instance9:
exited with exit code 1, Error: Domain 'instance9' does not exist.

To fix this, we ignore any hypervisor StopInstance() errors if the
instance doesn't exist anymore, since our purpose (to make the instance
go away) is already accomplished.
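
The approach can be sketched as follows. This is a minimal illustration,
not the code from this commit: `HypervisorError`, `FakeHypervisor` and
`shutdown_instance` are hypothetical names, and the fake hypervisor only
simulates the race described above.

```python
class HypervisorError(Exception):
    """Hypothetical error raised when a hypervisor command fails."""


class FakeHypervisor:
    """Hypothetical stand-in for a hypervisor, used only for this sketch."""

    def __init__(self, instances):
        self._instances = set(instances)

    def list_instances(self):
        return sorted(self._instances)

    def stop_instance(self, name):
        # Simulate the race: the instance vanishes on its own and the
        # stop command then fails because the domain no longer exists.
        self._instances.discard(name)
        raise HypervisorError("Domain '%s' does not exist." % name)


def shutdown_instance(hyper, name):
    """Return True if the instance is gone, re-raising only real failures."""
    if name not in hyper.list_instances():
        return True
    try:
        hyper.stop_instance(name)
    except HypervisorError:
        # Ignore the error if the instance disappeared in the meantime:
        # the goal (making the instance go away) is already accomplished.
        if name in hyper.list_instances():
            raise
    return name not in hyper.list_instances()
```

The error is swallowed only after re-checking the instance list, so a stop
failure for an instance that is still present is reported as before.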

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

3782acd7

Revert breakage introduced in · 7734de0a

Iustin Pop authored 15 years ago


Commit e4e9b806 introduced two problems
in backend.InstanceShutdown():

- first, it reduced the check interval significantly (especially for the
  first few checks); there are very few production VMs that shut down in
  one second, and while this does not break anything, it creates
  unnecessary load for the hypervisor
- second, a wrong test added to the while condition (“not tried_once”)
  means that we only sleep once for an instance, and after that we
  immediately kill it forcefully

These two together mean that any instance which is not lucky enough to
finish in roughly 1-1.5 seconds (the time it takes to sleep and verify
the instance list again) will have this happen:

2009-10-21 23:33:46,034:  pid=16634 INFO Called for inst9 w. False/False
2009-10-21 23:33:47,440:  pid=16634 ERROR Shutdown of 'inst9' unsuccessful, forcing
2009-10-21 23:33:47,440:  pid=16634 INFO Called for inst9 w. True/False

The “Called…” lines are logs from the hypervisor shutdown function. Of
course, this means that at restart time:

[12775866.644682] EXT3-fs: INFO: recovery required on readonly filesystem.
[12775866.644689] EXT3-fs: write access will be enabled during recovery.
[12775868.533674] kjournald starting.  Commit interval 5 seconds
[12775868.533697] EXT3-fs: sda1: orphan cleanup on readonly fs
[12775868.551797] EXT3-fs: sda1: 12 orphan inodes deleted
[12775868.551803] EXT3-fs: recovery complete.
[12775868.586275] EXT3-fs: mounted filesystem with ordered data mode.

This patch reverts the broken test and changes the sleep to a fixed
duration of five seconds, since it makes no sense to check that often
for shutdown (and after ~20 seconds we reach a stable value of five
seconds anyway).
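
A rough sketch of the resulting loop is shown below. It is an illustration
under assumptions, not the patched function: `wait_for_shutdown`, its
parameters and the 120-second default are hypothetical, and the clock and
sleep functions are injectable only so the loop can be exercised quickly.

```python
import time


def wait_for_shutdown(is_running, request_stop, timeout=120.0, interval=5.0,
                      sleep=time.sleep, clock=time.time):
    """Ask for a clean stop until the deadline, then force it.

    Returns True if the instance stopped cleanly, False if it was forced.
    """
    deadline = clock() + timeout
    while clock() < deadline:
        if not is_running():
            return True
        request_stop(force=False)
        # Fixed interval between checks; there is no "tried once" shortcut
        # that would cut the waiting period short.
        sleep(interval)
    if not is_running():
        return True
    request_stop(force=True)
    return False
```

With this shape the forced stop is reached only after the whole timeout has
elapsed, rather than after the first sleep.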

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

7734de0a

Xen: Ignore the retry argument in stop instance · 0cf11e68

Iustin Pop authored 15 years ago


Commit 4ad45119 changed the KVM hypervisor to send multiple shutdown
requests to the monitor, but it didn't change this for the Xen
hypervisor. We simply remove the return-on-retry model, since we do want
to send multiple shutdown signals for both Xen and KVM (even if the
behaviour is not perfect, they should behave the same).

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

0cf11e68

Oct 21, 2009

Ensure RpcResult has “payload” attribute · 1645d22d

Michael Hanselmann authored 15 years ago


Also add assertions to avoid missing attributes in the future.
They won't be included in optimized bytecode.
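
As an illustration of the pattern (a sketch only; `RpcResultSketch` and its
constructor arguments are hypothetical and do not mirror the real class),
every code path sets the attribute and an assertion guards against future
omissions. Python drops `assert` statements when run with `-O`, so the check
costs nothing in optimized bytecode.

```python
class RpcResultSketch(object):
    """Hypothetical result object; not the real RpcResult interface."""

    def __init__(self, data=None, failed=False):
        if failed:
            self.fail_msg = "call failed"
            self.payload = None
        else:
            self.fail_msg = None
            self.payload = data
        # Development-time guards; skipped entirely under "python -O".
        assert hasattr(self, "payload")
        assert hasattr(self, "fail_msg")
```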

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

1645d22d

Oct 20, 2009

Introduce checks for /sys and /proc · 7c0aa8e9

Iustin Pop authored 15 years ago


This patch adds checks for /proc and /sys in cluster verify, since
Ganeti relies on these special filesystems to be mounted.
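
A check of this kind could look roughly like the sketch below. The function
name and message format are hypothetical, and using `os.path.ismount` is an
assumption about how such a check might be written, not a description of
what the patch does.

```python
import os


def check_special_filesystems(paths=("/proc", "/sys"), ismount=os.path.ismount):
    """Return a list of problem strings, empty when all paths are mounted."""
    problems = []
    for path in paths:
        if not ismount(path):
            problems.append("%s is not mounted" % path)
    return problems
```

The `ismount` parameter exists only so the function can be tried without
touching the real filesystem.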

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

7c0aa8e9

Oct 19, 2009

Fix serializer unittests · d357f531

Michael Hanselmann authored 15 years ago


Commit d22b2999 broke the serializer unittests with certain
versions of simplejson. This patch removes sort_keys again
and implements a slightly more efficient way of detecting
simplejson functionality. The serializer unittests no longer
use a partially broken mock, but rather a function to convert all
tuples to lists before comparing.
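
The comparison helper can be sketched like this. `tuples_to_lists` is a
hypothetical name, and the example uses the standard library `json` module
in place of simplejson, so it illustrates the idea rather than the test code
itself. JSON has no tuple type, so tuples come back as lists after a round
trip and must be normalized before comparing.

```python
import json


def tuples_to_lists(value):
    """Recursively convert tuples to lists so round-tripped data compares equal."""
    if isinstance(value, (list, tuple)):
        return [tuples_to_lists(item) for item in value]
    if isinstance(value, dict):
        return dict((key, tuples_to_lists(item)) for key, item in value.items())
    return value


data = {"a": (1, 2, ("x",)), "b": [3, (4,)]}
round_tripped = json.loads(json.dumps(data))
```

Comparing `round_tripped` against `tuples_to_lists(data)` then succeeds,
whereas comparing it against `data` directly does not.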

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

d357f531

Oct 16, 2009

bootstrap: Factorize HMAC key generation · c008906b

Michael Hanselmann authored 15 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

c008906b

Make bootstrap._GenerateSelfSignedSslCert public · cd34faf2

Michael Hanselmann authored 15 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

cd34faf2

serializer: Sort keys in JSON · d22b2999

Michael Hanselmann authored 15 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

d22b2999

Oct 15, 2009

mcpu: Use new timeout class for timeout · a6db1af2

Michael Hanselmann authored 15 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

a6db1af2

locking: Convert pipe condition to new timeout class · f4e673fb

Michael Hanselmann authored 15 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

f4e673fb

locking.LockSet: Move timeout calculation to separate class · 7e8841bd

Michael Hanselmann authored 15 years ago


This class can also be used by mcpu.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

7e8841bd

locking, mcpu: Ensure timeout is always >= 0.0 · b6b87034

Michael Hanselmann authored 15 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

b6b87034

Oct 13, 2009

locking.LockSet: Improve assertions · e4335b5b

Michael Hanselmann authored 15 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

e4335b5b

locking: Factorize LockSet.acquire · 76e2f08a

Michael Hanselmann authored 15 years ago


By moving the main code of LockSet.acquire to its own function
we reduce the code complexity a bit and clarify the exception
handling.

This also fixes a case where a lock acquire timeout wasn't
handled correctly, leading to obscure error messages.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

76e2f08a

mcpu: Make sure added locks are released on errors · 6f14fc27

Michael Hanselmann authored 15 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

6f14fc27

opcodes: Add missing shutdown_timeout to OpRemoveInstance · fc1baca9

Michael Hanselmann authored 15 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

fc1baca9

luxi: Pass socket path directly to exception, not in tuple · 63d96e4c

Michael Hanselmann authored 15 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

63d96e4c

gnt-* use the correct opcode slot to build opcodes · 4d98c565

Guido Trotter authored 15 years ago


gnt-* scripts were building wrong opcodes for commands which had the
shutdown_timeout slot (due to missing testing after renaming). Fixing.

Also change the SHUTDOWN_TIMEOUT_OPT dest field name to "shutdown_timeout":
it was set to "timeout". It would still work that way, but could be
confusing.
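
Why the names need to line up can be sketched as follows. Everything here is
illustrative: the `--shutdown-timeout` flag spelling, the default of 120 and
the `OpShutdownSketch` class are assumptions, not the actual Ganeti
definitions. The point is only that the option's `dest` and the opcode slot
share one name, so the value is passed through without renaming.

```python
import optparse

parser = optparse.OptionParser()
# Assumed flag name and default; dest matches the opcode slot name.
parser.add_option("--shutdown-timeout", dest="shutdown_timeout", type="int",
                  default=120, help="Maximum time to wait for shutdown")


class OpShutdownSketch(object):
    """Hypothetical opcode that rejects keyword arguments without a slot."""

    __slots__ = ["instance_name", "shutdown_timeout"]

    def __init__(self, **kwargs):
        for key, value in kwargs.items():
            if key not in self.__slots__:
                raise TypeError("unknown opcode slot %r" % key)
            setattr(self, key, value)


opts, _ = parser.parse_args(["--shutdown-timeout", "30"])
op = OpShutdownSketch(instance_name="instance9",
                      shutdown_timeout=opts.shutdown_timeout)
```

Building the same opcode with `timeout=30` raises `TypeError` in this sketch,
which is the kind of mismatch that goes unnoticed without testing after a
rename.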

Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

4d98c565

rapi: fix tag operations · 64246438

Iustin Pop authored 15 years ago


This patch fixes the tag PUT/DELETE operations, and additionally changes
the _Tags_* functions to take only positional and not keyword arguments
(the defaults do not make any sense at all, and they are always called
with all arguments).

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

64246438

Add timeout options to other LUs · 17c3f802

Guido Trotter authored 15 years ago


All the LUs that shut down the instance need to be able to pass the
timeout parameter as well.

Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

17c3f802

cli: add SHUTDOWN_TIMEOUT_OPT · 7e5eaaa8

Guido Trotter authored 15 years ago


Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

7e5eaaa8

Oct 12, 2009

mcpu: Change lock attempt timeout calculation · e3200b18

Michael Hanselmann authored 15 years ago


With this patch all timeouts are pre-calculated. The interface of
the _LockTimeoutStrategy class is also changed a bit; NextAttempt
now returns a new instance.
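
The interface change can be pictured with the sketch below. It is not the
`_LockTimeoutStrategy` class: `LockAttemptSketch` and its method names are
hypothetical, and using `None` to stand for a blocking acquire is an
assumption of this example.

```python
class LockAttemptSketch(object):
    """Immutable view of one lock attempt over pre-calculated timeouts."""

    def __init__(self, timeouts, index=0):
        self._timeouts = tuple(timeouts)
        self._index = index

    def timeout(self):
        """Return this attempt's timeout, or None once the list is exhausted."""
        if self._index < len(self._timeouts):
            return self._timeouts[self._index]
        return None  # stands for a blocking acquire in this sketch

    def next_attempt(self):
        """Return a new object for the following attempt; self is unchanged."""
        return LockAttemptSketch(self._timeouts, self._index + 1)
```

Because `next_attempt` returns a fresh object, a caller holding the previous
attempt still sees its original timeout.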

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

e3200b18

Code and docstring style fixes · 69b99987

Michael Hanselmann authored 15 years ago


Found using pylint and epydoc.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

69b99987

mcpu: Improve lock reporting with timeouts · 211b6132

Michael Hanselmann authored 15 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

211b6132

mcpu: Implement lock timeouts · 407339d0

Michael Hanselmann authored 15 years ago


The timeout is always between ~0.1 and ~10.0 seconds. A small
variation of ±5% is added to prevent different jobs from
fighting each other. After 10 attempts to acquire the locks with
a timeout, a blocking acquire is made.
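
One way to picture this schedule is the sketch below. Only the bounds, the
±5% variation and the ten timed attempts come from the description above;
the geometric growth curve, the function name and the injectable `rng` are
assumptions made for illustration.

```python
import random


def attempt_timeouts(attempts=10, min_timeout=0.1, max_timeout=10.0,
                     rng=random.random):
    """Yield one timeout per timed attempt, then None for a blocking acquire."""
    steps = max(attempts - 1, 1)
    for attempt in range(attempts):
        # Assumed growth curve: geometric interpolation from min to max.
        fraction = attempt / float(steps)
        base = min_timeout * (max_timeout / min_timeout) ** fraction
        # Vary each timeout by up to +/-5% so that concurrent jobs do not
        # retry in lockstep.
        jitter = base * 0.05 * (2.0 * rng() - 1.0)
        yield base + jitter
    yield None
```

A caller would iterate over the generator, trying a timed acquire for each
number and falling back to a blocking acquire when it receives `None`.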

Lock status reporting will be improved in a separate patch.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

407339d0

mcpu: Remove unused exclusive_BGL attribute · 6b95b76d

Michael Hanselmann authored 15 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

6b95b76d

locking.LockSet: Implement acquire timeouts · 5aab242c

Michael Hanselmann authored 15 years ago

The timeout passed to LockSet.acquire() is measured over all lock acquires. If
LockSet.acquire fails to acquire all requested locks within the specified
amount of time, all locks are released again and the acquire fails.
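
The semantics can be sketched with plain `threading.Lock` objects. This is
not the LockSet implementation; `acquire_all` is a hypothetical helper that
only shows a single time budget shared across all acquires, with everything
released again on failure.

```python
import threading
import time


def acquire_all(locks, timeout):
    """Acquire every lock within one shared time budget, or none at all."""
    deadline = time.monotonic() + timeout
    acquired = []
    for lock in locks:
        # The remaining budget shrinks with each acquire and never goes
        # below zero.
        remaining = max(0.0, deadline - time.monotonic())
        if not lock.acquire(timeout=remaining):
            # Out of time: release what was taken so far and give up.
            for held in reversed(acquired):
                held.release()
            return False
        acquired.append(lock)
    return True
```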

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

5aab242c

Oct 09, 2009

Accept shutdown timeout from the user · 6263189c

Guido Trotter authored 15 years ago


Using the new --timeout option:

- gnt-instance shutdown is changed to accept a timeout
- the opcode is changed to hold one
- the LU is changed to optionally get one
- the rpc is changed to carry one
- the backend is changed to take it as a parameter rather than
  hardcoding it in the function

Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

6263189c

cli: add a timeout option · b5762e2a

Guido Trotter authored 15 years ago


Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

b5762e2a

ChrootManager: clean StopInstance · a2771c83

Guido Trotter authored 15 years ago


Currently it has lots of duplicated code and internal retries.
Clean it up with the following assumptions:

- We'll probably be called more than once.
- It is ok to fail to stop, unless we're called with force=True.
- If we're called only once, and with force=True, it's ok not to run the
  chroot "cleanup" script (it's a destroy after all, why should chroots
  have more chances than other instances?).

Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Olivier Tharan <olive@google.com>

a2771c83

KVMHypervisor: use the StopInstance retry feature · 4ad45119

Guido Trotter authored 15 years ago


Since we know StopInstance is going to be called more than once (at
least twice, once with force and once without, but normally quite a lot
more) we don't need our own sleep/loop, and we can just send one monitor
command per call.

Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Olivier Tharan <olive@google.com>

4ad45119

backend.InstanceShutdown: small cleanup · e4e9b806

Guido Trotter authored 15 years ago


1) unhardcode the timeout, abstracting it in a constant
2) Use time.time() rather than hiding the timeout in a range()
3) call hyper.StopInstance multiple times
   -- currently all hypervisors just ignore all calls but the first
4) Use hyper.ListInstances() rather than GetInstanceList([hv_name])
   -- it's cheaper :)
5) Change the final message to "forcing" from "using destroy"

Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Olivier Tharan <olive@google.com>

e4e9b806

Add default instance shutdown timeout constant · 88cd08aa

Guido Trotter authored 15 years ago


It reflects the "current" two minutes we give to the instance.

Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

88cd08aa

Hypervisors: Add retry= to StopInstance · 07b49e41

Guido Trotter authored 15 years ago


Currently some hypervisors need the stop operations to be retried more
than once, while other ones only do it in one pass. With this change
we'll handle retries outside the hypervisor code, while telling the
hypervisor whether this is the first try or not.

Since this option is not used for now, all hypervisors just return if
called with retry set to on, maintaining the old behavior. Since the
fake hypervisor has an idempotent StopInstance call, we avoid returning
in that case.
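
The calling convention can be sketched as below. `OneShotHypervisor` and the
loop around it are hypothetical; they only illustrate a caller-owned retry
loop in which every call after the first is flagged with `retry=True` and,
for now, ignored by the hypervisor.

```python
class OneShotHypervisor(object):
    """Hypothetical hypervisor that only acts on the first stop request."""

    def __init__(self):
        self.commands_sent = 0

    def stop_instance(self, name, force=False, retry=False):
        if retry:
            # Old behaviour preserved: repeated calls are a no-op.
            return
        self.commands_sent += 1


hyper = OneShotHypervisor()
for attempt in range(3):
    # The caller owns the retry loop and flags every call after the first.
    hyper.stop_instance("instance9", retry=(attempt > 0))
```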

Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Olivier Tharan <olive@google.com>

07b49e41

Get rid of utils.CommaJoin · 6915bc28

Guido Trotter authored 15 years ago


- We never remember to use it (5 uses vs 21 ", ".join())
- It's longer to write than ", ".join()
- The added value of the apostrophe in the string is small

Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

6915bc28