- Mar 04, 2011
Iustin Pop authored
PollJob returns the whole op_results, hence a list of opcode results.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
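As a rough usage sketch only (based on this description; exact function signatures may differ between Ganeti versions), a caller submitting a single opcode would now index into the returned list:

    from ganeti import cli

    def run_single_opcode(client, op):
      # PollJob returns the whole op_results list, one entry per opcode,
      # so a single-opcode job takes the first element. (Sketch only.)
      job_id = client.SubmitJob([op])
      results = cli.PollJob(job_id, cl=client)
      return results[0]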
-
- Feb 03, 2011
Michael Hanselmann authored
The new import/export infrastructure in Ganeti 2.2 and up handles compression differently. It no longer writes compressed files to the destination. Unfortunately changing this behaviour would be non-trivial, so in the meantime setting “compression = none” will hopefully avoid some confusion.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
- Jan 26, 2011
Michael Hanselmann authored
This is analogous to the existing check for a responsive node daemon.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
At least ganeti-confd was not started; it only got started a few minutes later by ganeti-watcher. Also move one pylint disable to the effective line.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
Also replace the hardcoded “xenvg” with a constant.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Iustin Pop authored
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
-
Iustin Pop authored
This skips non-vm_capable nodes in the OS diagnose search, since OSes will not be used on those nodes anyway.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
-
René Nussbaumer authored
Using auto_promote or auto-promote can lead to confusion across the user-facing interfaces: while auto-promote is fine for the CLI, it's not for RAPI, and vice versa. This patch should eliminate this confusion.
Signed-off-by: René Nussbaumer <rn@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Iustin Pop authored
This is a follow-up patch to the one moving GetAllocatable out to module level.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
-
Michael Hanselmann authored
LVM PV storage units would always show as allocatable, even when they weren't. For some reason I have not been able to determine, the function parsing the attributes (“_GetAllocatable”) was not even called, and the list opcode simply returned the attribute string as the value (e.g. “a-”). Removing “@staticmethod” did the trick, and I then moved it to module level. A QA test is included.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
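A minimal sketch of the attribute parsing described here (simplified, not the actual Ganeti code), assuming the PV attribute string starts with “a” when the volume is allocatable:

    def _GetAllocatable(attr):
      """Parse an LVM PV attribute string (e.g. "a-") into a boolean (sketch)."""
      return bool(attr) and attr[0] == "a"

    # "a-" is allocatable, "--" is not.
    assert _GetAllocatable("a-")
    assert not _GetAllocatable("--")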
-
- Jan 20, 2011
Michael Hanselmann authored
With this patch, the exporting node will retry connecting a few times. The receiving node will make use of the master's increased timeout (see the previous patch).
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
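A minimal retry sketch of this behaviour (retry count and delay are assumptions, not the actual Ganeti values):

    import time

    def connect_with_retries(connect_fn, retries=3, delay=10):
      """Call connect_fn() up to `retries` times before giving up (sketch only)."""
      for attempt in range(retries):
        try:
          return connect_fn()
        except EnvironmentError:
          if attempt == retries - 1:
            raise
          time.sleep(delay)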
-
Michael Hanselmann authored
It's been shown that 60 seconds may not be enough to establish a connection.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
- Jan 07, 2011
Michael Hanselmann authored
The data was already there, but not shown.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Adeodato Simo <dato@google.com>
-
- Jan 06, 2011
Michael Hanselmann authored
If the SSH command fails, this will give a more detailed error message than before.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
The source cluster has to shut down an instance before it can be exported. Doing so can take a while, but the default connection timeout is only 60 seconds. Adding the shutdown timeout on the receiving cluster should help.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
(cherry picked from commit dae91d02)
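Sketched arithmetic for this change (constant name and value are assumptions): the importing side adds the source cluster's shutdown timeout on top of the plain connect timeout.

    DEFAULT_CONNECT_TIMEOUT = 60  # seconds; assumed default

    def import_connect_timeout(shutdown_timeout):
      # The importing side may have to wait for the source instance to shut
      # down before any data arrives, so the two timeouts are added up.
      return DEFAULT_CONNECT_TIMEOUT + shutdown_timeout

    # Example: a 300 s shutdown timeout yields a 360 s connect timeout.
    assert import_connect_timeout(300) == 360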
-
- Dec 29, 2010
Michael Hanselmann authored
Since the recent change to leave jobs in the “waitlock” status (commit 5fd6b694), cancelling a job while it's back in the queue would break. This patch handles these cases and adds a unittest.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
- Dec 20, 2010
Michael Hanselmann authored
Point out that jobs already submitted continue to run.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
If the socket can't be read in time, it raises “socket.timeout”, for which there is special handling code. Unfortunately the except blocks were in the wrong order, so “socket.error” caught it first.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
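A minimal sketch of the ordering issue (hypothetical reader function): since “socket.error” is a base class of “socket.timeout” here, the more specific handler must be listed first or it never runs.

    import socket

    def read_response(sock, size=4096):
      """Read from a socket, treating a timeout separately (sketch only)."""
      try:
        return sock.recv(size)
      except socket.timeout:
        # Must come before socket.error, otherwise the base-class handler
        # swallows the timeout as well.
        return None
      except socket.error:
        raise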
-
- Dec 15, 2010
Adeodato Simo authored
`gnt-cluster verify` was failing with a KeyError if there were any diskless instances in the cluster. This was because _CollectDiskInfo() was not including these instances in the returned dictionary, but they were expected to be present in LUVerifyCluster.Exec(). With this commit, we ensure that the dictionary returned by _CollectDiskInfo includes entries for diskless instances as well.
Signed-off-by: Adeodato Simo <dato@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
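A hedged sketch of the idea (simplified names, not the actual cmdlib code): pre-seed the result dictionary so that diskless instances still get an empty entry and later lookups cannot raise KeyError.

    def collect_disk_info(instance_names, per_instance_disks):
      """Return a dict with one entry per instance, even diskless ones (sketch)."""
      result = dict((name, []) for name in instance_names)
      for name, disks in per_instance_disks.items():
        result[name] = list(disks)
      return result

    # "inst2" has no disks but is still present in the result.
    info = collect_disk_info(["inst1", "inst2"], {"inst1": ["disk0"]})
    assert info["inst2"] == []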
-
Michael Hanselmann authored
Iustin Pop reported that a job's file is updated many times while it waits for locks held by other thread(s). After an investigation it was concluded that the reason was a design decision for job priorities to return jobs to the “queued” status if they couldn't acquire all locks. Changing a job's status or priority requires an update to permanent storage. In a high-level view this is what happens:
1. Mark as waitlock
2. Write to disk as permanent storage (jobs left in this state by a crashing master daemon are resumed on restart)
3. Wait for lock (assume lock is held by another thread)
4. Mark as queued
5. Write to disk again
6. Return to workerpool
Another option originally discussed was to leave the job in the “waitlock” status. Ignoring priority changes, this is what would happen:
1. If not in waitlock
1.1. Assert state == queued
1.2. Mark as waitlock
1.3. Set start_timestamp
1.4. Write to disk as permanent storage
3. Wait for locks (assume lock is held by another thread)
4. Leave in waitlock
5. Return to workerpool
Now let's assume the lock is released by the other thread:
[…]
3. Wait for locks and get them
4. Assert state == waitlock
5. Set state to running
6. Set exec_timestamp
7. Write to disk
As this change reduces the number of writes from two per lock acquire attempt to two per opcode, plus one per priority increase (which happens after 24 acquire attempts, see mcpu._CalculateLockAttemptTimeouts, until the highest priority is reached), here's the patch to implement it. Unittests are updated.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
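A rough sketch of the new flow (heavily simplified; the queue and job attribute names are assumptions, not the real JobQueue API): the job is written to disk when it first enters “waitlock” and again when it actually starts running, instead of on every failed lock-acquire attempt.

    def process_opcode(job, op, queue):
      """One lock-acquire attempt for an opcode (sketch of the new behaviour)."""
      if job.status != "waitlock":
        assert job.status == "queued"
        job.status = "waitlock"
        job.start_timestamp = queue.CurrentTime()
        queue.UpdateJobOnDisk(job)      # write #1, once per opcode

      if not queue.TryAcquireLocks(op):
        # Locks held by another thread: stay in "waitlock", no extra disk
        # write, just return the job to the worker pool for another attempt.
        return False

      assert job.status == "waitlock"
      job.status = "running"
      job.exec_timestamp = queue.CurrentTime()
      queue.UpdateJobOnDisk(job)        # write #2, once per opcode
      return True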
-
- Dec 09, 2010
Iustin Pop authored
Commit b8d26c6e added disk status verification, but it has two (different) bugs for unhealthy nodes. For offline nodes, we don't add the disk status to the instance/node dict at all, with the result that the instance is not present in the instdisk dict if all of its nodes are offline; this creates a KeyError later when we call VerifyInstance with instdisk[instance]. For online nodes which don't return a valid disk status, we simply set the status to None for each disk, but the code in _VerifyInstance presumes and requires that each status is a valid tuple of length two. For both these bugs, we redo the instdisk computations to always include valid data, and we enhance the asserts to check for consistency.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
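A simplified sketch of the repaired instdisk computation (helper and field names are assumptions): every (instance, node) pair gets an entry, and missing or invalid node data is replaced with a well-formed (success, message) tuple.

    def build_instdisk(instance_nodes, node_disk_status):
      """instance_nodes: inst -> [nodes]; node_disk_status: node -> inst -> data."""
      instdisk = {}
      for inst, nodes in instance_nodes.items():
        for node in nodes:
          statuses = node_disk_status.get(node, {}).get(inst)
          if not statuses:
            # Offline node or invalid answer: substitute a valid tuple.
            statuses = [(False, "no valid disk status from node %s" % node)]
          instdisk.setdefault(inst, {})[node] = statuses
      return instdisk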
-
Guido Trotter authored
Currently the code wrongly changes the disk logical/physical id component representing the path from "$storage_dir/$iname/disk$seq" to "$storage_dir/$iname/disk/$seq" (note the additional slash), breaking the rename.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
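A minimal illustration of the path bug (hypothetical helper names and example paths): the last component must stay “disk<N>”, not become a “disk/<N>” subdirectory.

    import os.path

    def new_disk_path(storage_dir, instance_name, seq):
      # Correct: .../<instance>/disk0
      return os.path.join(storage_dir, instance_name, "disk%d" % seq)

    def broken_disk_path(storage_dir, instance_name, seq):
      # Buggy variant: .../<instance>/disk/0 - the extra slash breaks the rename.
      return os.path.join(storage_dir, instance_name, "disk", str(seq))

    assert new_disk_path("/srv/storage", "instance1", 0) == \
        os.path.join("/srv/storage", "instance1", "disk0")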
-
- Dec 01, 2010
Michael Hanselmann authored
Just being told that a lock doesn't exist can be confusing. One case where this happens is when a job (e.g. instance modify) waits for a job removing the instance (e.g. export with remove).
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
This uses an option only available in patched socat versions. More information is available from the INSTALL update included in this patch.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
- Nov 30, 2010
Adeodato Simo authored
Signed-off-by: Adeodato Simo <dato@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
-
- Nov 18, 2010
Iustin Pop authored
Currently, reinstallation of a DRBD instance with the secondary node offline does:

  node1# gnt-instance reinstall -f instance1
  Waiting for job 139053 for instance1...
  Thu Nov 18 01:36:09 2010 - WARNING: Could not prepare block device disk/0 on node node3 (is_primary=False, pass=1): Node is marked offline
  Thu Nov 18 01:36:09 2010 - WARNING: Could not shutdown block device disk/0 on node node3: Node is marked offline
  Job 139053 for instance1 has failed: Failure: command execution error: Disk consistency error

Since this fails anyway, let's check the secondary nodes, thus preventing any modifications to the instance (e.g. OS type change):

  node1# gnt-instance reinstall -f instance1
  Waiting for job 139058 for instance1...
  Job 139058 for instance1 has failed: Failure: prerequisites not met for this operation: error type: wrong_state, error details: Instance secondary node offline, cannot reinstall: node3

The patch needs modifications to the _CheckNodeOnline function, in order to display meaningful messages ("Can't use offline node" would be very confusing for an instance reinstall, since we didn't select a node manually).
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
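A hedged sketch of the added prerequisite check (not the actual cmdlib code; the error text follows the output above):

    class ReinstallError(Exception):
      pass

    def check_secondaries_online(secondary_nodes, offline_nodes):
      """Refuse a reinstall early when a secondary node is offline (sketch)."""
      for node in secondary_nodes:
        if node in offline_nodes:
          raise ReinstallError(
            "Instance secondary node offline, cannot reinstall: %s" % node)

    # Matching the output above: node3 is offline, so this would raise.
    # check_secondaries_online(["node3"], {"node3"})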
-
Iustin Pop authored
I was using the feedback_fn function incorrectly (it doesn't automatically expand the arguments).
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
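A minimal illustration of the point (hypothetical caller and message): feedback_fn expects a single, already-formatted message and does not expand extra arguments itself.

    def report_progress(feedback_fn, step, total):
      # Correct: format first, then pass one string.
      feedback_fn("Step %d/%d done" % (step, total))
      # Wrong (the mistake described above):
      # feedback_fn("Step %d/%d done", step, total)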
-
- Nov 17, 2010
Iustin Pop authored
Since the contents of the dict are validated via ForceDictType, we can simply require that it is a dict here. The previous check was wrong, as it was copied from the HV checks (which also don't verify the leaf dict type).
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
-
- Nov 11, 2010
Iustin Pop authored
And fix an error message.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
-
David Knowles authored
Note: It appears this has been around since the initial checkin of TemporaryReservationManager. I have no idea what this could break, so someone else may want to test this more thoroughly.
Signed-off-by: David Knowles <dknowles@google.com>
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
- Nov 03, 2010
Michael Hanselmann authored
Tests have shown that the changes in commit b8d26c6e don't work as wanted. If any disk wasn't found on the node, all disks located on the same node would show as faulty. The cause was incorrect exception handling on the node. This patch changes the RPC call to return a per-disk success/error status, avoiding the problem.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Luca Bigliardi <shammash@google.com>
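An illustrative sketch of the per-disk result scheme (function names are assumptions): the node-side call reports one (success, payload) pair per disk instead of failing wholesale when a single disk is missing.

    def get_blockdev_statuses(disks, find_device_fn):
      """Return a (success, payload) pair for every disk (sketch only)."""
      result = []
      for disk in disks:
        try:
          result.append((True, find_device_fn(disk)))
        except Exception as err:
          # A missing disk only marks this entry as failed; the other disks
          # on the same node still get their real status.
          result.append((False, str(err)))
      return result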
-
Michael Hanselmann authored
Some of them were forgotten.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
-
- Nov 01, 2010
Guido Trotter authored
We can now change a node's secondary IP.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
-
Guido Trotter authored
This is already disabled for the same type of request a couple of lines above. The new code was introduced in e986f20c but didn't have the disables.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
-
Guido Trotter authored
There is no "private" IP in Ganeti; we only have primary and secondary ones. Whether they are public or private is a per-installation detail.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
-
Guido Trotter authored
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
-
Guido Trotter authored
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
-
Guido Trotter authored
The "I always wanted to do this" commit. Signed-off-by:
Guido Trotter <ultrotter@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
-
Guido Trotter authored
Changing the volume group is a lot less frequent than acting on a node group. As such we drop the "-g" shortcut and require the long option to be passed. In 2.3 the commands which used to accept the volume group as "-g" won't have any node group option, so no confusion will arise. Later on we may pass "-g" as the initial node group name to gnt-cluster init, although that's not strictly necessary, as modifying it later is always possible.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
-