- Mar 07, 2011
Iustin Pop authored
* devel-2.2:
  Fix LUClusterRepairDiskSizes and rpc result usage
  Fix RPC mismatch in blockdev_getsize[s]

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
-
- Mar 04, 2011
Iustin Pop authored
This LU was introduced before the RPC result conversion from .data to .payload, and it has managed to keep the old-style usage (how? it's the only LU that does so). Fix by changing to payload, and add some extra logging for easier diagnosis.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Stephen Shirley <diamond@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
(cherry picked from commit 043beb38)
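
A minimal sketch of the two conventions (only the .data and .payload attribute names come from the message; the stand-in class and values are invented):

    class FakeRpcResult(object):
        # Simplified stand-in for Ganeti's RPC result wrapper.
        def __init__(self, payload):
            self.data = (True, payload)  # old style: a (status, data) tuple
            self.payload = payload       # new style: the unpacked result

    result = FakeRpcResult(1024)
    size = result.payload  # the fix: use .payload instead of digging into .data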
-
Iustin Pop authored
Commit 92fd2250 added consistency checks in the RPC layer, which broke the call_blockdev_getsizes RPC call (declared with 's' at the end in rpc.py, without 's' in the node daemon). The immediate fix is to correct the RPC function name; the long-term one will be to remove this duplication.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Stephen Shirley <diamond@google.com>
(cherry picked from commit ccfbbd2d)
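
A hypothetical illustration of the kind of consistency check that caught the mismatch (all names and sets below are made up):

    # Every call_* stub declared on the client side must have a matching
    # handler in the node daemon; a name mismatch like the one fixed here
    # trips the check.
    client_calls = set(["blockdev_getsize", "blockdev_getsizes"])  # rpc.py side
    daemon_handlers = set(["blockdev_getsize"])                    # node daemon side

    missing = client_calls - daemon_handlers
    if missing:
        raise AssertionError("RPC calls without handlers: %s" %
                             ", ".join(sorted(missing)))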
-
Iustin Pop authored
PollJob returns the whole op_results, hence a list of opcode results.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
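
A tiny hedged sketch of the caller-side consequence (the function name is made up):

    def FirstOpcodeResult(op_results):
        # PollJob yields one entry per opcode in the job; a single-opcode
        # caller needs element 0 rather than the whole list.
        assert isinstance(op_results, list)
        return op_results[0]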
-
- Feb 03, 2011
Michael Hanselmann authored
The new import/export infrastructure in Ganeti 2.2 and up handles compression differently. It no longer writes compressed files to the destination. Unfortunately changing this behaviour would be non-trivial, so in the meantime setting “compression = none” will hopefully avoid some confusion.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
- Jan 26, 2011
Michael Hanselmann authored
This is analogous to the existing check for a responsive node daemon.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
At least ganeti-confd was not started. It got started a few minutes later by ganeti-watcher. Also move one pylint disable to the effective line.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
The fact that jobs don't necessarily execute in order has been a source of some confusion. Hopefully this update will clarify things.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
Also replace the hardcoded “xenvg” with a constant.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Iustin Pop authored
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
-
Iustin Pop authored
This skips non-vm_capable nodes in the OS diagnose search, since such OSes will not be used on those nodes anyway.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
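
A minimal sketch of the filtering described above (the attribute name follows the message; everything else is assumed):

    def FilterVmCapable(nodes):
        # Drop non-vm_capable nodes before running the OS diagnose.
        return [node for node in nodes if node.vm_capable]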
-
René Nussbaumer authored
Using auto_promote or auto-promote can lead to confusion in the user-facing interfaces. While auto-promote is fine for the CLI, it's not for RAPI, and vice versa. This patch should eliminate this confusion.

Signed-off-by: René Nussbaumer <rn@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Iustin Pop authored
This is a follow-up patch to the one moving GetAllocatable out to module level.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
-
Michael Hanselmann authored
LVM PV storage units would always show as allocatable, even when they weren't. For some reason I have not been able to determine, the function parsing the attributes (“_GetAllocatable”) was not even called and the list opcode simply returned the attribute string as the value (e.g. “a-”). Removing “@staticmethod” did the trick and then I just moved it to module level. A QA test is included.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
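
A minimal sketch of the module-level parser described above (the attribute format follows LVM's pvs output; the exact signature is assumed):

    def GetAllocatable(attr):
        # Parses an LVM PV attribute string such as "a-"; the first
        # character is "a" exactly when the physical volume is allocatable.
        return bool(attr) and attr[0] == "a"

    # "a-" is allocatable, "--" is not.
    assert GetAllocatable("a-") and not GetAllocatable("--")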
-
- Jan 20, 2011
Michael Hanselmann authored
With this patch, the exporting node will retry connecting a few times. The receiving node will make use of the master's increased timeout (see previous patch).

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
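
A rough sketch of such a retry loop (all names, counts, and delays here are assumptions, not values from the patch):

    import socket
    import time

    def ConnectWithRetries(address, port, retries=5, delay=2.0, timeout=60):
        # Try to connect a few times before giving up for good.
        for attempt in range(retries):
            try:
                return socket.create_connection((address, port), timeout)
            except socket.error:
                if attempt == retries - 1:
                    raise
                time.sleep(delay)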
-
Michael Hanselmann authored
It's been shown that 60 seconds may not be enough to establish a connection.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
- Jan 14, 2011
Guido Trotter authored
burnin is a cluster/testing feature, so it makes sense that a hidden OS can be used for it.

Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
- Jan 12, 2011
Stephen Shirley authored
Also change language slightly for preferred groups to look better now that it's repeated.

Signed-off-by: Stephen Shirley <diamond@google.com>
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
- Jan 07, 2011
Michael Hanselmann authored
The data was already there, but not shown.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Adeodato Simo <dato@google.com>
-
- Jan 06, 2011
Michael Hanselmann authored
If the SSH command fails, this will give a more detailed error message than before.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
The source cluster has to shut down an instance before it can be exported. Doing so can take a while, but the default connection timeout is only 60 seconds. Adding the shutdown timeout on the receiving cluster should help.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
(cherry picked from commit dae91d02)
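
The implied arithmetic is simple; a one-line sketch with assumed names:

    def EffectiveConnectTimeout(shutdown_timeout, base_timeout=60):
        # The receiving cluster waits out the source instance's shutdown
        # on top of the base network timeout.
        return base_timeout + shutdown_timeout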
-
Michael Hanselmann authored
When the source cluster took too long to create a snapshot, the destination would time out. Unfortunately no good error message was written unless debug logging was enabled, not even to the log file. This patch improves that. Another patch, to be backported from master, will hopefully avoid this situation completely.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
-
Michael Hanselmann authored
- Check hostname and abort if it doesn't match the contents of “ssconf_master_node”; can be overridden using the “--ignore-hostname” parameter.
- Clarify the confirmation question and don't mention instances anymore.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
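
A rough sketch of such a hostname safety check (the ssconf path is Ganeti's usual location and the option name comes from the message; everything else is assumed):

    import socket

    def CheckMasterHostname(ignore_hostname=False):
        # Abort unless this node's name matches the recorded master node.
        master = open("/var/lib/ganeti/ssconf_master_node").read().strip()
        if not ignore_hostname and socket.getfqdn() != master:
            raise SystemExit("This node (%s) is not the master (%s); use"
                             " --ignore-hostname to override" %
                             (socket.getfqdn(), master))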
-
Michael Hanselmann authored
No need to copy this snippet around; “make” can work harder for us.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
This will allow distributions to install the file as text documentation.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
- Jan 05, 2011
Michael Hanselmann authored
This patch formats the upgrade notes currently in the wiki [1] as reST and adds them to the documentation.

[1] http://code.google.com/p/ganeti/wiki/UpgradeNotes

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
- Dec 31, 2010
Michael Hanselmann authored
s/os-name/os-type/. This was reported in issue 133.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
- Dec 29, 2010
Michael Hanselmann authored
Since the recent change to leave jobs in the “waitlock” status (commit 5fd6b694), cancelling a job while it's back in the queue would break. This patch handles these cases and adds a unittest.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
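
A rough sketch of handling both states (the status names come from these commit messages; the methods and strings are assumed):

    def CancelJob(job):
        if job.status == "queued":
            # Never started; can be finalized as canceled right away.
            job.Finalize("canceled")
        elif job.status == "waitlock":
            # Back in the queue waiting for locks; flag it so the worker
            # notices and stops before executing the opcode.
            job.status = "canceling"
        else:
            raise Exception("Job is already running and cannot be canceled")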
-
- Dec 20, 2010
Michael Hanselmann authored
Point out that jobs already submitted continue to run.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
If the socket can't be read in time, it raises “socket.timeout”, for which there is special handling code. Unfortunately the except blocks were in the wrong order, and “socket.error” caught it first.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
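
Since socket.timeout is a subclass of socket.error, clause order matters; a minimal sketch of the fixed ordering (the function and the error handling are invented):

    import socket

    def ReadResponse(conn):
        try:
            return conn.recv(4096)
        except socket.timeout:        # must come first: socket.timeout is a
            return None               # subclass of socket.error, so a preceding
        except socket.error as err:   # socket.error clause would swallow it
            raise RuntimeError("connection error: %s" % err)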
-
Michael Hanselmann authored
* stable-2.3:
  Prepare 2.3.1 release
  Fix disk status verification in LUClusterVerify

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
- Dec 17, 2010
Michael Hanselmann authored
“gnt-cluster verify” looks at some per-instance information as well, so it should be run in the QA tests for each instance type.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
- Dec 16, 2010
Michael Hanselmann authored
The “ensure-dirs” script as included in Ganeti 2.3 is very slow when working with big queues requiring a change of permissions on many or all files.

    $ find /var/lib/ganeti/queue/ | wc -l
    52354

Before this change:

    $ time /usr/local/lib/ganeti/ensure-dirs -f
    real    16m4.739s

While not addressed in this patch, I'd like to record the overall inefficiency of the “ensure-dirs” script, even after this change:

    $ time /usr/local/lib/ganeti/ensure-dirs -f
    real    5m57.362s
    […]
    $ strace -e clone,execve -f -c /usr/local/lib/ganeti/ensure-dirs -f
    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
     50.08    5.147090          49    104774           clone
     49.92    5.131094          49    104739           execve

More changes will be needed. Just for comparison, a small Python snippet changing permissions on all files (“ensure-dirs” changes the owner too):

    $ time python -c 'import os; from ganeti import utils; [os.chmod(i, 0644) for i in utils.ListVisibleFiles("/var/lib/ganeti/queue/archive/big")]'
    real    0m0.605s
    […]

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
- Dec 15, 2010
Adeodato Simo authored
`gnt-cluster verify` was failing with KeyError if there was any diskless instance in the cluster. This was because _CollectDiskInfo() was not including these instances in the returned dictionary, but they were expected to be present in LUVerifyCluster.Exec(). With this commit, we ensure that the dictionary returned by _CollectDiskInfo includes entries for diskless instances as well.

Signed-off-by: Adeodato Simo <dato@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
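
A minimal sketch of the shape of the fix (the function name comes from the message; the data layout is assumed):

    def _CollectDiskInfo(instance_names, node_disks):
        # Start with an (empty) entry for every instance, so diskless
        # instances are present in the result and later lookups in
        # LUVerifyCluster.Exec() cannot raise KeyError.
        info = dict((name, []) for name in instance_names)
        for name, disks in node_disks.items():
            info[name].extend(disks)
        return info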
-
Michael Hanselmann authored
Iustin Pop reported that a job's file is updated many times while it waits for locks held by other thread(s). After an investigation it was concluded that the reason was a design decision for job priorities to return jobs to the “queued” status if they couldn't acquire all locks. Changing a job's status or priority requires an update to permanent storage. In a high-level view this is what happens:

1. Mark as waitlock
2. Write to disk as permanent storage (jobs left in this state by a crashing master daemon are resumed on restart)
3. Wait for lock (assume lock is held by another thread)
4. Mark as queued
5. Write to disk again
6. Return to workerpool

Another option originally discussed was to leave the job in the “waitlock” status. Ignoring priority changes, this is what would happen:

1. If not in waitlock
   1.1. Assert state == queued
   1.2. Mark as waitlock
   1.3. Set start_timestamp
   1.4. Write to disk as permanent storage
3. Wait for locks (assume lock is held by another thread)
4. Leave in waitlock
5. Return to workerpool

Now let's assume the lock is released by the other thread:

[…]
3. Wait for locks and get them
4. Assert state == waitlock
5. Set state to running
6. Set exec_timestamp
7. Write to disk

As this change reduces the number of writes from two per lock acquire attempt to two per opcode and one per priority increase (as happens after 24 acquire attempts (see mcpu._CalculateLockAttemptTimeouts) until the highest priority is reached), here's the patch to implement it. Unittests are updated.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
- Verify job file updates
- Ensure queue lock is released while executing opcode

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-