Commits · 958d01f8f739093f752bff4af259f3e9bb7ff4c7 · itminedu / snf-ganeti

May 03, 2011

cmdlib: Fix typo, s/nick/NIC/ · 958d01f8

Michael Hanselmann authored 14 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

958d01f8

A small optimisation in cluster verify · 8dddc5bc

Iustin Pop authored 14 years ago


This removes (count of instances + count of nodes) lock
acquires/releases.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

8dddc5bc

May 02, 2011

A few docstring fixes · 72740756

Iustin Pop authored 14 years ago


At least one generates an epydoc error :)

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

72740756

luxi: do not handle KeyboardInterrupt · d143f2c6

Iustin Pop authored 14 years ago


With the current code, it's possible to mistake a ^C for a protocol
error:

node1# gnt-job info 221691
[press ^C]
Unhandled protocol error while talking to the master daemon:
Error while deserializing response:

(and note empty error message).

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

d143f2c6
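
The failure mode in d143f2c6 can be illustrated with a minimal Python 3 sketch. It is not the actual luxi code: `ProtocolError` and `parse_response` are hypothetical names, and JSON is assumed as the wire format only for the sake of the example. The point is that catching `Exception` rather than using a bare `except:` lets `KeyboardInterrupt` propagate instead of being reported as a protocol error.

```python
import json


class ProtocolError(Exception):
    """Hypothetical error type for malformed responses."""


def parse_response(raw, loads=json.loads):
    """Deserialize a response, wrapping only genuine decoding failures."""
    try:
        return loads(raw)
    except Exception as err:
        # KeyboardInterrupt derives from BaseException, not Exception,
        # so a ^C pressed during decoding is not swallowed here.
        raise ProtocolError("Error while deserializing response: %s" % err)
```

The `loads` parameter exists only so the behaviour can be exercised without a real connection.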

Handle EPIPE errors while writing to the terminal · 225e2544

Iustin Pop authored 14 years ago


This handles EPIPE errors in two places: ToStream (to catch logging
done in GenericMain itself) and in GenericMain (to cover also plain
print statements).

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

225e2544
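
A rough Python 3 sketch of the kind of EPIPE handling described in 225e2544 follows. The function name `to_stream` is illustrative and the body is an assumption, not the code from the commit; it shows only the general technique of ignoring a closed pipe while re-raising every other I/O error.

```python
import errno


def to_stream(stream, text):
    """Write one line to stream, silently ignoring a closed pipe."""
    try:
        stream.write(text + "\n")
        stream.flush()
    except OSError as err:
        # BrokenPipeError is the OSError subclass raised for EPIPE,
        # e.g. when the output is piped into "head" and head exits.
        if err.errno != errno.EPIPE:
            raise
```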

Cluster verify: check for missing bridges · 20d317d4

Iustin Pop authored 14 years ago


Currently cluster verify doesn't check for bridge information; the
only checks are done at instance create and failover/migrate
time. This means a cluster that seems healthy will fail creation jobs.

This patch implements a simple verification that all nodes (in the
entire cluster, so doesn't work well for multi-group) have all the
required bridges: the default one plus any instance bridge.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

20d317d4
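
The check described in 20d317d4 amounts to a set difference per node. The sketch below assumes the bridge data has already been collected into plain Python containers; the function name and argument layout are hypothetical and do not reflect the cmdlib implementation.

```python
def find_missing_bridges(default_bridge, instance_bridges, node_bridges):
    """Return {node: sorted missing bridges} for nodes lacking any bridge.

    instance_bridges: iterable of bridge names used by instance NICs.
    node_bridges: dict mapping node name to the bridges present on it.
    """
    # Required everywhere: the default bridge plus any instance bridge.
    required = {default_bridge} | set(instance_bridges)
    missing = {}
    for node, present in node_bridges.items():
        absent = required - set(present)
        if absent:
            missing[node] = sorted(absent)
    return missing
```

Like the patch, this treats the cluster as a whole and makes no attempt to scope the required bridges per node group.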

Apr 29, 2011

TLReplaceDisks: Use implicit loop for dictionary · 29b8eaee

Michael Hanselmann authored 14 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

29b8eaee

Release unneeded locks while replacing disks · 1bee66f3

Michael Hanselmann authored 14 years ago

If an iallocator is used, “gnt-instance replace-disks” would acquire the
locks of all nodes (only the allocator will decide which node to use).
Unfortunately the unneeded locks were not released during the operation,
causing unnecessary delays for other jobs.

This patch changes the LU to release unneeded locks and adds assertions.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

1bee66f3

locking: Export “list_owned” from lock manager · 07cba1bc

Michael Hanselmann authored 14 years ago


This is analogous to “is_owned” and will be used for assertions.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

07cba1bc

gnt-instance: Fix typo in error message · d8d838cb

Michael Hanselmann authored 14 years ago


The iallocator parameter is “-I”, not “-i”.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

d8d838cb

mlock: fail gracefully if libc.so.6 cannot be loaded · adc523ab

Iustin Pop authored 14 years ago


This allows noded to continue instead of blowing up if the libc major
number changes.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

adc523ab
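
A minimal sketch of the graceful failure described in adc523ab, assuming the library is loaded through `ctypes`. The function name `load_libc` and the log wording are illustrative, not taken from the commit.

```python
import ctypes
import logging


def load_libc(name="libc.so.6"):
    """Return a ctypes handle for libc, or None if it cannot be loaded."""
    try:
        return ctypes.CDLL(name)
    except OSError as err:
        # Report the problem and let the caller continue without
        # memory locking instead of crashing the daemon.
        logging.error("Cannot load %s, memory locking disabled: %s", name, err)
        return None
```

A caller would check for `None` before attempting any `mlockall`-style call on the handle.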

Apr 28, 2011

Allow creating the DRBD metadev in a different VG · 87001920

Iustin Pop authored 14 years ago


This is a simple change to allow specifying a different VG for the
meta device during the creation of instances and addition of disks via
gnt-instance modify.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

87001920

Make _GenerateDRBD8Branch accept different VG names · c260fa25

Iustin Pop authored 14 years ago


This is a small change to make this function take a list of VG names,
instead of a single one.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

c260fa25

Fix WriteFile with unicode data · 1d39e245

Iustin Pop authored 14 years ago


Unicode is fun, indeed:

>>> len(buffer("abc"))
3
>>> len(buffer(u"abc"))
12

So we can't pass unicode data to buffer(), as the result will be to
write the in-memory (usually UTF-32) representation to disk.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

1d39e245
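
The `buffer()` builtin shown in 1d39e245 exists only in Python 2. As a hedged Python 3 illustration of the same idea (encode text explicitly before handing it to a byte-oriented write), the helper below uses a hypothetical name and assumes UTF-8; the commit message does not say which encoding the actual fix chose.

```python
def to_bytes(data, encoding="utf-8"):
    """Return data as bytes, encoding text explicitly.

    The UTF-8 default is an assumption made for this sketch.
    """
    if isinstance(data, str):
        return data.encode(encoding)
    return bytes(data)
```

With this, `len(to_bytes("abc"))` is 3 regardless of how the interpreter stores the string in memory.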

Apr 27, 2011

Replace disks: keep the meta device in the same VG · fd09d178

Iustin Pop authored 14 years ago


This patch enhances the multi-VG support in replace disks, by keeping
the meta device in the same VG, as opposed to moving it to the data
device VG (note that we don't have a way to create the meta in a
different VG in the first place, but at least we correctly handle a
custom config).

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

fd09d178

Fix for multiple VGs - PlainToDrbd and replace-disks · 88aa7f66

Doug Dumitru authored 14 years ago


When converting an instance from 'plain' to 'drbd', the old code would
create the drbd volumes in the default VG and then the renames would
fail.  This fix pulls the plain VG names from the existing volumes and
places them into the new disk template.

Running 'replace-disks' has a similar issue with the new disks going
into the wrong VG and then the rename failing.

There might be a similar issue with 'recreate-disks', but I actually
have no idea what recreate-disks does, so did not look into it.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

88aa7f66

Fix potential data-loss in utils.WriteFile · 437c3e77

Iustin Pop authored 14 years ago


os.write can do incomplete writes, as long as at least some bytes have
been written (like write(2)):

>>> os.write(fd, " " * 1300)
1300
>>> os.write(fd, " " * 1300)
1300
>>> os.write(fd, " " * 1300)
1300
>>> os.write(fd, " " * 1300)
980
>>> os.write(fd, " " * 1300)
Traceback (most recent call last):
 File "<stdin>", line 1, in ?
OSError: [Errno 28] No space left on device

Note the incomplete write, which only wrote 980 bytes, before the
exception.

To work around this, we simply iterate until all data is
written. Unittests could be written by using a parameter instead of
hardcoding os.write and checking for incomplete writes.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

437c3e77
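
The loop described in 437c3e77 can be sketched in Python 3 as follows. This is an illustration rather than the code from utils.WriteFile: the name `write_all` is hypothetical, and the `write_fn` parameter follows the commit's own suggestion of passing the write function in so that short writes can be simulated in tests.

```python
import os


def write_all(fd, data, write_fn=os.write):
    """Write all of data (bytes) to fd, retrying after short writes."""
    view = memoryview(data)
    while len(view) > 0:
        # os.write may write fewer bytes than requested; it returns
        # the number actually written, so continue from that point.
        written = write_fn(fd, view)
        view = view[written:]
```

Errors such as ENOSPC still propagate as `OSError`; the loop only guards against silently dropping the tail of the data after a short write.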

Improve error messages in cluster verify/OS · 2db04578

Iustin Pop authored 14 years ago


A few issues in the clarity of the error messages are fixed:

- "ERROR: node node3: OS API version lenny-image": no preposition
  between the parameter type and the OS name, changed to "for
  lenny-image"

- "API version lenny-image differs from reference node node1: 10, 5
  vs. 10, 20, 5, 15": parameters not sorted in display

- "OS variants list lenny-image differs from reference node node1:
  vs. default, i386": empty sets are not clearly delimited, changed to
  add [] around the sets: "node node1: [] vs. [default, i386]"

- "OS parameters lenny-image differs from reference node node1:
  vs. (u'dhcp', u'Whether to enable (yes) or disable (dhcp)')": ugly
  formatting in the OS parameters list, as we used to just "%s" the
  tuple; now it is "reference node node1: [] vs. [dhcp: Whether to
  enable (yes) or disable (dhcp)]"

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

2db04578
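
The formatting changes listed in 2db04578 (sorted values, explicit delimiters around sets, readable parameter entries) can be approximated with two small helpers. The names are hypothetical and this is a sketch in Python 3, not the code from the commit; in particular, putting brackets around the API version list is an assumption, since the commit message only says those values are now sorted.

```python
def format_set(values):
    """Render a collection as a sorted, bracket-delimited list."""
    return "[%s]" % ", ".join(str(v) for v in sorted(values))


def format_params(params):
    """Render (name, description) pairs as "name: description" entries."""
    return "[%s]" % ", ".join("%s: %s" % pair for pair in sorted(params))
```

An empty set renders as `[]`, which makes messages of the form "[] vs. [default, i386]" unambiguous.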

Prevent readding of the master node · d833acc6

Iustin Pop authored 14 years ago


This breaks Ganeti in multiple ways. If we don't make the check in
gnt-node itself, then bootstrap.SetupNodeDaemon will restart the
master daemon, making the operation fail:

  node1# gnt-node add --readd node1
  Cannot communicate with the master daemon.
  Is it running and listening for connections?

The check in cmdlib is more of a safety check, as we shouldn't reach
it. If we do (via a bad client), then it will prevent breakage in the
job queue/config handling.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

d833acc6

Fix punctuation in an error message · cce6f357

Iustin Pop authored 14 years ago


IIRC we don't use punctuation at the end of error messages.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

cce6f357

Apr 21, 2011

cli: Fix wrong argument kind for groups · dadf6b7d

Michael Hanselmann authored 14 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

dadf6b7d

Quote filename in gnt-instance.8 · b5a418aa

Michael Hanselmann authored 14 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

b5a418aa

Apr 20, 2011

Fix typo in LUGroupAssignNodes · 97b40f39

Michael Hanselmann authored 14 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

97b40f39

gnt-instance info: automatically request locking · 5c097318

Iustin Pop authored 14 years ago


Commit dae661a4 added support for controlling the locking, but it
didn't modify the gnt-instance info code, which leads to this command
always showing:

Wed Apr 20 04:10:48 2011  - WARNING: Non-static data requested, locks
need to be acquired

We simply change gnt-instance to request locks whenever we don't use
the static mode.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

5c097318

Document the dependency on OOB for gnt-node power · bee8c465

Iustin Pop authored 14 years ago


Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

bee8c465

Apr 19, 2011

Fix master IP activation in failover with no-voting · 675e2bf5

Iustin Pop authored 14 years ago


Thanks to net.for.hub@gmail.com for reporting this. The logic in
masterd.CheckMasterd did an early return in case of no_voting, hence
skipping the master IP activation. We just change the ifs to not
return but simply continue through the function.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

675e2bf5

disk wiping: fix bug in chunk size computation · 6e7f0cd9

Iustin Pop authored 14 years ago


The current wipe_chunk_size computation is doing min(int_value,
float_value). For small disks (below 10GiB), the actual formula will
result in the float value being chosen. This results in very
interesting behaviour:

Wiping disk 0, offset 102.4, chunk 102.4
Wiping disk 0, offset 204.8, chunk 102.4
…
Wiping disk 0, offset 921.6, chunk 102.4
Wiping disk 0, offset 1024.0, chunk 1.13686837722e-13

Since these are passed to dd via %d, this will result in the call to
dd specifying offset 1024 and count 0, which will fail.

We just need to enforce conversion to int, in order to not get bitten
by floating point rounding errors.

The patch also reorders some logging messages in order to log the
chunk size.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

6e7f0cd9
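
A hedged Python 3 sketch of the integer-only chunking that 6e7f0cd9 describes. The 1024 MiB cap and the 10 percent factor are assumptions inferred from the log excerpt (chunks of 102.4 on what appears to be a 1024 MiB disk, and the 10GiB threshold); the function name and the lower bound of 1 are likewise illustrative and not taken from the patch.

```python
def wipe_chunks(size_mib, max_chunk_mib=1024, min_chunk_percent=10):
    """Yield (offset, length) pairs in whole MiB covering size_mib."""
    # Convert to int before looping so that offsets never accumulate
    # floating point error; the lower bound of 1 avoids a zero-sized
    # chunk on very small disks (an addition made for this sketch).
    chunk = max(1, int(min(max_chunk_mib,
                           size_mib / 100.0 * min_chunk_percent)))
    offset = 0
    while offset < size_mib:
        length = min(chunk, size_mib - offset)
        yield offset, length
        offset += length
```

Because every offset and length is an integer, formatting them with `%d` for a dd-style command can no longer produce a trailing chunk with count 0.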

Fix bug in watcher · a0aa6b49

Michael Hanselmann authored 14 years ago


If “utils.RunParts” were to raise an exception, a log message was
written and the code continued to run. Due to the exception the
“results” variable would not be defined.

Also change the code to log a backtrace (getting an exception is rather
unlikely and having a backtrace is useful) and update one comment.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

a0aa6b49
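
The bug pattern in a0aa6b49 (continuing after a caught exception and then using a variable that was never assigned) and one possible shape of a fix are sketched below in Python 3. `run_hooks` and its arguments are hypothetical; returning an empty list on failure is an assumption about reasonable behaviour, not a description of what the watcher does.

```python
import logging


def run_hooks(run_parts, hooks_dir):
    """Run hook scripts and return their results, or [] on failure."""
    try:
        results = run_parts(hooks_dir)
    except Exception:
        # logging.exception records the message together with the
        # backtrace; returning here avoids touching "results", which
        # was never assigned if run_parts raised.
        logging.exception("RunParts %s failed", hooks_dir)
        return []
    return results
```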

Apr 14, 2011

Release locks before wiping disks during instance creation · 4a2c0db0

Michael Hanselmann authored 14 years ago


Ganeti 2.3 introduced an optional feature to overwrite an instance's
disks on creation. Unfortunately the code kept all locks while doing the
wipe, slowing down the creation of multiple instances in parallel.

This patch changes the code to wipe the disks only after releasing the
locks.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

4a2c0db0

Apr 13, 2011

utils.WriteFile: Close file before renaming · a9d68e40

Michael Hanselmann authored 14 years ago

Issue 154 (http://code.google.com/p/ganeti/issues/detail?id=154)
reported an “Operation not supported” error when writing instance
exports to a mounted CIFS filesystem. Experimentation showed the error
to only occur when using rename(2) on an opened file. Various references
on the web confirmed this observation. Whether or not the problem occurs
can also depend on the CIFS server implementation. In issue 154 it was
Windows 2008 R2.

While not solving all cases, closing the file before renaming helps
alleviate the issue a bit. Unittests are updated.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

a9d68e40
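
The ordering that a9d68e40 describes (write, close, then rename) looks roughly like this in Python 3. It is a simplified sketch with a hypothetical function name, not utils.WriteFile itself, and it leaves out the ownership, mode and backup handling a real implementation would need.

```python
import os
import tempfile


def write_file(path, data):
    """Replace path with data (bytes), closing the file before renaming."""
    directory = os.path.dirname(path) or "."
    fd, tmp_name = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        # The "with" block has closed the file; only now rename it, so
        # rename(2) is never called on a file that is still open.
        os.rename(tmp_name, path)
    except BaseException:
        # Best-effort cleanup of the temporary file on any failure.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

The temporary file is created in the target directory so that the rename stays within one filesystem.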

Fix distcheck · 154d7ba5

Michael Hanselmann authored 14 years ago


README is not copied to the build tree.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

154d7ba5

Nicer formatting for group query error · accbf5e3

Michael Hanselmann authored 14 years ago


Before this patch the message would look like “Some groups do not exist:
[u'foo', u'bar']”, now it's “Some groups do not exist: foo, bar”.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

accbf5e3
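
The difference described in accbf5e3 comes down to joining the names rather than formatting the list object itself. A tiny Python 3 sketch, with a hypothetical helper name:

```python
def format_missing_groups(names):
    """Build the error text from group names joined with commas."""
    # "%s" % names would embed the list repr (brackets and quotes);
    # joining the names yields plain text instead.
    return "Some groups do not exist: %s" % ", ".join(names)
```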

gnt-instance.8: Fix wrongly formatted title · 69d1b79d

Michael Hanselmann authored 14 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

69d1b79d

Apr 08, 2011

Update version in README · 9488fd1d

Michael Hanselmann authored 14 years ago


Also add a check to Makefile's check-local target.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

9488fd1d

Apr 07, 2011

Merge branch 'stable-2.4' into devel-2.4 · 76ae1d65

Michael Hanselmann authored 14 years ago


* stable-2.4:
  Add error checking and merging for cluster params
  Clarify --force-join parameter message
  Treat empty oob_program param as default
  Fix bug in instance listing with orphan instances
  Fix bug related to log opening failures
  Bump version for 2.4.1 release
  cfgupgrade: Fix critical bug overwriting RAPI users file

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

76ae1d65

Apr 06, 2011

LUInstanceQueryData: Don't acquire locks unless requested · dae661a4

Michael Hanselmann authored 14 years ago


Until now LUInstanceQueryData always acquired locks for the instance(s)
and nodes involved. In combination with long-running operations this
prevented the use of “gnt-instance info”, even with the “--static”
option. With this patch, locks are only acquired when explicitly
requested in the opcode (like all query operations).

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

dae661a4

Increase the lock timeouts before we block-acquire · d385a174

Iustin Pop authored 14 years ago


This has been observed to cause problems on real clusters via the
following mechanism:

- a long job (e.g. a replace-disks) is keeping an exclusive lock on an
  instance
- the watcher starts and submits its query instances opcode which
  wants shared locks for all instances
- after about an hour, the watcher job falls back to blocking acquire,
  after having acquired all other locks
- any instance opcode that wants an exclusive lock for an instance
  cannot start until the watcher has finished, even though there's no
  actual operation on that instance

In order to alleviate this problem, we simply increase the max timeout
until lock acquires are sent back to either blocking acquire or
priority increase. The timeout is computed such that we wait ~10 hours
(instead of one) for this to happen, which should be within the
maximum lifetime of a reasonable opcode on a healthy cluster. The
timeout also means that priority increases will happen every half hour.

We also increase the max wait interval to 15 seconds, otherwise we'd
have too many retries with the increased interval.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

d385a174

Apr 04, 2011

daemon.py: move startup log message before prep_fn · fe295df3

Iustin Pop authored 14 years ago


Before this, the output in the rapi daemon log was:
2011-04-04 03:09:51,026: ganeti-rapi pid=17447 INFO Reading users file
at /var/lib/ganeti/rapi/users
2011-04-04 03:09:51,027: ganeti-rapi pid=17447 INFO ganeti-rapi daemon
startup

Which is confusing, as it might look like the read of the users file
is part of the previous run. This is because we log the 'daemon
startup' message after the prepare_fn, which can log things on its
own.

The patch simply moves the 'daemon startup' message just before
prepare_fn call.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

fe295df3

Display the actual memory values in N+1 failures · 0942620b

Iustin Pop authored 14 years ago


This changes the display from:
Mon Apr  4 02:29:46 2011 * Verifying N+1 Memory redundancy
Mon Apr  4 02:29:46 2011   - ERROR: node node2: not enough memory to
accomodate instance failovers should node node1 fail

To:

Mon Apr  4 02:32:50 2011 * Verifying N+1 Memory redundancy
Mon Apr  4 02:32:50 2011   - ERROR: node node2: not enough memory to
accomodate instance failovers should node node1 fail (33536MiB needed,
27910MiB available)

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

0942620b

Mar 31, 2011

ssh.VerifyNodeHostname: remove the quiet flag · ebcd61bb

Iustin Pop authored 14 years ago


This is not needed for this function, and can interfere with debugging
of ssh failures.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

ebcd61bb