Commits · 675e2bf5ed42ef284adb417c5940111b2c727aa7 · itminedu / snf-ganeti

Apr 19, 2011

Fix master IP activation in failover with no-voting · 675e2bf5

Iustin Pop authored 14 years ago


Thanks to net.for.hub@gmail.com for reporting this. The logic in
masterd.CheckMasterd did an early return in case of no_voting, hence
skipping the master IP activation. We just change the ifs to not
return but simply continue through the function.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
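
The control-flow change described in this commit can be sketched roughly as follows. This is an illustration only: `check_masterd`, `run_voting` and `activate_master_ip` are hypothetical names, not the actual Ganeti code.

```python
def check_masterd(no_voting, run_voting, activate_master_ip):
    """Illustrative control flow; both callables are hypothetical stand-ins."""
    if not no_voting:
        # Normal mode: the other nodes must agree before we go on.
        if not run_voting():
            return False
    # Reached in both modes. An early "return" in the no_voting branch
    # would skip this step, which is the behaviour the commit describes.
    activate_master_ip()
    return True
```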


disk wiping: fix bug in chunk size computation · 6e7f0cd9

Iustin Pop authored 14 years ago


The current wipe_chunk_size computation is doing min(int_value,
float_value). For small disks (below 10GiB), the actual formula will
result in the float value being chosen. This results in very
interesting behaviour:

Wiping disk 0, offset 102.4, chunk 102.4
Wiping disk 0, offset 204.8, chunk 102.4
…
Wiping disk 0, offset 921.6, chunk 102.4
Wiping disk 0, offset 1024.0, chunk 1.13686837722e-13

Since these are passed to dd via %d, this will result in the call to
dd specifying offset 1024 and count 0, which will fail.

We just need to enforce conversion to int, in order to not get bitten
by floating point rounding errors.

The patch also reorders some logging messages in order to log the
chunk size.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
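
A minimal sketch of the integer-enforced computation, assuming sizes in MiB. The constant names and values (`MAX_WIPE_CHUNK`, `MIN_WIPE_CHUNK_PERCENT`) and the `max(1, ...)` guard are assumptions made for this illustration, not taken from the patch.

```python
MAX_WIPE_CHUNK = 1024        # MiB; assumed value for illustration
MIN_WIPE_CHUNK_PERCENT = 10  # assumed value for illustration


def wipe_chunk_size(disk_size_mib):
    """Chunk size as an int, so later offset arithmetic stays exact."""
    chunk = min(MAX_WIPE_CHUNK,
                disk_size_mib / 100.0 * MIN_WIPE_CHUNK_PERCENT)
    # int() is the fix described above; max(1, ...) is an extra guard
    # added in this sketch so that tiny disks do not yield a zero chunk.
    return max(1, int(chunk))


def wipe_plan(disk_size_mib):
    """Return (offset, chunk) pairs covering the whole disk."""
    chunk = wipe_chunk_size(disk_size_mib)
    plan = []
    offset = 0
    while offset < disk_size_mib:
        this_chunk = min(chunk, disk_size_mib - offset)
        plan.append((offset, this_chunk))
        offset += this_chunk
    return plan
```

With integer offsets and chunks, formatting the values with %d no longer truncates anything, so the last chunk cannot degenerate into a zero count.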


Fix bug in watcher · a0aa6b49

Michael Hanselmann authored 14 years ago


If “utils.RunParts” were to raise an exception, a log message was
written and the code continued to run. Due to the exception the
“results” variable would not be defined.

Also change the code to log a backtrace (getting an exception is rather
unlikely and having a backtrace is useful) and update one comment.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
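
The pattern can be sketched as below. `run_hooks` and its `run_parts` argument are hypothetical stand-ins (the latter for something like utils.RunParts); the point is only that the code using `results` must not run after a failure.

```python
import logging


def run_hooks(run_parts, hooks_dir):
    """Run hooks and return their results, or an empty list on failure."""
    try:
        results = run_parts(hooks_dir)
    except Exception:
        # logging.exception records the message together with a backtrace.
        logging.exception("Running hooks in %s failed", hooks_dir)
        # Return early: "results" was never assigned on this path.
        return []
    return list(results)
```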


Apr 14, 2011

Release locks before wiping disks during instance creation · 4a2c0db0

Michael Hanselmann authored 14 years ago


Ganeti 2.3 introduced an optional feature to overwrite an instance's
disks on creation. Unfortunately the code kept all locks while doing the
wipe, slowing down the creation of multiple instances in parallel.

This patch changes the code to wipe the disks only after releasing the
locks.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
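
As a rough sketch of the ordering change, assuming a single `threading.Lock` in place of Ganeti's lock manager; `create_instance`, `allocate_disks` and `wipe_disks` are hypothetical names.

```python
import threading


def create_instance(lock, allocate_disks, wipe_disks):
    """Allocate disks under the lock, then wipe them without holding it."""
    with lock:
        disks = allocate_disks()
    # The lock is released here, so the slow wipe no longer blocks
    # other jobs that are waiting for the same lock.
    wipe_disks(disks)
    return disks
```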


Apr 13, 2011

utils.WriteFile: Close file before renaming · a9d68e40

Michael Hanselmann authored 14 years ago

Issue 154 (http://code.google.com/p/ganeti/issues/detail?id=154)
reported an “Operation not supported” error when writing instance
exports to a mounted CIFS filesystem. Experimentation showed the error
to only occur when using rename(2) on an opened file. Various references
on the web confirmed this observation. Whether or not the problem occurs
can also depend on the CIFS server implementation. In issue 154 it was
Windows 2008 R2.

While not solving all cases, closing the file before renaming helps
alleviate the issue a bit. Unittests are updated.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
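
A minimal sketch of the close-then-rename order, not the actual utils.WriteFile implementation; `write_file` is a hypothetical helper.

```python
import os
import tempfile


def write_file(path, data):
    """Write data to a temporary file, close it, then rename it into place."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp_file:
            tmp_file.write(data)
            tmp_file.flush()
            os.fsync(tmp_file.fileno())
        # The file is closed at this point. Renaming only afterwards
        # avoids calling rename(2) on a file that is still open.
        os.replace(tmp_path, path)  # uses rename(2) on POSIX systems
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```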


Fix distcheck · 154d7ba5

Michael Hanselmann authored 14 years ago


README is not copied to the build tree.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


Nicer formatting for group query error · accbf5e3

Michael Hanselmann authored 14 years ago


Before this patch the message would look like “Some groups do not exist:
[u'foo', u'bar']”, now it's “Some groups do not exist: foo, bar”.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
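
The difference comes down to joining the names instead of formatting the list itself. A small sketch, with `format_missing_groups` as a hypothetical helper name:

```python
def format_missing_groups(names):
    """Join the names rather than printing the list's repr()."""
    return "Some groups do not exist: %s" % ", ".join(names)
```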


gnt-instance.8: Fix wrongly formatted title · 69d1b79d

Michael Hanselmann authored 14 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


Apr 08, 2011

Update version in README · 9488fd1d

Michael Hanselmann authored 14 years ago


Also add a check to Makefile's check-local target.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


Apr 07, 2011

Merge branch 'stable-2.4' into devel-2.4 · 76ae1d65

Michael Hanselmann authored 14 years ago


* stable-2.4:
  Add error checking and merging for cluster params
  Clarify --force-join parameter message
  Treat empty oob_program param as default
  Fix bug in instance listing with orphan instances
  Fix bug related to log opening failures
  Bump version for 2.4.1 release
  cfgupgrade: Fix critical bug overwriting RAPI users file

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


Apr 06, 2011

LUInstanceQueryData: Don't acquire locks unless requested · dae661a4

Michael Hanselmann authored 14 years ago


Until now LUInstanceQueryData always acquired locks for the instance(s)
and nodes involved. In combination with long-running operations this
prevented the use of “gnt-instance info”, even with the “--static”
option. With this patch, locks are only acquired when explicitly
requested in the opcode (like all query operations).

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


Increase the lock timeouts before we block-acquire · d385a174

Iustin Pop authored 14 years ago


This has been observed to cause problems on real clusters via the
following mechanism:

- a long job (e.g. a replace-disks) is keeping an exclusive lock on an
  instance
- the watcher starts and submits its query instances opcode which
  wants shared locks for all instances
- after about an hour, the watcher job falls back to blocking acquire,
  after having acquired all other locks
- any instance opcode that wants an exclusive lock for an instance
  cannot start until the watcher has finished, even though there's no
  actual operation on that instance

In order to alleviate this problem, we simply increase the maximum time
before lock acquires fall back to either a blocking acquire or a
priority increase. The timeout is computed such that we wait ~10 hours
(instead of one) for this to happen, which should be within the
maximum lifetime of a reasonable opcode on a healthy cluster. The
timeout also means that priority increases will happen every half hour.

We also increase the max wait interval to 15 seconds, otherwise we'd
have too many retries with the increased interval.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
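
One way to picture such a schedule is sketched below. The 10-hour budget and the 15-second cap come from the message above; the 1-second starting value, the 5% growth factor and the function name are assumptions for illustration and do not reproduce the actual computation.

```python
def lock_attempt_timeouts(total_budget=10 * 3600.0, first=1.0,
                          growth=1.05, max_wait=15.0):
    """Per-attempt timeouts (seconds) whose sum reaches total_budget."""
    timeouts = []
    elapsed = 0.0
    timeout = first
    while elapsed < total_budget:
        # Cap each attempt so that a single wait never exceeds max_wait.
        timeout = min(timeout, max_wait)
        timeouts.append(timeout)
        elapsed += timeout
        timeout *= growth
    return timeouts
```

A larger cap means fewer attempts are needed to fill the same total budget, which is the trade-off the last paragraph of the message refers to.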


Apr 04, 2011

daemon.py: move startup log message before prep_fn · fe295df3

Iustin Pop authored 14 years ago


Before this, the output in the rapi daemon log was:
2011-04-04 03:09:51,026: ganeti-rapi pid=17447 INFO Reading users file
at /var/lib/ganeti/rapi/users
2011-04-04 03:09:51,027: ganeti-rapi pid=17447 INFO ganeti-rapi daemon
startup

Which is confusing, as it might look like the read of the users file
is part of the previous run. This is because we log the 'daemon
startup' message after the prepare_fn, which can log things on its
own.

The patch simply moves the 'daemon startup' message just before
prepare_fn call.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>


Display the actual memory values in N+1 failures · 0942620b

Iustin Pop authored 14 years ago


This changes the display from:
Mon Apr  4 02:29:46 2011 * Verifying N+1 Memory redundancy
Mon Apr  4 02:29:46 2011   - ERROR: node node2: not enough memory to
accomodate instance failovers should node node1 fail

To:

Mon Apr  4 02:32:50 2011 * Verifying N+1 Memory redundancy
Mon Apr  4 02:32:50 2011   - ERROR: node node2: not enough memory to
accomodate instance failovers should node node1 fail (33536MiB needed,
27910MiB available)

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>


Mar 31, 2011

ssh.VerifyNodeHostname: remove the quiet flag · ebcd61bb

Iustin Pop authored 14 years ago


This is not needed for this function, and can interfere with debugging
of ssh failures.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>


Mar 28, 2011

Add error checking and merging for cluster params · a6c8fd10

Stephen Shirley authored 14 years ago


Set the default stderr logging level to WARNING so the relevant output
can be seen.

Signed-off-by: Stephen Shirley <diamond@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>


Mar 24, 2011

RAPI: Document need for Content-type header in requests · 66287fa8

Michael Hanselmann authored 14 years ago


This was added to the NEWS file in commit ab221ddf, but never
documented properly.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


Fix output for “gnt-job info” · d1b47b16

Michael Hanselmann authored 14 years ago


If the result of an opcode was a non-empty dictionary, it
would be impossible to differentiate between input and result:

  Input fields:
    […]
    debug_level: 0
    fields: cluster_name,master_node,volume_group_name
    jobs: [[True, u'37922'], [True, u'37923'], [True, u'37924']]

Expected output:

  Input fields:
    […]
    debug_level: 0
    fields: cluster_name,master_node,volume_group_name
  Result:
    jobs: [[True, u'37922'], [True, u'37923'], [True, u'37924']]

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


Mar 17, 2011

watcher: Fix misleading usage output · f0a80b01

Michael Hanselmann authored 14 years ago


When “ganeti-watcher” is called with an argument, it would hint at
a non-existing “-f” parameter. With this patch the separate usage
string is no longer necessary.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


Clarify --force-join parameter message · 50769215

Stephen Shirley authored 14 years ago


This isn't only used during cluster merge.

Signed-off-by: Stephen Shirley <diamond@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>


Mar 16, 2011

locking: Fix race condition in lock monitor · e4e35357

Michael Hanselmann authored 14 years ago


In some rare cases it can happen that a lock is re-created very soon
after deletion, while the old instance hasn't been destructed yet. In
such a case the code would detect a duplicate name and raise an
exception.

We have seen at least one case where this happened during the creation
of many instances. It is not exactly clear how it came to be, but it
appears to have occurred while different jobs fought for locks with
short timeouts (in the case of instance creation locks are added at this
stage and removed shortly after if not all locks can be acquired).

The issue is fixed by removing the check for duplicate names. To still
guarantee a stable sort order for the lock information as shown by
“gnt-debug locks”, a registration number is recorded for each lock in
the monitor.

A unittest is included to check for the situation.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
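
The idea of a registration number can be sketched as follows. The class below is a simplified, hypothetical model rather than the real lock monitor, and sorting by name first and registration number second is just one plausible ordering.

```python
import itertools


class LockMonitor:
    """Toy monitor that tolerates duplicate lock names."""

    def __init__(self):
        self._counter = itertools.count(0)
        self._locks = {}  # registration number -> lock name

    def register(self, name):
        # No duplicate-name check: a lock re-created right after deletion
        # may briefly coexist with the old, not yet destructed instance.
        regnum = next(self._counter)
        self._locks[regnum] = name
        return regnum

    def unregister(self, regnum):
        self._locks.pop(regnum, None)

    def list_locks(self):
        # The registration number breaks ties between equal names,
        # which keeps the output order stable.
        return sorted((name, regnum) for regnum, name in self._locks.items())
```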


Mar 15, 2011

utils: Export NiceSortKey function · 7d4da09e

Michael Hanselmann authored 14 years ago


The ability to split a string into a list of strings and integers can be
handy elsewhere and is necessary for sorting query results by names.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
(cherry picked from commit f47941f8)
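
A sketch of what such a key function can look like. This is not the Ganeti implementation; `nice_sort_key` is a hypothetical name and the splitting rule is an assumption.

```python
import re


def nice_sort_key(value):
    """Split a string into alternating text and integer parts."""
    # re.split with a capturing group keeps the digit runs, so the result
    # alternates text, digits, text, ... and always starts with text.
    parts = re.split(r"(\d+)", value)
    return [int(part) if part.isdigit() else part for part in parts]
```

Used as `sorted(names, key=nice_sort_key)`, this places "node2" before "node10", which plain string sorting does not.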


Mar 11, 2011

Revert "Only merge nodes that are known to not be offline" · 8864d152

Guido Trotter authored 14 years ago


This reverts commit 288f240f.

That commit was buggy at various levels:
  - broke ssh access to the second cluster, making cluster-merge
    unusable (unless ssh keys were previously set up?)
  - filtered away offline nodes from being added to the cluster config
    (wrong, they should be kept, as offline)
  - broke commit-check

The previous commit makes the code work again with what this commit
tried to achieve.

Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


cluster-merge: only operate on online nodes · 8697f0fa

Guido Trotter authored 14 years ago


The node list in MergerData is used only to:
  - stop ganeti on the nodes
  - readd the nodes to the cluster
As such offline nodes should be skipped from it.

Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


Mar 10, 2011

Only merge nodes that are known to not be offline · 288f240f

Stephen Shirley authored 14 years ago


Otherwise the readd will fail, breaking the merge.

Signed-off-by: Stephen Shirley <diamond@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


Treat empty oob_program param as default · d62ed502

Stephen Shirley authored 14 years ago


There is currently no way to reset oob_program back to its default from
the cmdline, which causes problems for cluster-merge. This patch means
that the following now works:
  gnt-cluster modify --node-parameters oob_program=

Signed-off-by: Stephen Shirley <diamond@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>


Fix bug in instance listing with orphan instances · 377972f4

Iustin Pop authored 14 years ago


Nodes can return unknown instances, so we shouldn't use the name as an
index without checking.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
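
The kind of check this refers to can be sketched as below; the function name and data shapes are hypothetical and much simpler than the real query code.

```python
def merge_live_data(known_instances, node_reported):
    """Attach live data to known instances and return orphan names."""
    orphans = []
    for name, live in node_reported.items():
        if name in known_instances:
            known_instances[name]["live"] = live
        else:
            # A node reported an instance the configuration does not
            # know about; indexing by name without this check would fail.
            orphans.append(name)
    return sorted(orphans)
```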


Fix bug related to log opening failures · c24e519e

Iustin Pop authored 14 years ago


If opening the log file fails, then we shouldn't attempt to use that
variable.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>


Mar 09, 2011

Bump version for 2.4.1 release · c199dbae

Iustin Pop authored 14 years ago


Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>


Mar 08, 2011

cfgupgrade: Fix critical bug overwriting RAPI users file · 87c80992

Michael Hanselmann authored 14 years ago


The cfgupgrade tool was designed to be idempotent, that means it could
be run several times and still produce the correct result. Ganeti
2.4 moved the file containing the RAPI users to a separate directory
(…/lib/ganeti/rapi/users). If it exists, cfgupgrade would automatically
move an existing file from …/lib/ganeti/rapi_users and replace it with a
symlink.

Unfortunately one of the checks for this was incorrect and, when run
multiple times, cfgupgrade replaced the users file at the new location
with a symlink created during a previous run.

In addition the “--dry-run” parameter to cfgupgrade was not respected.
Unittests are updated for all these cases.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
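
An idempotent move-and-symlink step that also honours a dry-run flag could look roughly like this. The sketch assumes a POSIX system; `migrate_users_file` and its arguments are hypothetical and the code is not taken from cfgupgrade.

```python
import os
import shutil


def migrate_users_file(old_path, new_path, dry_run=False):
    """Move old_path to new_path and leave a symlink behind.

    Returns True if a migration is (or would be) performed.
    """
    # Check for a symlink first: os.path.isfile() follows symlinks, so a
    # link left by a previous run would otherwise look like a regular file.
    if os.path.islink(old_path) or not os.path.isfile(old_path):
        return False
    if os.path.exists(new_path):
        raise RuntimeError("Both %s and %s exist" % (old_path, new_path))
    if dry_run:
        return True
    os.makedirs(os.path.dirname(new_path), exist_ok=True)
    shutil.move(old_path, new_path)
    os.symlink(new_path, old_path)
    return True
```

Running it a second time finds the symlink at the old location and leaves the file at the new location untouched.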


Mar 07, 2011

Release 2.4.0 · 20203756

Iustin Pop authored 14 years ago


NEWS update and version bump.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>


Merge branch 'devel-2.3' into devel-2.4 · 37aeca89

Iustin Pop authored 14 years ago


* devel-2.3:
  Fix LUClusterRepairDiskSizes and rpc result usage
  Fix RPC mismatch in blockdev_getsize[s]
  RAPI: fix evacuate node resource

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>


Small improvement to the ganeti man page · 7ba19f39

Iustin Pop authored 14 years ago


Also specifies the comma-escaping feature.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>


Merge branch 'devel-2.2' into devel-2.3 · 2ca60304

Iustin Pop authored 14 years ago


* devel-2.2:
  Fix LUClusterRepairDiskSizes and rpc result usage
  Fix RPC mismatch in blockdev_getsize[s]

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>


Mar 04, 2011

Fix LUClusterRepairDiskSizes and rpc result usage · e50d8807

Iustin Pop authored 14 years ago


This LU was introduced before the RPC result conversion from .data to
.payload, and it has managed to keep the old-style usage (how? it's
the only LU that does so). Fix by changing to payload, and add some
extra logging for easier diagnosis.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Stephen Shirley <diamond@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
(cherry picked from commit 043beb38)


Fix RPC mismatch in blockdev_getsize[s] · 4ae52cc6

Iustin Pop authored 14 years ago


Commit 92fd2250 added consistency checks in the RPC layer, which broke
the call_blockdev_getsizes RPC call (declared with 's' at the end in
rpc.py, without 's' in the node daemon).

The immediate fix is to correct the rpc function name, the long term
one will be to remove this duplication.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Stephen Shirley <diamond@google.com>
(cherry picked from commit ccfbbd2d)


RAPI: fix evacuate node resource · 63ea9789

Iustin Pop authored 14 years ago


PollJob returns the whole op_results, hence a list of opcode results.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>


Mar 02, 2011

Merge remote branch 'stable-2.4' into devel-2.4 · df1f3c62

Guido Trotter authored 14 years ago


* origin/stable-2.4:
  Fix typo in kvm-ifup script
  NEWS: Replace smartquotes, start lines with uppercase
  Update NEWS and release 2.4.0 rc3
  Fix potential data-loss bug in disk wipe routines

Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>


Fix typo in kvm-ifup script · 99e92fa0

Michael Hanselmann authored 14 years ago


Reported-by: Bas Tichelaar <bas@30loops.net>
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>


Mar 01, 2011

NEWS: Replace smartquotes, start lines with uppercase · df3df936

Michael Hanselmann authored 14 years ago


- Sphinx converts ASCII quotes ("") to smartquotes (“”) automatically
- Sentences or list items start with an uppercase letter
- Changed description of non-verbose “gnt-* list” output slightly

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
