Commits · d385a1744c144052eaade85c38dd7106d9abf371 · itminedu / snf-ganeti

Apr 06, 2011

Increase the lock timeouts before we block-acquire · d385a174

Iustin Pop authored 13 years ago


Falling back to a blocking acquire after the lock timeout has been
observed to cause problems on real clusters via the following mechanism:

- a long job (e.g. a replace-disks) is keeping an exclusive lock on an
  instance
- the watcher starts and submits its query instances opcode which
  wants shared locks for all instances
- after about an hour, the watcher job falls back to blocking acquire,
  after having acquired all other locks
- any instance opcode that wants an exclusive lock for an instance
  cannot start until the watcher has finished, even though there's no
  actual operation on that instance

In order to alleviate this problem, we simply increase the maximum time
that lock acquires keep retrying before they fall back to either a
blocking acquire or a priority increase. The timeout is computed such
that we wait ~10 hours (instead of one) for this to happen, which should
be within the maximum lifetime of a reasonable opcode on a healthy
cluster. The timeout also means that priority increases will happen
every half hour.

We also increase the max wait interval to 15 seconds; otherwise we'd
have too many retries with the increased timeout.
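
The relationship between the total timeout and the per-attempt wait can be sketched as follows. This is only an illustration, not the code changed by this commit: the function name, the starting wait of one second and the growth factor are assumptions; only the ~10 hour budget and the 15 second cap come from the message above.

```python
def lock_attempt_timeouts(total_budget=10 * 3600.0, min_wait=1.0,
                          max_wait=15.0, growth=1.5):
    """Return per-attempt timeouts whose sum reaches total_budget.

    min_wait and growth are assumptions made for this sketch; the
    ~10 hour budget and the 15 second cap are taken from the commit
    message.
    """
    timeouts = [min_wait]
    elapsed = min_wait
    while elapsed < total_budget:
        # Grow the wait, but never beyond the maximum wait interval.
        next_wait = min(timeouts[-1] * growth, max_wait)
        timeouts.append(next_wait)
        elapsed += next_wait
    return timeouts


schedule = lock_attempt_timeouts()
```

Since almost every attempt ends up waiting for the capped interval, the number of retries is roughly the total budget divided by the cap, which is why a longer total timeout calls for a larger maximum wait.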

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

d385a174

Apr 04, 2011

daemon.py: move startup log message before prep_fn · fe295df3

Iustin Pop authored 13 years ago


Before this, the output in the rapi daemon log was:
2011-04-04 03:09:51,026: ganeti-rapi pid=17447 INFO Reading users file
at /var/lib/ganeti/rapi/users
2011-04-04 03:09:51,027: ganeti-rapi pid=17447 INFO ganeti-rapi daemon
startup

Which is confusing, as it might look like the read of the users file
is part of the previous run. This is because we log the 'daemon
startup' message after the prepare_fn, which can log things on its
own.

The patch simply moves the 'daemon startup' message to just before the
prepare_fn call.
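
The intended ordering can be sketched like this. The function name and signature are hypothetical and chosen for illustration; they are not taken from daemon.py.

```python
import logging


def generic_main(daemon_name, prepare_fn, exec_fn):
    """Hypothetical daemon entry point illustrating the log ordering."""
    # Log the startup banner first, so anything prepare_fn logs is
    # clearly attributed to this run rather than the previous one.
    logging.info("%s daemon startup", daemon_name)
    prep_results = prepare_fn()
    return exec_fn(prep_results)
```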

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

fe295df3

Display the actual memory values in N+1 failures · 0942620b

Iustin Pop authored 13 years ago


This changes the display from:
Mon Apr  4 02:29:46 2011 * Verifying N+1 Memory redundancy
Mon Apr  4 02:29:46 2011   - ERROR: node node2: not enough memory to
accomodate instance failovers should node node1 fail

To:

Mon Apr  4 02:32:50 2011 * Verifying N+1 Memory redundancy
Mon Apr  4 02:32:50 2011   - ERROR: node node2: not enough memory to
accomodate instance failovers should node node1 fail (33536MiB needed,
27910MiB available)

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

0942620b

Mar 31, 2011

ssh.VerifyNodeHostname: remove the quiet flag · ebcd61bb

Iustin Pop authored 13 years ago


This is not needed for this function, and can interfere with debugging
of ssh failures.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

ebcd61bb

Mar 24, 2011

RAPI: Document need for Content-type header in requests · 66287fa8

Michael Hanselmann authored 14 years ago


This was added to the NEWS file in commit ab221ddf, but never
documented properly.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

66287fa8

Fix output for “gnt-job info” · d1b47b16

Michael Hanselmann authored 14 years ago


If the result of an opcode was a non-empty dictionary, it
would be impossible to differentiate between input and result:

  Input fields:
    […]
    debug_level: 0
    fields: cluster_name,master_node,volume_group_name
    jobs: [[True, u'37922'], [True, u'37923'], [True, u'37924']]

Expected output:

  Input fields:
    […]
    debug_level: 0
    fields: cluster_name,master_node,volume_group_name
  Result:
    jobs: [[True, u'37922'], [True, u'37923'], [True, u'37924']]

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

d1b47b16

Mar 17, 2011

watcher: Fix misleading usage output · f0a80b01

Michael Hanselmann authored 14 years ago


When “ganeti-watcher” is called with an argument, it would hint at
a non-existent “-f” parameter. With this patch the separate usage
string is no longer necessary.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

f0a80b01

Mar 16, 2011

locking: Fix race condition in lock monitor · e4e35357

Michael Hanselmann authored 14 years ago


In some rare cases it can happen that a lock is re-created very soon
after deletion, while the old instance hasn't been destructed yet. In
such a case the code would detect a duplicate name and raise an
exception.

We have seen at least one case where this happened during the creation
of many instances. It is not exactly clear how it came to be, but it
appears to have occurred while different jobs fought for locks with
short timeouts (in the case of instance creation locks are added at this
stage and removed shortly after if not all locks can be acquired).

The issue is fixed by removing the check for duplicate names. To still
guarantee a stable sort order for the lock information as shown by
“gnt-debug locks”, a registration number is recorded for each lock in
the monitor.

A unittest is included to check for the situation.
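
The registration-number idea can be sketched as below. The class and method names are hypothetical and do not mirror the actual locking module; the point is only that duplicate names are accepted and the sort order stays stable.

```python
import itertools


class Lock:
    """Stand-in lock object; only the name matters for this sketch."""

    def __init__(self, name):
        self.name = name


class LockMonitor:
    """Hypothetical monitor that tolerates duplicate lock names."""

    def __init__(self):
        self._counter = itertools.count()
        # Maps each registered lock object to its registration number.
        self._locks = {}

    def register(self, lock):
        # No duplicate-name check: a re-created lock may briefly coexist
        # with the old instance that has not been destroyed yet.
        self._locks[lock] = next(self._counter)

    def lock_info(self):
        # Sort by name first, then by registration number, so entries
        # with the same name still come out in a stable order.
        ordered = sorted(self._locks.items(),
                         key=lambda item: (item[0].name, item[1]))
        return [(lock.name, regnum) for lock, regnum in ordered]
```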

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

e4e35357

Mar 15, 2011

utils: Export NiceSortKey function · 7d4da09e

Michael Hanselmann authored 14 years ago


The ability to split a string into a list of strings and integers can be
handy elsewhere and is necessary for sorting query results by names.
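
As an illustration of such a key function, here is a minimal sketch. It is not the implementation exported from utils; the name nice_sort_key and the regular expression are assumptions.

```python
import re

_DIGITS_RE = re.compile(r"(\d+)")


def nice_sort_key(value):
    """Split value into alternating text and integer parts."""
    parts = _DIGITS_RE.split(value)
    # re.split with a capturing group puts digit runs at odd indexes.
    return [int(part) if idx % 2 else part
            for idx, part in enumerate(parts)]


names = ["node10", "node2", "node1"]
sorted_names = sorted(names, key=nice_sort_key)
```

With this key, "node2" sorts before "node10" because the numeric parts are compared as integers rather than as text.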

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
(cherry picked from commit f47941f8)

7d4da09e

Mar 11, 2011

Revert "Only merge nodes that are known to not be offline" · 8864d152

Guido Trotter authored 14 years ago


This reverts commit 288f240f.

That commit was buggy at various levels:
  - broke ssh access to the second cluster, making cluster-merge
    unusable (unless ssh keys were previously set up?)
  - filtered away offline nodes from being added to the cluster config
    (wrong, they should be kept, as offline)
  - broke commit-check

The previous commit makes the code work again with what this commit
tried to achieve.

Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

8864d152

cluster-merge: only operate on online nodes · 8697f0fa

Guido Trotter authored 14 years ago


The node list in MergerData is used only to:
  - stop ganeti on the nodes
  - readd the nodes to the cluster
As such, offline nodes should be excluded from it.

Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

8697f0fa

Mar 10, 2011

Only merge nodes that are known to not be offline · 288f240f

Stephen Shirley authored 14 years ago


Otherwise the readd will fail, breaking the merge.

Signed-off-by: Stephen Shirley <diamond@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

288f240f

Mar 07, 2011

Release 2.4.0 · 20203756

Iustin Pop authored 14 years ago


NEWS update and version bump.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

20203756

Merge branch 'devel-2.3' into devel-2.4 · 37aeca89

Iustin Pop authored 14 years ago


* devel-2.3:
  Fix LUClusterRepairDiskSizes and rpc result usage
  Fix RPC mismatch in blockdev_getsize[s]
  RAPI: fix evacuate node resource

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

37aeca89

Small improvement to the ganeti man page · 7ba19f39

Iustin Pop authored 14 years ago


Also specifies the comma-escaping feature.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

7ba19f39

Merge branch 'devel-2.2' into devel-2.3 · 2ca60304

Iustin Pop authored 14 years ago


* devel-2.2:
  Fix LUClusterRepairDiskSizes and rpc result usage
  Fix RPC mismatch in blockdev_getsize[s]

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

2ca60304

Mar 04, 2011

Fix LUClusterRepairDiskSizes and rpc result usage · e50d8807

Iustin Pop authored 14 years ago


This LU was introduced before the RPC result conversion from .data to
.payload, and it has managed to keep the old-style usage (how? it's
the only LU that does so). Fix by changing to payload, and add some
extra logging for easier diagnosis.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Stephen Shirley <diamond@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
(cherry picked from commit 043beb38)

e50d8807

Fix RPC mismatch in blockdev_getsize[s] · 4ae52cc6

Iustin Pop authored 14 years ago


Commit 92fd2250 added consistency checks in the RPC layer, which broke
the call_blockdev_getsizes RPC call (declared with 's' at the end in
rpc.py, without 's' in the node daemon).

The immediate fix is to correct the rpc function name; the long-term
one will be to remove this duplication.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Stephen Shirley <diamond@google.com>
(cherry picked from commit ccfbbd2d)

4ae52cc6

RAPI: fix evacuate node resource · 63ea9789

Iustin Pop authored 14 years ago


PollJob returns the whole op_results, hence a list of opcode results.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

63ea9789

Mar 02, 2011

Merge remote branch 'stable-2.4' into devel-2.4 · df1f3c62

Guido Trotter authored 14 years ago


* origin/stable-2.4:
  Fix typo in kvm-ifup script
  NEWS: Replace smartquotes, start lines with uppercase
  Update NEWS and release 2.4.0 rc3
  Fix potential data-loss bug in disk wipe routines

Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

df1f3c62

Fix typo in kvm-ifup script · 99e92fa0

Michael Hanselmann authored 14 years ago


Reported-by: Bas Tichelaar <bas@30loops.net>
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

99e92fa0

Mar 01, 2011

NEWS: Replace smartquotes, start lines with uppercase · df3df936

Michael Hanselmann authored 14 years ago


- Sphinx converts ASCII quotes ("") to smartquotes (“”) automatically
- Sentences or list items start with an uppercase letter
- Changed description of non-verbose “gnt-* list” output slightly

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

df3df936

Feb 28, 2011

Fix LU processor's GetECId · 3ae70d76

Michael Hanselmann authored 14 years ago


The exception was never actually raised.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Adeodato Simo <dato@google.com>

3ae70d76

Update NEWS and release 2.4.0 rc3 · 94b697b0

Iustin Pop authored 14 years ago


Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

94b697b0

Merge branch 'devel-2.4' into stable-2.4 · de039dd4

Iustin Pop authored 14 years ago


* devel-2.4:
  1-char comment typo fix
  Expand some acronyms, add to glossary
  query_unittest: Fix argument to set()

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

de039dd4

Fix potential data-loss bug in disk wipe routines · 4ecb94d5

Iustin Pop authored 14 years ago


For the 2.4 release, we only add the missing RPC calls. However, this
needs to be fixed properly, by preventing usage of mis-configured
disks.

Also add a bit more logging so that it's directly clear on which node
the wipe is being done.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

4ecb94d5

Feb 25, 2011

1-char comment typo fix · 73f1d185

Stephen Shirley authored 14 years ago


Signed-off-by: Stephen Shirley <diamond@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

73f1d185

Feb 24, 2011

Expand some acronyms, add to glossary · 3d5ebbf0

Stephen Shirley authored 14 years ago


Signed-off-by: Stephen Shirley <diamond@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

3d5ebbf0

Feb 23, 2011

query_unittest: Fix argument to set() · bacae536

René Nussbaumer authored 14 years ago


Commit e431074f introduced a bug that went uncaught. This patch fixes
it. set() expects a list or other iterable to work on, so it split the
provided instance name into a set of characters. As a result,
exp_status was never set, and the assert further below, which checks
that every status was tested, did not catch the problem.
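
The behaviour described above is easy to reproduce in isolation. The instance name used here is made up for the example and is not the value from the unit test.

```python
instance_name = "inst1"

# Passing a string iterates over its characters.
as_characters = set(instance_name)

# Wrapping the name in a list yields a one-element set.
as_single_item = set([instance_name])
```

Here as_characters contains the individual characters of the name, so a membership test for the full name fails, while as_single_item contains the name itself.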

Signed-off-by: René Nussbaumer <rn@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

bacae536

Feb 22, 2011

Fix title of query field containing instance name · f5182ecb

Michael Hanselmann authored 14 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

f5182ecb

Feb 21, 2011

Update news and bump version for 2.4.0 rc2 · e41a1c0c

Iustin Pop authored 14 years ago


Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

v2.4.0rc2

e41a1c0c

Merge branch 'devel-2.4' into stable-2.4 · b31393a1

Iustin Pop authored 14 years ago


* devel-2.4: (23 commits)
  Fix pylint warnings
  Change the list formatting to a 'special' chars
  Add support for merging node groups
  Add option to rename groups on conflict
  Fix minor docstring typo
  Fix HV/OS parameter validation on non-vm nodes
  NodeQuery: mark live fields as UNAVAIL for non-vm_capable nodes
  NodeQuery: don't query non-vm_capable nodes
  Remove superfluous redundant requirement
  Don't remove master_candidate flag from merged nodes
  Use a consistent ECID base
  listrunner: convert from getopt to optparse
  listrunner: fix agent usage
  Revert "Disable the cluster-merge tool for the moment"
  Fix cluster-merging by not stopping noded
  Fix error msg for instances on offline nodes
  Minor reordering to match param order
  cluster verify and instance disks on offline nodes
  Cluster verify and N+1 warnings for offline nodes
  Handle gnt-instance shutdown --all for empty clusters
  Use gnt-node add --force-join to add foreign nodes
  Add --force-join option to gnt-node add
  Fix iterating over node groups

Of the above commits present in the devel-2.4 branch, only the “Add
--force-join option to gnt-node add” is a potential issue, but this
has been QA-ed successfully. The other fixes are split in three
groups:

- non-core changes (cluster-merge, listrunner)
- trivial fixes (docstrings, etc.)
- bugs that we want fixed

As such, instead of cherry-picking only individual patches, I propose
that we unify stable and devel 2.4 and make a new RC out of the
result.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

b31393a1

Feb 18, 2011

Fix pylint warnings · 9b945588

Stephen Shirley authored 14 years ago


- 1 80-char line infraction
- 4 changes in how arguments are passed to logging functions
- 3 pylint disable-msg's because cluster-merge needs to access ganeti
  config internals

Signed-off-by: Stephen Shirley <diamond@google.com>
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

9b945588

TestRapiInstanceRename use instance name · 0e265161

Guido Trotter authored 14 years ago


Currently the QA rename job wrongly passes the whole info dict to the
client.

Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

0e265161

Change the list formatting to a 'special' chars · f0b1bafe

Iustin Pop authored 14 years ago


And also enable verbose display via the, well, verbose option. Man
page and tests are updated, and the formatting is moved from 4 if
statements to a data structure.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

f0b1bafe

Add support for merging node groups · 3a969900

Stephen Shirley authored 14 years ago


Signed-off-by: Stephen Shirley <diamond@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

3a969900

Add option to rename groups on conflict · 1a615be0

Stephen Shirley authored 14 years ago


Signed-off-by: Stephen Shirley <diamond@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

1a615be0

Fix minor docstring typo · fecbc0b6

Stephen Shirley authored 14 years ago


Signed-off-by: Stephen Shirley <diamond@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

fecbc0b6

Add QA rapi test for instance reinstall · 0220d2cf

Guido Trotter authored 14 years ago


This tests at least the basic case. Unfortunately, there is no way to
check all possibilities using the provided rapi client, as that will use
the new method unless the cluster doesn't support it.

Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

0220d2cf

RAPI: remove required parameters for reinstall · bd0807fe

Guido Trotter authored 14 years ago


Before c744425f, instance reinstall accepted the "os" and "nostartup"
optional query parameters. With that commit it was changed to allow
"os", "start" and "osparams" via the body rather than encoded in the
URL. Unfortunately that commit introduced a bug, which required the
"os" parameter to be passed for body requests, and at least one of "os"
or "nostartup" for query requests.

This fix makes sure all parameters are optional again.

Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

bd0807fe