Commits · fecbc0b654369d951cedb56c35ef5987ff1fbc4b · itminedu / snf-ganeti

Feb 18, 2011

Stephen Shirley authored 14 years ago


Signed-off-by: Stephen Shirley <diamond@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

fecbc0b6

Feb 17, 2011

Fix HV/OS parameter validation on non-vm nodes · 9c24736c

Iustin Pop authored 14 years ago


Currently, there is at least one LU that does wrong validation of HV
parameters (against all nodes, LUClusterSetParams). It's possible to
fix this case, but I went and modified the base functions to filter
out non-vm_capable nodes so all callers are protected.

Note: the _CheckOSParams function is never called with all nodes list,
so modifying it shouldn't be needed. However, I think it's safe to do
so (and it shouldn't hurt as an instance's node shouldn't ever lack
the vm_capable bit).

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

9c24736c

NodeQuery: mark live fields as UNAVAIL for non-vm_capable nodes · effab4ca

Iustin Pop authored 14 years ago


Since we don't have the data per design, UNAVAIL is appropriate here,
while NODATA is not.

The patch also adds a comment: if we extend the live fields list to
contain other data in the future, we need to reevaluate this solution.

This should fix issue 143. The listing now shows (node2==ofline,
node3==not vm_capable):

  Node     DTotal     DFree    MTotal     MNode     MFree Pinst Sinst
  node1    698.6G    630.5G     32.0G      1.0G     30.0G     8     7
  node2 (offline) (offline) (offline) (offline) (offline)     9     4
  node3 (unavail) (unavail) (unavail) (unavail) (unavail)     0     0

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

effab4ca

NodeQuery: don't query non-vm_capable nodes · 74f258b6

Iustin Pop authored 14 years ago


Because non-vm_capable nodes most likely don't have a hypervisor
configured and/or storage, so the call will fail anyway.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

74f258b6

Feb 09, 2011

Fix error msg for instances on offline nodes · 11dcce87

Iustin Pop authored 14 years ago


Currently, for both primary and secondary offline nodes, we give the
same message:
- ERROR: instance instance14: instance lives on offline node(s) node3
- ERROR: instance instance15: instance lives on offline node(s) node3
- ERROR: instance instance16: instance lives on offline node(s) node3
- ERROR: instance instance17: instance lives on offline node(s) node3

This is confusing, as an offline primary is in a different category
than a secondary. The patch changes the warnings to have different
error messages:
- ERROR: instance instance14: instance has offline secondary node(s) node3
- ERROR: instance instance15: instance has offline secondary node(s) node3
- ERROR: instance instance16: instance lives on offline node node3
- ERROR: instance instance17: instance lives on offline node node3

Thanks to Alexander Schreiber <als@google.com> for reporting this
issue.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Alexander Schreiber <als@google.com>

11dcce87

cluster verify and instance disks on offline nodes · a3de2ae7

Iustin Pop authored 14 years ago


Currently, cluster-verify says:

- ERROR: instance instance14: couldn't retrieve status for disk/0 on node3: node offline
- ERROR: instance instance14: instance lives on offline node(s) node3
- ERROR: instance instance15: couldn't retrieve status for disk/0 on node3: node offline
- ERROR: instance instance15: instance lives on offline node(s) node3

This is redundant as the “lives on offline node” message should be all we need to
understand the cluster situation.

The patch fixes this and also corrects a very old idiom.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Stephen Shirley <diamond@google.com>

a3de2ae7

Cluster verify and N+1 warnings for offline nodes · f7661f6b

Iustin Pop authored 14 years ago


Currently, cluster verify shows warnings N+1 warnings for offline
nodes having any redundant instances since the memory data that we
have for those nodes is zero, so any instance will trigger the
warning.

As the comment says, we already list secondary instances on offline
nodes, so that warning is enough, and we skip the N+1 one.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Stephen Shirley <diamond@google.com>

f7661f6b

Feb 08, 2011

Handle gnt-instance shutdown --all for empty clusters · c37bb2c6

Stephen Shirley authored 14 years ago


The current code gives:
Failure: prerequisites not met for this operation:
error type: wrong_input, error details:
Selection filter does not match any instances

Signed-off-by: Stephen Shirley <diamond@google.com>
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

c37bb2c6

Feb 04, 2011

Add --force-join option to gnt-node add · 61413377

Stephen Shirley authored 14 years ago


This is needed so cluster-merge can add nodes from other clusters.

Signed-off-by: Stephen Shirley <diamond@google.com>
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

61413377

Feb 03, 2011

Bump up intra-cluster import connect timeout · 81635b5a

Iustin Pop authored 14 years ago


Currently, the export timeout is 10 times 20 seconds, but the import
is only 30 seconds. I'm raising this to 60 seconds with two goals in
mind:

- when debugging manually, this allows for easier synchronisation of
  the processes
- 60 equals to 3 full 20 second intervals, which I think is better
  than just one an a half

This change shouldn't make a big difference either way (at most, it
will possibly delay the job in case of failures by half a minute).

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

81635b5a

Import-export: fix logging of daemon output · c9300bb3

Iustin Pop authored 14 years ago

In case of failures, the recent daemon output is logged as %r on a
list of unicode strings, which results in the (ugly):

Thu Feb 3 05:13:34 2011 snapshot/0 failed to send data: Exited with status 1 (recent output: [u' DUMP: Date of this level 0 dump: Thu Feb 3 05:13:18 2011', u' DUMP: Dumping /dev/mapper/6369a5f7-1e67-4d0d-a4f0-956b3649c6d7.disk0_data.snap-1 (an unlisted file system) to standard output', u' DUMP: Label: none', u' DUMP: Writing 10 Kilobyte records', u' DUMP: mapping (Pass I) [regular files]', u' DUMP: mapping (Pass II) [directories]', u' DUMP: estimated 54301 blocks.', u' DUMP: Volume 1 started with block 1 at: Thu Feb 3 05:13:19 2011', u' DUMP: dumping (Pass III) [directories]', u' DUMP: dumping (Pass IV) [regular files]', u'socat: E SSL_write(): Connection reset by peer', u"dd: dd: writing `standard output': Broken pipe", u' DUMP: Broken pipe', u' DUMP: The ENTIRE dump is aborted.'])

This patch joins this list and makes it a non-unicode string, thus
resulting in the more readable (and ~10% shorter):

Thu Feb 3 05:16:04 2011 snapshot/0 failed to send data: Exited with status 1 (recent output: DUMP: Date of this level 0 dump: Thu Feb 3 05:15:58 2011\n DUMP: Dumping /dev/mapper/6369a5f7-1e67-4d0d-a4f0-956b3649c6d7.disk0_data.snap-1 (an unlisted file system) to standard output\n DUMP: Label: none\n DUMP: Writing 10 Kilobyte records\n DUMP: mapping (Pass I) [regular files]\n DUMP: mapping (Pass II) [directories]\n DUMP: estimated 54350 blocks.\n DUMP: Volume 1 started with block 1 at: Thu Feb 3 05:15:59 2011\n DUMP: dumping (Pass III) [directories]\nsocat: E SSL_write(): Connection reset by peer\ndd: dd: writing `standard output': Broken pipe\n DUMP: Broken pipe\n DUMP: The ENTIRE dump is aborted.)

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

c9300bb3

Fix handling of ^C in the CLI scripts · 8a53b55f

Iustin Pop authored 14 years ago


This adds a message and nice handling of ^C, especially useful for
``gnt-job watch``.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

8a53b55f

backend: Disable compression in export info file · 775b8743

Michael Hanselmann authored 14 years ago


The new import/export infrastructure in Ganeti 2.2 and up handles
compression differently. It no longer writes compressed files to the
destination. Unfortunately changing this behaviour would be non-trivial,
so in the meantime setting “compression = none” will hopefully avoid
some confusion.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

775b8743

Feb 02, 2011

Reopen log files upon SIGHUP in daemons · 8cabf472

Michael Hanselmann authored 14 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

8cabf472

utils.SetupLogging: Return function to reopen log file · 9a6813ac

Michael Hanselmann authored 14 years ago


This function can be used from a SIGHUP handler to reopen log files.
Initial, simple unittests are included.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

9a6813ac

utils.SetupLogging: Make program a mandatory argument · cfcc79c6

Michael Hanselmann authored 14 years ago


It's passed in by most users (daemons, CLI scripts) and for the others
(burnin, watcher) it certainly doesn't hurt, especially when using
syslog.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

cfcc79c6

utils.log: Restrict I/O error handling coverage · aa0cc3e5

Michael Hanselmann authored 14 years ago


The I/O error will occur while opening the file, not while adding
and configuring the handler.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

aa0cc3e5

utils.log: Split formatter building into separate function · d24bc000
Michael Hanselmann authored 14 years ago
```
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
```
d24bc000

Feb 01, 2011

Enforce that new node groups have unique names · 18ffc0fe

Stephen Shirley authored 14 years ago


Signed-off-by: Stephen Shirley <diamond@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

18ffc0fe

Add _UnlockedLookupNodeGroup() · e85d8982

Stephen Shirley authored 14 years ago


This allows calling of _UnlockedLookupNodeGroup() from within
AddNodeGroup()

Signed-off-by: Stephen Shirley <diamond@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

e85d8982

Jan 31, 2011

Minor grammar fix in QuitGanetiException docstring · 7e975535

Stephen Shirley authored 14 years ago


Signed-off-by: Stephen Shirley <diamond@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

7e975535

Introduce re-openable log record handler · b6fa9a44

Michael Hanselmann authored 14 years ago


This patch adds a new log handler class based on the standard library's
BaseRotatingHandler. This new class allows the log file to be re-opened,
e.g. upon receiving a SIGHUP signal. The latter will be implemented in
forthcoming patches. The patch does not change the behaviour regarding
writing to /dev/console.

Quite a bit of code had to be changed to unittest the log handlers.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

b6fa9a44

Jan 28, 2011

Re-create instance disk symlinks on activate · c417e115

Iustin Pop authored 14 years ago


This patch implements recreation of instance disk symlinks when the
activate-disks operation is run. Until now, it was not possible to
re-create these symlinks without stopping and starting or migrating an
instance as the RPC call where this is done was in instance startup
and migration.

In order to do this, the blockdev_assemble rpc call needs the disk
index too, which is added to the protocol. This is a change from 2.3
and makes instance startup incompatible (FYI).

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

c417e115

Add RAPI resource for instance console · b82d4c5e

Michael Hanselmann authored 14 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

b82d4c5e

Export console information as query field · 5d28cb6f

Michael Hanselmann authored 14 years ago


This makes it possible to get the console information via a LUXI query.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

5d28cb6f

ConfigWriter: add checks for be/nd/nic params · 26f2fd8d

Iustin Pop authored 14 years ago


This adds checking (in the configuration) for invalid be, nd and nic
params. The code is a bit tricky as nd params are at cluster,
nodegroup and node level, nicparams are at cluster and nic level,
whereas beparams are at cluster and instance level.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

26f2fd8d

ConfigWriter: simplify _UnlockedVerifyConfig · 7e01d204

Iustin Pop authored 14 years ago


This just adds a 'cluster' local variable for reducing duplication.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

7e01d204

Add e1000 nic support for HVM · a4d9fbac

Guido Trotter authored 14 years ago


Closes issue: 130

Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

a4d9fbac

Prevent removal of last node group · 0389c42a

Stephen Shirley authored 14 years ago


- Add check in ConfigWriter to prevent last node group from being
  removed
- Tidy up error message a bit

Signed-off-by: Stephen Shirley <diamond@google.com>
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

0389c42a

Fix instance list for instances running multiple times · e431074f

René Nussbaumer authored 14 years ago


If for some reason (e.g. failed migration) one instance is running
on multiple nodes the output can become inconsistent. To get that error
and make it consistent between runs we make the call on the secondary
too and look if it's running there. If so we report the instance as
ERROR_wrongnode.

Signed-off-by: René Nussbaumer <rn@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

e431074f

Jan 27, 2011

cluster verify: add hvparams verification · 58a59652

Iustin Pop authored 14 years ago


Currently, the validity of the hypervisor parameters is only checked
at init/modification time, and not in the cluster verify. This is bad,
as it can lead to inconsistent state that is only detected when the
next modification (which can be unrelated) is made, leading to
unexpected error messages.

This patch adds both syntax verification (in masterd) and validity
verification on remote nodes. The downside of the patch is that on
clusters with many instances which have custom parameters, it will be
slow. A possible improvement would be to detect duplicate, identical
set of parameters, and collapse these into a single verification, but
that is left as a TODO (in case it becomes problematic).

An additional change is in utils.ForceDict, where we said 'key',
whereas this function is always used with parameter dicts, so I
changed it to "Unknown parameter".

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

58a59652

RAPI client: De-/activating instance disks · b680c8be

Michael Hanselmann authored 14 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

b680c8be

RAPI client: Wrap /2/redistribute-config resource · 54d4c13b

Michael Hanselmann authored 14 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

54d4c13b

Watcher: Fix endless repair tries for broken secondary · 83e5e26f

René Nussbaumer authored 14 years ago


In cases where secondary was offline and not evacuated watcher tried
to activate-disks in an endless manner, but this is useless, as the
secondary is offline and therefore not responding to this approach.

This patch skips activation of the disk if the secondary is bad but
instance up and running.

Signed-off-by: René Nussbaumer <rn@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

83e5e26f

Jan 26, 2011

Verify disks: increase parallelism and other fixes · 397693d3

Iustin Pop authored 14 years ago


The recent work on multi-VG support has converted LUClusterVerifyDisks
into doing serialised calls to each node, as each node can have
different VGs. This is suboptimal, especially for big clusters, where
this LU is executed by the watcher very often.

This patch changes the logic based on the observation that querying a
node for its VGs and then requesting a LV list for those VGs is
equivalent to simply asking for all LVs, without specifying the VG
name(s). So backend.py needs changes to accept an empty VG list, and
the LU itself partially reverts to the previous version.

Additionally, we do two other fixes to this LU:

- small improvement in getting the instance list from the config
- MapLVsByNode works for all disk types, hence no need to restrict to
  the DRBD template, especially as today we can "recreate" disks for
  plain volumes too (the warning message in gnt-cluster is updated
  too)

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

397693d3

gnt-cluster verify-disks: fix VG name · fd78c5ce

Iustin Pop authored 14 years ago


Recent multi-VG work already exports the missing LV names as vg/lv,
not simply lv. So the query and addition of the VG name in gnt-cluster
verify-disks is redundant, and even wrong for non-default-VG
instances.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

fd78c5ce

Deactivate disks: allow skipping hypervisor checks · c9c41373

Iustin Pop authored 14 years ago


In some cases (e.g. the hypervisor not running at all), we might want
to force disk deactivation, skipping the hypervisor checks. I believe
this is not a good thing to do all the time, so this patch adds the
force option to allow manual selection of this operation mode.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

c9c41373

Wait for master to become available on initialization · 3b6b6129

Michael Hanselmann authored 14 years ago


This is analogue to the existing check for a responsive node daemon.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

3b6b6129

Start all daemons on cluster initialization · 952d7515

Michael Hanselmann authored 14 years ago


At least ganeti-confd was not started. It got started a few minutes
later by ganeti-watcher. Also move one pylint disable to the effective
line.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

952d7515

Improve option descriptions · 34616379

Michael Hanselmann authored 14 years ago


Also replace hardcoded “xenvg” with constant.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

34616379