Commits · 34e71feabfc1a4516ddfd207579f997efcabb4c5 · itminedu / snf-ganeti

May 05, 2009

Fix compatibility with DRBD 8.2 · 34e71fea

Karsten Keil authored 16 years ago


This patch recognizes (and suppresses) the extra ipv4/ipv6 words that
newer DRBD versions add before the actual address.

[iustin@google.com: slightly changed the patch to conform to style
guide, and changed the commit message]
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

34e71fea
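
As a rough illustration of the DRBD 8.2 fix above, the following sketch skips an optional address-family word before parsing the address. It assumes the address arrives as a list of whitespace-separated tokens; the function name `parse_address` and the token layout are hypothetical and not taken from the Ganeti code.

```python
ADDRESS_FAMILIES = frozenset(["ipv4", "ipv6"])


def parse_address(tokens):
    """Return (host, port) from tokens such as ["ipv4", "192.0.2.1:7789"].

    Hypothetical helper: older DRBD versions are assumed to emit only the
    "host:port" token, newer ones an address-family word before it.
    """
    tokens = list(tokens)
    # Drop the optional family word so both formats parse identically.
    if tokens and tokens[0] in ADDRESS_FAMILIES:
        tokens = tokens[1:]
    if len(tokens) != 1:
        raise ValueError("unexpected address tokens: %r" % (tokens,))
    # Split on the last colon so bracketed IPv6 hosts keep their colons.
    host, _, port = tokens[0].rpartition(":")
    if not host:
        raise ValueError("missing port in %r" % tokens[0])
    return host, int(port)
```

Both the old and the new token layout then yield the same result, so callers do not need to know which DRBD version produced the data.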

RunCmd: log command line for missing cmd case · c803b052

Iustin Pop authored 16 years ago


In case of missing programs, currently utils.RunCmd doesn't show any
information to help debugging, only 'No such file or directory'. This
patch adds error handling for the ENOENT case such that at least we have
this information in the node daemon logs.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

c803b052
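
The ENOENT handling described above might look roughly like the sketch below. It is not the actual utils.RunCmd code; `run_cmd` and its return value are assumptions made for illustration, built only on the standard `subprocess`, `errno` and `logging` modules.

```python
import errno
import logging
import subprocess


def run_cmd(cmd):
    """Run cmd (a list of strings) and return (exit_code, stdout, stderr)."""
    try:
        child = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                 stderr=subprocess.PIPE)
    except OSError as err:
        if err.errno == errno.ENOENT:
            # Without this, the caller only sees "No such file or directory"
            # and cannot tell which program was missing.
            logging.error("Can't execute '%s': not found (%s)",
                          " ".join(cmd), err)
        raise
    out, err_out = child.communicate()
    return child.returncode, out, err_out
```

The exception is re-raised after logging, so callers behave as before; only the log gains the command line.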

Abstract Linux node information in hv_base · 572e52bf

Iustin Pop authored 16 years ago


Currently both hv_fake and hv_kvm implement practically identical code
to get the node information. Since future container-like hypervisors
will also need this functionality, this patch moves it into the base
class (as a separate function) which can then be called from classes
which need this info.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

572e52bf

May 04, 2009

Small optimisation in utils.WriteFile · 81b7354c

Iustin Pop authored 16 years ago


Currently we always try to remove the new file, even if the rename
succeeded. This patch tracks the existence of the new file and doesn't
try to remove it if we managed to rename it.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

81b7354c
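
The idea behind the utils.WriteFile optimisation can be sketched as follows. This is a simplified stand-in, not the real function: `write_file`, the `do_remove` flag and the temporary-file prefix are hypothetical names.

```python
import os
import tempfile


def write_file(path, data):
    """Atomically replace path with data (bytes)."""
    directory = os.path.dirname(path) or "."
    fd, new_name = tempfile.mkstemp(prefix=".new-", dir=directory)
    do_remove = True  # the temporary file exists until the rename succeeds
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        os.rename(new_name, path)
        do_remove = False  # renamed: nothing is left under new_name
    finally:
        if do_remove:
            # Only reached when writing or renaming failed.
            os.unlink(new_name)
```

Tracking the flag avoids one pointless unlink call (which would fail with ENOENT) on every successful write.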

Fix luxi serialization in ganeti-masterd · dd36d829

Iustin Pop authored 16 years ago


Currently, lib/luxi.py uses lib/serializer.py for encoding/decoding
messages, but the master daemon uses the simplejson module directly.
This is wrong, as any non-trivial change to serializer.py will break the
master daemon.

The patch changes masterd to use exactly the same functions as luxi.py
for encoding/decoding of messages.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

dd36d829

Allow gnt-debug submit-job to take multiple args · 99036060

Iustin Pop authored 16 years ago


Currently “gnt-debug submit-job” takes a single argument and has
non-trivial startup-costs; in order to exercise the job system, it is
better to be able to submit multiple jobs with a single invocation of
the script.

This patch extends it to take multiple arguments, de-serialize the
opcodes and then submit all of them as fast as possible, in order to
increase pressure on the master daemon.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Alexander Schreiber <als@google.com>

99036060

Include node name in hypervisor validation errors · d64769a8

Iustin Pop authored 16 years ago


The current validation routine just says "failed", without specifying
the node name. This is very confusing, and we should log the node name
too.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Alexander Schreiber <als@google.com>

d64769a8

Fix gnt-cluster getmaster on non-master nodes · 8eb148ae

Iustin Pop authored 16 years ago


The current implementation of “gnt-cluster getmaster” doesn't work on
non-master nodes, which is a regression from 1.2. This patch implements
it (again) via ssconf.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Alexander Schreiber <als@google.com>

8eb148ae

Apr 27, 2009

Release 2.0rc4 · d1908b41

Iustin Pop authored 16 years ago

Reviewed-by: ultrotter

v2.0.0rc4

d1908b41

Apr 24, 2009

Update gnt-instance(8) for info · d09ebf6f

Guido Trotter authored 16 years ago

Add the --all argument, and reword the basic information a bit.

Reviewed-by: iustinp

d09ebf6f

gnt-instance info --all · 220cde0b

Guido Trotter authored 16 years ago

Don't show info for all instances by default, but require --all to be
passed for this time-consuming operation.

Reviewed-by: iustinp

220cde0b

LUDiagnoseOS: change locking and error handling · a6ab004b

Iustin Pop authored 16 years ago

Since the “list OSes” call is exported via RAPI, this can be used pretty
easily to DOS the master daemon during long jobs.

The implementation of LUDiagnoseOS makes an RPC call to all nodes; we
lock nodes here in order to prevent node removal.

However, after closer examination, the worst case is:
  - we get the list of nodes from the config
  - another thread removes a node
  - our RPC queries reach the removed node

At this point, if ganeti-noded is stopped or doesn't accept our queries,
the RPC call will return failed, and in the current implementation all
OSes will become invalid.

If we change the ‘failed RPC’ handling to ignore such nodes, this allows
us to both remove locking, and to handle transient RPC failures better
(not invalidating all OSes).

This patch does both these things, with a single drawback: in gnt-os
diagnose, the down nodes do not appear at all. I think this is a small
drawback, and the alternative is to add them with status failed; this
works (3-line patch), but then the output of “list” and “diagnose” will
no longer be consistent. As such, my proposal is to not list the nodes.

Reviewed-by: ultrotter

a6ab004b
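
The changed handling of failed RPC results can be sketched like this. The data layout is an assumption for illustration: `rpc_results` maps each node name to a list of (OS name, status) tuples, or to None when the call to that node failed; `merge_os_lists` is a hypothetical helper, not the LUDiagnoseOS code.

```python
def merge_os_lists(rpc_results):
    """Map OS name -> {node: status}, skipping nodes whose RPC failed."""
    merged = {}
    for node, os_list in rpc_results.items():
        if os_list is None:
            # Failed node: ignore it instead of invalidating every OS.
            continue
        for os_name, status in os_list:
            merged.setdefault(os_name, {})[node] = status
    return merged
```

With this approach a removed or unreachable node simply does not appear in the result, which matches the drawback noted in the commit message.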

Fix verify-disks with broken volume groups · ea9ddc07

Iustin Pop authored 16 years ago

When a remote node returns invalid LVM data, we check it, but we don't
stop; instead we continue with the rest of the checks (which require a
valid volume group). This raises an internal error and breaks verify-disks.

This seems to have been unchanged for a long while; I don't know why it
surfaced only recently.

Reviewed-by: ultrotter

ea9ddc07

Prevent errors when xenvg is broken cluster verify · 9a198532

Iustin Pop authored 16 years ago

When vg_name is not returned at all, we currently abort with an internal
error. This is because we don't catch KeyError.

This patch adds a custom message for this case, and also adds KeyError
to the list of caught exceptions, just for safety.

On the other hand, we could also just remove this piece of code, since
the ["dfree"] value is not used at all.

Reviewed-by: ultrotter

9a198532
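
A minimal sketch of the pattern described above: report a missing volume group with a dedicated message, and keep KeyError in the caught exceptions as a safety net. The function `get_vg_free`, the `vglist` mapping and the `problems` list are hypothetical names, not the cluster-verify code.

```python
def get_vg_free(node, vglist, vg_name, problems):
    """Return the free space reported for vg_name on node, or None."""
    if vg_name not in vglist:
        # Custom message instead of an internal error.
        problems.append("node %s: volume group '%s' missing" % (node, vg_name))
        return None
    try:
        return int(vglist[vg_name])
    except (ValueError, TypeError, KeyError):
        # KeyError should not occur after the check above; it is listed
        # only for safety, as in the commit message.
        problems.append("node %s: invalid data for volume group '%s'"
                        % (node, vg_name))
        return None
```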

Apr 15, 2009

A bunch of doc and other small fixes · 949bdabe

Iustin Pop authored 16 years ago

This patch fixes a number of issues reported both externally and
internally:
  - missing SGML tags (Issue 54), report and patch by superdupont
  - wrong variable used in the init.d script, report and patch by
    Karsten Keil <karsten-keil@t-online.de>
  - man page for gnt-instance reinstall needs clarification (Issue 56)
  - gnt-instance man page missing --disks documentation for
    replace-disks
  - gnt-node modify help output is unclear about the -C/-D/-O input
    format, and the man page doesn't document this command at all
  - “gnt-node modify -C yes” for offline or drained nodes had wrong
    error message
  - “gnt-instance reinstall --select-os” has wrong prompt, we only
    accept a number for the OS and not the template name

Reviewed-by: ultrotter

949bdabe

Apr 14, 2009

Trivial typo fix in error message · 8c7aaa72

Alexander Schreiber authored 16 years ago

Reviewed-by: iustinp

8c7aaa72

Apr 08, 2009

Release 2.0rc3 · 5bbefdec

Iustin Pop authored 16 years ago

Burnin tests were successful, release rc3.

Reviewed-by: imsnah

v2.0.0rc3

5bbefdec

Apr 07, 2009

Distribute built documentation · 2ab2b9f5

Iustin Pop authored 16 years ago

This patch changes the way documentation is built so that the generated
output is distributed in the 'dist' archive, which removes the need for
the docbook/rst toolchains at build time. This lowers the requirements
for installation and also makes the build time insignificant.

First, we remove the docbook2pdf rules and variables, since we no longer
build this kind of docs. Furthermore, the rst source files are not
(today) processed via replace_vars_sed, so the whole .in rules for doc/
go away.

Next, we change the ".sgml|.rst -> replace_vars_sed -> .in -> processor
-> final file" processing to ".sgml|.rst -> generator -> .in ->
replace_vars_sed -> final file"; this means we first process the file
using the formatter, with the @VARIABLE@ entries in it, and save the
output as .in; this output we distribute, and on the user side, the
replace_vars_sed will use the new configure flags to transform the
(almost final .in form) to the final form, without needing the
toolchain.

In configure.ac we also change from ERROR to WARN for the documentation
generators, and extra tests in Makefile.am check that the programs have
been found.

This was tested with distcheck and works as expected.

Reviewed-by: ultrotter

2ab2b9f5

Apr 06, 2009

Disable synchronous (locking) queries · 77921a95

Iustin Pop authored 16 years ago

This patch raises an error in the master daemon in case the user
requests a locking query; accordingly, all clients were modified to send
only lockless queries. This is a short-term fix; for a proper fix, the
clients should be modified to submit a job when the user requests a
locking query.

The other approach would be to ignore the flag passed by the client;
this would be worse, as clients wouldn't even get an error.

The possible impact of this is multiple:
  - some commands could have been not converted, and thus fail; this
    can be remedied easily
  - the consistency of commands is lost; e.g. node failover will not
    lock the node *while we get the node info*, so we could miss some
    data; this is again in the thread of atomic operations which are
    missing in the current model of query-and-act from gnt-* scripts

Reviewed-by: imsnah, ultrotter

77921a95

Fix the output of watcher on non-master nodes · 2c404217

Iustin Pop authored 16 years ago

Currently the watcher spews error messages on non-master nodes. This
cleans it up.

Reviewed-by: imsnah

2c404217

Change the watcher to use jobs instead of queries · 6dfcc47b

Iustin Pop authored 16 years ago

As per the mailing list discussion, this patch changes the watcher to
use a single job (two opcodes) for getting the cluster state (node list
and instance list); it will then compute the needed actions based on
this data.

The patch also archives this job and the verify-disks job.

Reviewed-by: imsnah

6dfcc47b

Fix Xen soft reboot via polling · 7dd106d3

Iustin Pop authored 16 years ago

This patch fixes the Xen soft reboot ("xm reboot") by polling, for a
specific time, for either a changed domain ID or a decreased CPU run-time.

This should prevent the race conditions discussed on the mailing list for
reboots.

Reviewed-by: imsnah

7dd106d3
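
The polling approach described above can be sketched as a small loop. Everything here is an assumption for illustration: `get_state` is a hypothetical callable returning a (domain ID, CPU time) tuple, or None while the domain is not listed, and the default timeout and interval are arbitrary.

```python
import time


def wait_for_reboot(get_state, old_state, timeout=60.0, interval=1.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll until the domain ID changes or the CPU time decreases.

    Returns True if a reboot was detected before the timeout, else False.
    """
    old_id, old_cpu = old_state
    deadline = clock() + timeout
    while clock() < deadline:
        state = get_state()
        if state is not None:
            new_id, new_cpu = state
            # Either signal means the old domain is gone and a new one runs.
            if new_id != old_id or new_cpu < old_cpu:
                return True
        sleep(interval)
    return False
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.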

Add a new ssconf file with the cluster tags · 5d60b3bd

Iustin Pop authored 16 years ago

Since the cluster tags are/should be more-or-less static, add them as an
ssconf key, so that querying them is possible without creating a
job/requiring the masterd to be running.

Reviewed-by: imsnah

5d60b3bd

Add some more debugging info to masterd · e566ddbd

Iustin Pop authored 16 years ago

This patch will log data about queries, which are today completely
invisible (at the default log level) in the master log file.

Reviewed-by: imsnah

e566ddbd

Mar 27, 2009

Release 2.0rc2 · f06d91f2

Iustin Pop authored 16 years ago

This updates the NEWS file and bumps up the version number.

Reviewed-by: ultrotter

f06d91f2

Mar 20, 2009

Fix _NOQUOTE regexp · 8a088b79

Guido Trotter authored 16 years ago

Allow expressions longer than one character to match.

Reviewed-by: imsnah

8a088b79
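
The kind of regexp bug fixed above is easy to reproduce. The sketch below uses a made-up character class, since the real _NOQUOTE pattern is not shown in this log; only the missing `+` quantifier is the point.

```python
import re

# Hypothetical set of characters considered safe without quoting.
SAFE_CHARS = r"[-.,=:/_+@A-Za-z0-9]"

# Without a quantifier the pattern accepts single-character strings only.
ONE_CHAR = re.compile(r"^%s$" % SAFE_CHARS)

# With "+" it accepts any non-empty string made of safe characters.
ONE_OR_MORE = re.compile(r"^%s+$" % SAFE_CHARS)
```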

Mainloop: avoid calculating timeout every time · 53d47a06

Guido Trotter authored 16 years ago

Set timeout_needs_update to False after calculating the timeout.

Reviewed-by: imsnah

53d47a06
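
The effect of the Mainloop change can be shown with a toy class. `TimerLoop` and its attributes are hypothetical and much simpler than the real main loop; the sketch only demonstrates why the flag has to be reset after the timeout is computed.

```python
class TimerLoop(object):
    """Toy event loop that caches the shortest timer interval."""

    def __init__(self):
        self._timers = []  # intervals in seconds
        self._timeout = None
        self._timeout_needs_update = True
        self.recalculations = 0  # counts how often the timeout was computed

    def add_timer(self, interval):
        self._timers.append(interval)
        self._timeout_needs_update = True

    def get_timeout(self):
        if self._timeout_needs_update:
            self._timeout = min(self._timers) if self._timers else None
            self.recalculations += 1
            # Without this reset the timeout is recomputed on every call.
            self._timeout_needs_update = False
        return self._timeout
```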

Raise on invalid gnt-cluster queue commands · 2e668b38

Guido Trotter authored 16 years ago

 # gnt-cluster queue foo
 Failure: prerequisites not met for this operation:
 Command 'foo' is not valid.

Reviewed-by: iustinp

2e668b38

Mar 12, 2009

kvm: use the correct vnc bind address · 19498d6c

Guido Trotter authored 16 years ago

There is a bug in kvm, when binding vnc to a specific address the
constant 'vnc_bind_address' is passed in, instead of the actual
requested address. This patch fixes it.

Reviewed-by: iustinp

19498d6c
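
The kvm bug described above amounts to using a parameter name where its value was meant. A minimal sketch, assuming the hypervisor parameters are a plain dict and the result is a host:display string; `vnc_argument` is a hypothetical helper, and the constant below only mirrors the name that appears elsewhere in this log.

```python
# Parameter name; the value is an assumption mirroring the string in the log.
HV_VNC_BIND_ADDRESS = "vnc_bind_address"


def vnc_argument(hvparams, display=0):
    """Build a 'host:display' string from the hypervisor parameters."""
    address = hvparams[HV_VNC_BIND_ADDRESS]
    # The buggy variant would format the key itself, i.e.
    # "%s:%d" % (HV_VNC_BIND_ADDRESS, display), yielding "vnc_bind_address:0".
    return "%s:%d" % (address, display)
```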

Add the 2.0-specific node flags to the design doc · e0eb13de

Iustin Pop authored 16 years ago

This patch adds the newly-introduced node flags to the design document,
as they currently are missing from there.

The patch also reduces the TOC depth to 3, as it was too big.

Reviewed-by: ultrotter

e0eb13de

Fix the --net option to gnt-instance add · dc30b0e4

Iustin Pop authored 16 years ago

Similar to the --disk fixes a while ago, --net is broken too. This patch
fixes it.

Reviewed-by: imsnah

dc30b0e4

Mar 10, 2009

Xen: Remove one hardcoded constant · 6b405598

Guido Trotter authored 16 years ago

s/"vnc_bind_address"/constants.HV_VNC_BIND_ADDRESS/

Reviewed-by: imsnah

6b405598

Mar 09, 2009

watcher: fix startup sequence locking the master · cc962d58

Iustin Pop authored 16 years ago

Currently, the watcher startup sequence does:
  - open a luxi client
  - get the instance list
  - get the node boot ids
  - open and lock the status file, and:
    - archive jobs
    - restart the down instances
    - check disks

This, of course, can lead to problems when a node is (genuinely or not)
locked for more than (watcher interval * maximum query clients) time. At
that time, the master is completely unresponsive until the node is
unlocked and all the watchers exit with error due to the state file
being locked by the first instance.

This patch reworks the startup sequence to first open/lock the status
file, and only then open a luxi client. This should prevent the above
case.

Reviewed-by: ultrotter

cc962d58
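
The reworked startup order (lock the status file first, connect afterwards) can be sketched with `fcntl.flock`. This is an illustration under assumptions, not the watcher code: `open_and_lock`, `run_watcher`, and the `connect` and `work` callables are hypothetical.

```python
import fcntl
import os


def open_and_lock(path):
    """Open path and take an exclusive, non-blocking lock on it."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        os.close(fd)
        raise
    return fd


def run_watcher(status_path, connect, work):
    """Lock the status file first, and only then open a client."""
    # A second watcher fails here, before it ever talks to the master.
    fd = open_and_lock(status_path)
    try:
        client = connect()
        return work(client)
    finally:
        os.close(fd)  # closing the descriptor releases the lock
```

Because the lock is taken before `connect()` is called, watchers that cannot get the lock exit without adding query clients to a busy master.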

Handle ghost instances in temp DRBD map · c614e5fb

Iustin Pop authored 16 years ago

Currently cluster-verify doesn't handle the (admittedly invalid) case where we
have reservations for instances that were removed in the meantime.

This patch adds a check for this and prevents code errors in cluster-verify in
this case:
 * Verifying node node4.example.com (master candidate)
   - ERROR: ghost instance \'instance3.example.com\' in temporary DRBD map

Reviewed-by: imsnah

c614e5fb

Fix error handling in replace-disks with new node · 82759cb1

Iustin Pop authored 16 years ago

Currently the _CreateSingleBlockDev function only raises OpExecError and not
BlockDeviceError. This means that we don't release the instance's temporary
minors properly, and this creates problems later if the instance is removed
without master restart.

We could just use OpExecError, but adding it and leaving
BlockDeviceError in seems safer.

Reviewed-by: imsnah

82759cb1

Mar 06, 2009

Fix serial_no field on instances · 6f285030

Iustin Pop authored 16 years ago

The instance objects did not get a serial_no field. This patch adds a
new constant for the field name and uses it for all three cases
(cluster, nodes, instances).

Reviewed-by: imsnah

6f285030

Mar 05, 2009

Update gnt-cluster(8) for be/hyp parameter syntax · 555918b3

Guido Trotter authored 16 years ago

Now it displays:

--hypervisor-parameters hypervisor:hv-param=value [ ,hv-param=value ... ]
--backend-parameters be-param=value [ ,be-param=value ... ]

Sorry for the super-long lines :( Is there a better way to insert spaces
without pushing them to the resulting man page?

Reviewed-by: iustinp

555918b3

Mar 04, 2009

Complete the cfgupgrade script for 2.0 migrations · ac4d25b6

Iustin Pop authored 16 years ago

This patch makes the cfgupgrade script handle:
  - instance changes
  - disk changes
  - further cluster fixes
  - adds configuration checks at the end, in non-dry-run mode

Reviewed-by: ultrotter

ac4d25b6

First run at cfgupgrade for 2.0 upgrades · a421fdeb

Iustin Pop authored 16 years ago

This patch makes cfgupgrade work on an empty cluster (i.e. no instances),
up to the point that the config file can be converted from 1.2 to 2.0.
This is not yet complete, though.

Reviewed-by: ultrotter

a421fdeb

Fix bash completion for cluster copyfile/command · 75615bd3

Iustin Pop authored 16 years ago

“copyfile” takes a file argument, so we enable file-completion for it.
“gnt-cluster command” takes a command, so we enable command completion.

Reviewed-by: imsnah

75615bd3