Commits · e3bdb1c2c46c2a6292d09e10996fb7d5cdb38be7 · itminedu / snf-ganeti

Nov 02, 2009

Some improvements to gnt-node repair-storage · 7e9c6a78

Iustin Pop authored 15 years ago


Currently the repair storage has two issues:

- down instances abort the operation, even though they should be
  ignored (it's not technically possible to know their disk status
  without activating their disks)
- if the VG is so broken that disks cannot be activated via gnt-instance
  activate-disks or gnt-instance startup, it's not possible to repair
  the VG at all

The patch makes the opcode skip down instances and also introduces an
``--ignore-consistency`` flag for forcing the execution of the LU.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

7e9c6a78

Oct 13, 2009

opcodes: Add missing shutdown_timeout to OpRemoveInstance · fc1baca9

Michael Hanselmann authored 15 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

fc1baca9

Add timeout options to other LUs · 17c3f802

Guido Trotter authored 15 years ago


All the LUs that shut down the instance need to be able to pass the
timeout parameter as well.

Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

17c3f802

Oct 09, 2009

Accept shutdown timeout from the user · 6263189c

Guido Trotter authored 15 years ago


Using the new --timeout option:

- gnt-instance shutdown is changed to accept a timeout
- the opcode is changed to hold one
- the LU is changed to optionally get one
- the rpc is changed to carry one
- the backend is changed to take it as a parameter rather than
  hardcoding it in the function

Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

6263189c

Oct 05, 2009

Add force_variant slot to Create/ReinstallInstance · 47804ec9

Guido Trotter authored 15 years ago


These two opcodes need to know whether an unknown variant must be forced
through or not.

Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Olivier Tharan <olive@google.com>

47804ec9

Sep 17, 2009

Add an error-simulation mode to cluster verify · a0c9776a

Iustin Pop authored 15 years ago


One of the issues we have in ganeti is that it's very hard to test the
error-handling paths; QA and burnin only test the OK code-path, since
it's hard to simulate errors.

LUVerifyCluster is special amongst the LUs in the fact that a) it has a
lot of error paths and b) the error paths only log the error, they don't
do any rollback or other similar actions. Thus, it's enough for this LU
to separate the testing of the error condition from the logging of the
error condition.

This patch does this by converting code blocks of the form:

  if x:
    log_error()
    [y]

into:

  log_error_if(x)
  [if x:
    y
  ]

After this change, it's simple enough to turn on logging of all errors
by adding a special case inside log_error_if: if the incoming opcode has
a special ‘debug_simulate_errors’ attribute and it's true, the error is
logged unconditionally.

Surprisingly this also turns into an absolute code reduction, since some
of the if blocks were simplified. The only downside to this patch is
that the various _VerifyX() functions are now stateful (modifying an
attribute on the LU instance) instead of returning a boolean result.

Last note: yes, this discovered some error cases in the logging.
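
A minimal Python sketch of the pattern described above, for illustration
only; it is not the code from this commit. The class VerifySketch, the
verify_node check, the 512 MB threshold and the skipped list are
hypothetical, and the flag is passed to the constructor here instead of
being read from an opcode. Only the names log_error_if and
debug_simulate_errors come from the message.

```python
class VerifySketch:
    """Hypothetical stand-in for the LU; not Ganeti's actual class."""

    def __init__(self, debug_simulate_errors=False):
        # Stands in for the opcode attribute named in the commit message.
        self.debug_simulate_errors = debug_simulate_errors
        self.bad = False      # state kept on the object, not returned
        self.messages = []
        self.skipped = []

    def log_error_if(self, cond, msg):
        # In simulation mode the error is logged regardless of cond.
        if cond or self.debug_simulate_errors:
            self.messages.append("ERROR: " + msg)
            self.bad = True

    def verify_node(self, name, free_mb):
        low = free_mb < 512   # hypothetical error condition "x"
        # Logging is separated from testing the condition...
        self.log_error_if(low, "node %s: low memory" % name)
        # ...while the follow-up action "y" stays guarded by the real test.
        if low:
            self.skipped.append(name)


lu = VerifySketch(debug_simulate_errors=True)
lu.verify_node("node1", 4096)
print(lu.messages)
```

With the flag set, the logging path runs even for a healthy node, while
the guarded follow-up action does not.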

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

a0c9776a

Introduce parseable error codes in LUVerifyCluster · 7c874ee1

Iustin Pop authored 15 years ago


Currently the output of cluster verify can be parsed for 'ERROR'
messages, but that is the only indication we get (error or no error). In
order to allow monitoring tools to separate different error conditions,
this patch introduces a new output format (“gnt-cluster verify
--error-codes”) that changes the output from human-friendly to
machine-friendly. In this mode, an error line changes from:
  ERROR: node node1: drbd minor 1 of instance inst1 is not active

to:
  ERROR:ENODEDRBD:node:node1:drbd minor 1 of instance inst1 is not active

i.e. the error message is a ‘:’-separated field, with ERROR in the first
place, the error code in the second, the object type (cluster, node,
instance) in the third, the name of the object (for nodes/instances) in
the fourth, and then the text message.
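
As an illustration of how a monitoring tool might consume this format, here
is a small Python sketch. The names VerifyError and parse_error_line are
hypothetical, and the sketch assumes that the object-name field is always
present (possibly empty for cluster-level errors) and that any other line is
informational; neither assumption is confirmed by this message.

```python
from collections import namedtuple

# Hypothetical record type for one parsed error line.
VerifyError = namedtuple(
    "VerifyError", "code object_type object_name message")


def parse_error_line(line):
    """Parse one machine-friendly line; return None for other lines."""
    # maxsplit=4 keeps any ':' inside the free-text message intact.
    parts = line.strip().split(":", 4)
    if len(parts) != 5 or parts[0] != "ERROR":
        return None
    return VerifyError(*parts[1:])


err = parse_error_line(
    "ERROR:ENODEDRBD:node:node1:drbd minor 1 of instance inst1 is not active")
print(err.code, err.object_type, err.object_name)
```

Splitting at most four times matters because the text message may itself
contain colons.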

The patch also removes some of the verbosity of the operation
(“Verifying instance X”, “Verifying node X”) since on big clusters these
informational messages can quickly fill up an entire screen. The
original behaviour can be restored via the ‘--verbose’ option.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

7c874ee1

Aug 24, 2009

Add OPMoveInstance and LUMoveInstance · 313bcead

Iustin Pop authored 15 years ago


This patch adds a basic version of LUMoveInstance. It doesn't yet
support iallocator-mode and it's implemented in old-style (non-TL) mode.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

313bcead

Aug 17, 2009

Add opcode to repair storage volumes · 76aef8fc

Michael Hanselmann authored 15 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

76aef8fc

Aug 14, 2009

Implement instance recreate-disks · bd315bfa

Iustin Pop authored 15 years ago


This can be used for a 'plain' type instance when the underlying storage
went away, to recreate the storage (and reinstall) instead of removing
the instance and readding it.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

bd315bfa

Aug 10, 2009

Post cluster initialization LU · b5f5fae9

Luca Bigliardi authored 15 years ago


Add an 'empty' logical unit to run hooks after cluster initialization.

Signed-off-by: Luca Bigliardi <shammash@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

b5f5fae9

Aug 04, 2009

Implement gnt-cluster check-disk-sizes · 60975797

Iustin Pop authored 15 years ago


This patch adds a new opcode and lu for checking disk sizes. Currently
it does only top-level disk verification, and also doesn't check
primary/secondary node size mismatches (these two are added as TODOs in
the Exec() function of the LU).

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

60975797

Implement --ignore-size in activate-disks · b4ec07f8

Iustin Pop authored 15 years ago


This patch modifies OpActivateDisks, LUActivateDisks and gnt-instance
activate-disks to support and pass this option to
_AssembleInstanceDisks.

The patch is quite trivial I think; there should be no issues from it
except if used when not needed.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

b4ec07f8

cmdlib: Add opcode to modify storage unit fields · efb8da02

Michael Hanselmann authored 15 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

efb8da02

Aug 03, 2009

Add new opcode to list physical volumes · 9e5442ce

Michael Hanselmann authored 15 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

9e5442ce

Jul 31, 2009

cmdlib: Add new opcode to migrate node · 80cb875c

Michael Hanselmann authored 15 years ago


It migrates all primary instances from the node to their secondaries.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

80cb875c

Jul 22, 2009

Add new opcode to evacuate nodes · 7ffc5a86

Michael Hanselmann authored 15 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

7ffc5a86

Jul 17, 2009

Optimize OpCode loading · 363acb1e

Iustin Pop authored 15 years ago


This patch converts the opcode loading to a pre-built map (at import
time) instead of iteration over the globals dict at each call.

Microbenchmarks show that this should be around three times faster, and
burnin still passes.
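
The idea can be sketched in Python as follows. This is an illustration
under assumptions, not the code from this commit: the class names, the
OP_ID values, OP_MAPPING and load_opcode are hypothetical, and the map is
built from __subclasses__() here purely to keep the example
self-contained.

```python
class OpCode:
    """Hypothetical base class; each subclass carries a unique OP_ID."""
    OP_ID = "OP_ABSTRACT"


class OpTestA(OpCode):
    OP_ID = "OP_TEST_A"


class OpTestB(OpCode):
    OP_ID = "OP_TEST_B"


# Built once, when the module is imported.
OP_MAPPING = {cls.OP_ID: cls for cls in OpCode.__subclasses__()}


def load_opcode(op_id):
    """Look up an opcode class with a single dict access per call."""
    try:
        return OP_MAPPING[op_id]
    except KeyError:
        raise ValueError("unknown opcode id: %r" % (op_id,))


print(load_opcode("OP_TEST_A").__name__)
```

Each call now costs one dictionary lookup instead of a scan over every
module-level name.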

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

363acb1e

Jun 19, 2009

LU execution: implement dry-run framework · 20777413

Iustin Pop authored 16 years ago


This patch adds a new (global) opcode flag 'dry_run' which, when True,
causes early exit from the LU workflow, returning a special value from
the LU object (initialized in the parent LogicalUnit class, and which if
not overridden from child LUs will be None).
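
A rough Python sketch of this control flow, for illustration only. The
names run_lu, check_prereq, exec_lu, dry_run_result, FakeOp and
LURenameThing are hypothetical; the message only states that the flag is
called 'dry_run' and that the special value defaults to None in the parent
class.

```python
class LogicalUnit:
    """Hypothetical parent class holding the default dry-run value."""

    def __init__(self, op):
        self.op = op
        self.dry_run_result = None  # child LUs may override this

    def check_prereq(self):
        pass

    def exec_lu(self):
        raise NotImplementedError


class LURenameThing(LogicalUnit):
    """Hypothetical child LU that overrides the dry-run value."""

    def check_prereq(self):
        self.dry_run_result = "would rename %s" % self.op.name

    def exec_lu(self):
        return "renamed %s" % self.op.name


def run_lu(lu):
    lu.check_prereq()
    # Early exit: prerequisites were checked, but nothing is executed.
    if getattr(lu.op, "dry_run", False):
        return lu.dry_run_result
    return lu.exec_lu()


class FakeOp:
    """Hypothetical opcode stand-in with a dry_run flag."""

    def __init__(self, name, dry_run=False):
        self.name = name
        self.dry_run = dry_run


print(run_lu(LURenameThing(FakeOp("inst1", dry_run=True))))
```

In this sketch the prerequisite checks still run in dry-run mode, so a
dry run can report failures without changing anything.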

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

20777413

Introduce __slots__ deriving in opcodes.py · 4f05fd3b

Iustin Pop authored 16 years ago


This simple patch makes every opcode extend the __slots__ of the base
opcode. This way we can add slots across all opcodes, for example
'dry-run'.
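
A minimal Python sketch of slot deriving, written as an illustration and
not copied from opcodes.py. The class names and the instance_name slot are
hypothetical; the point is only that each class builds its __slots__ from
its parent's list, so a slot added to the base appears everywhere.

```python
class BaseOpCode:
    """Hypothetical root class without any slots of its own."""
    __slots__ = []

    def __init__(self, **kwargs):
        for key, value in kwargs.items():
            # Fails with AttributeError for names that are not slots.
            setattr(self, key, value)


class OpCode(BaseOpCode):
    # A slot added here becomes available on every opcode below.
    __slots__ = BaseOpCode.__slots__ + ["dry_run"]


class OpShutdownInstance(OpCode):
    # Derive from the parent's list instead of writing a fresh one.
    __slots__ = OpCode.__slots__ + ["instance_name"]


op = OpShutdownInstance(instance_name="inst1", dry_run=True)
print(op.instance_name, op.dry_run)
```

Because no class in the chain has a __dict__, misspelled parameter names
are rejected instead of being stored silently.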

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

4f05fd3b

Jun 08, 2009

Allow modifying of default nic parameters · 5af3da74

Guido Trotter authored 16 years ago


Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

5af3da74

May 27, 2009

Add a node powercycle command · f5118ade

Iustin Pop authored 16 years ago


This (somewhat big) patch adds support for remotely rebooting the nodes
via whatever support the hypervisor has for such a concept.

For KVM/fake (and containers in the future) this just uses sysrq plus a
‘reboot’ call if the sysrq method failed. For Xen, it first tries the
above, and then Xen-hypervisor reboot (we first try sysrq since that
just requires opening a file handle, whereas xen reboot means launching
an external utility).

The user interface is:

    # gnt-node powercycle node5
    Are you sure you want to hard powercycle node node5?
    y/[n]/?: y
    Reboot scheduled in 5 seconds

The node reboots, hopefully after sending the reply. In case the clock is
broken, “time.sleep(5)” might take ages (but then I suspect SSL
negotiation wouldn't work).

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

f5118ade

May 19, 2009

Add -H/-B startup parameters to gnt-instance · d04aaa2f

Iustin Pop authored 16 years ago


This patch modifies the start instance script, opcode and logical unit
to support temporary startup parameters.

Unlike 1.2, where only the kernel arguments could be changed (which made
the feature xen-pvm specific), this version supports changing all
hypervisor and backend parameters (with appropriate checks).

This is much more flexible, and allows for example:
  - start with different, temporary kernel
  - start with different memory size

Note: in later versions, this should be extended to cover disk
parameters as well (e.g. start with drbd without flushes, start with
drbd in async mode, etc.).

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

d04aaa2f

Feb 24, 2009

Remove the extra_args parameter in instance start · 07813a9e

Iustin Pop authored 16 years ago

This patch removes the extra_args parameter and instead switches the
instance to the HV_KERNEL_ARGS hypervisor option.

This is a big change, but it's a needed cleanup: this extra parameter on
all RPC calls is not generic, and we also need to have a persistent value
here.

Reviewed-by: imsnah

07813a9e

Feb 10, 2009

Implement modification of the drained flag · c9d443ea

Iustin Pop authored 16 years ago

This patch adds LU and cli-level support for modification of the node
drained flag. It is similar to the offline changes.

Reviewed-by: imsnah

c9d443ea

Feb 06, 2009

Fix rapi job listing · ee69c97f

Iustin Pop authored 16 years ago

This patch fixes a couple of issues with the job listing:
  - in case of a non-existing job, nicely raise 404 instead of 500
  - in the job detail listing, also list the job log, the job
    timestamps, etc.
  - the opcode migrate instance was missing its description field

Reviewed-by: imsnah

ee69c97f

Feb 04, 2009

Implement lockless query operations · ec79568d

Iustin Pop authored 16 years ago

This patch adds the framework for, and enables, lockless OpQueryInstances. This
means that instances will be shown in ERROR_up or ERROR_down state, even though
this is not an error (but just an in-progress job).

The framework is implemented as follows:
  - the OpQueryInstances, OpQueryNodes and OpQueryExports opcodes take
    an additional “use_locking” flag which will denote whether to lock
    or not; this patch only implements this for LUQueryInstances
  - the luxi query functions take an additional argument use_locking
    which is passed to the master daemon, and then passed to the above
    opcodes
  - cli.py exports a new SYNC_OPT command line option which implements
    setting this flag to true
  - except for gnt-instance list, which uses this option, and for
    name-only queries (e.g. QueryNodes(fields=["names"])), all other
    callers are setting this flag to True
  - RAPI also sets the flag to True

The patch was tested with a continuous (0.2s sleep in-between)
gnt-instance list during a burnin, and no problems were observed.

Reviewed-by: ultrotter

ec79568d

Jan 20, 2009

Fix a couple of epydoc warnings · 2f907a8c

Iustin Pop authored 16 years ago

Reviewed-by: ultrotter

2f907a8c

Jan 13, 2009

Forward port the live migration from 1.2 branch · 53c776b5

Iustin Pop authored 16 years ago

This is a forward port via copy (and not a cherry-pick of individual
patches) of the latest code on the 1.2 branch related to the migration.

The changes compared to 1.2 are the fact that we don't need the
IdentifyDisks step anymore (the drbd rpc calls are independent now), and
the rpc module improvements.

Reviewed-by: ultrotter

53c776b5

Jan 12, 2009

Introduce a very simple LU to force config updates · afee0879

Iustin Pop authored 16 years ago

This LU can be used to force a push of the config in case it's needed,
for example after an upgrade to update the ssconf_release_version file.

Reviewed-by: imsnah

afee0879

Dec 08, 2008

gnt-node modify: add the offline attribute · 3a5ba66a

Iustin Pop authored 16 years ago

This patch changes gnt-node modify and the associated opcode/lu to allow
modification of the node offline attribute.

Setting a node into offline mode automatically demotes it from the
master role.

Reviewed-by: ultrotter

3a5ba66a

Dec 02, 2008

Add cluster candidate pool size parameter · 4b7735f9

Iustin Pop authored 16 years ago

This patch adds a new cluster parameter "candidate_pool_size" which
tracks the desired size of the list of nodes with the master_candidate
flag set.

Reviewed-by: imsnah

4b7735f9

Add a gnt-node modify operation · b31c8676

Iustin Pop authored 16 years ago

This patch adds the OpCode, LogicalUnit and gnt-node command for
modifying node parameters, more specifically the master candidate flag
for a node.

Reviewed-by: imsnah

b31c8676

Nov 25, 2008

Implement support for multi devices changes · 24991749

Iustin Pop authored 16 years ago

This big patch adds support for:
  - changing NIC/disks in the multi-device model
  - adding/removing NICs
  - adding/removing disks

The patch is big and not very nice; the error checking paths are not
very clear.

The biggest problem is that from a simple instance.ATTR=VAL change
(which didn't throw errors before) now we are creating and removing
disks in this LU.

Reviewed-by: imsnah

24991749

Nov 24, 2008

IAllocator: use the right hypervisor · 8cc7e742

Guido Trotter authored 16 years ago

Since the hypervisor is instance dependent we'll get one on instance creation,
and use the one in the instance config on relocation.

Reviewed-by: iustinp

8cc7e742

Nov 20, 2008

Initial multi-disk/multi-nic support · 08db7c5c

Iustin Pop authored 16 years ago

This patch adds support for multi-disk/multi-nic in:
  - instance add
  - burnin

The start/stop/failover/cluster verify work as expected. Replace disk
and grow disk are TODO.

There's also a change to gnt-job to allow dictionaries to be listed in
gnt-job info.

Reviewed-by: imsnah

08db7c5c

Oct 16, 2008

Enable gnt-cluster modify to hv/beparams · 779c15bb

Iustin Pop authored 16 years ago

This patch enables the cluster modify to change:
  - enabled hypervisor list
  - hvparams (per hypervisor)
  - beparams (only the default group)

Syntax:
  gnt-cluster modify -B vcpus=3 -H xen-pvm:no_initrd_path

Validation for parameters is somewhat missing - the individual
hypervisors will be checked for syntax and validation, but beparams
doesn't have validation yet (nowhere); it should be added here once we
have a global method (will come soon).

Reviewed-by: imsnah

779c15bb

Oct 14, 2008

grow-disk: wait until resync is completed · 6605411d

Iustin Pop authored 16 years ago

The patch adds a new ‘--no-wait-for-sync’ parameter to grow-disk similar
to the one in instance add, and changes the default to wait.

This is cleaner as at the moment when the command returns, we either
have a fully synced disk or there is an error.

This is a forward-port of rev 1183 on the 1.2 branch.

Reviewed-by: ultrotter

6605411d

Change over to beparams · 338e51e8

Iustin Pop authored 16 years ago

This big patch changes the master code to use the beparams. Errors might
have crept in, but it passes a small burnin.

Reviewed-by: ultrotter

338e51e8

Allow instance info to only query the config file · 57821cac

Iustin Pop authored 16 years ago

This patch adds a new '-s' parameter to ‘gnt-instance info’ that makes
it return only 'static' information. This is much faster, especially for
drbd instances.

This is a forward-port of rev 1570 on the ganeti-1.2 branch, resending
due to some conflicts.

Reviewed-by: imsnah

57821cac