  1. Mar 10, 2010
  2. Mar 09, 2010
    • Rework the node modify for mc-demotion · 601908d0
      Iustin Pop authored
      
      The current code in LUSetNodeParms regarding the demotion from master
      candidate role is complicated and duplicates the code in ConfigWriter,
      where such decisions should be made. Furthermore, we still cannot demote
      nodes (not even with force), even when other regular nodes exist.
      
      This patch adds a new opcode attribute ‘auto_promote’, and changes the
      decision tree as follows:
      
      - if the node will be set to offline or drained or explicitly demoted
        from master candidate, and this parameter is set, then we lock all
        nodes in ExpandNames()
      - later, in CheckPrereq(), if the node is
        indeed a master candidate, and the future state (as computed via
        GetMasterCandidateStats with the current node in the exception list)
        has fewer nodes than it should, and we didn't lock all nodes, we exit
        with an exception
      - in Exec, if we locked all nodes, we do an AdjustCandidatePool() run, to
        ensure nodes are promoted as needed (we do it before updating the node,
        to remove a warning and to avoid being left with an inconsistent state
        if the LU fails between these two steps)
      
      Note that in Exec we run the AdjustCP irrespective of any node state
      change (just based on lock status), so we might simplify the CheckPrereq
      even more by not checking the future state, basically requiring
      auto_promote/lock_all for master candidates, since the case where we
      have more master candidates than needed is rarer; OTOH, this would prevent
      promoting another node manually ahead of time, which is why I didn't
      choose this way.
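
      A minimal sketch of the CheckPrereq-time decision described above, with
      hypothetical names (mc_now/mc_should stand in for the values returned by
      GetMasterCandidateStats); this is an illustration, not the actual Ganeti
      code:

        class PrereqError(Exception):
          pass

        def check_demotion(is_master_candidate, demoting, mc_now, mc_should,
                           locked_all_nodes):
          # mc_now/mc_should are computed with the node being modified already
          # in the exception list, i.e. they describe the future state.
          if not (is_master_candidate and demoting):
            return
          if mc_now < mc_should and not locked_all_nodes:
            # Too few candidates would remain and we cannot promote a
            # replacement, since the all-node locks were not taken.
            raise PrereqError("demotion would leave too few master candidates;"
                              " use auto_promote to promote another node")

        # Enough candidates remain, so this passes silently:
        check_demotion(True, True, mc_now=3, mc_should=3, locked_all_nodes=False)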
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>
    • Add support for per-os-hypervisor parameters · 17463d22
      René Nussbaumer authored
      
      This patch implements all modifications to support per-os-hypervisor
      parameters in the framework.
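
      As a hedged illustration (the names and values below are made up, not the
      framework's actual structures), the per-OS overrides can be pictured as a
      nested mapping of OS name to hypervisor to parameters, merged on top of
      the cluster-wide hypervisor defaults:

        cluster_hvparams = {
          "xen-pvm": {"kernel_path": "/boot/vmlinuz-xenU",
                      "root_path": "/dev/sda1"},
        }

        os_hvp = {
          "debootstrap": {
            "xen-pvm": {"root_path": "/dev/xvda1"},  # per-OS override
          },
        }

        def effective_hvparams(hypervisor, os_name):
          params = dict(cluster_hvparams.get(hypervisor, {}))
          params.update(os_hvp.get(os_name, {}).get(hypervisor, {}))
          return params

        print(effective_hvparams("xen-pvm", "debootstrap"))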
      
      Signed-off-by: René Nussbaumer <rn@google.com>
      Signed-off-by: Michael Hanselmann <hansmi@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>
  3. Feb 22, 2010
  4. Feb 11, 2010
  5. Feb 10, 2010
    • Fix dumpers/loaders after __slots__ cleanup · adf385c7
      Iustin Pop authored
      
      Commit 154b9580 changed (correctly) the __slots__ usage, but this broke
      dumpers/loaders since we relied directly on the class's own __slots__
      field.
      
      To compensate, we introduce a simple function for computing the slots
      across all parent classes (if any), and use this instead of __slots__
      directly.
      
      Note: the _all_slots() function is duplicated between objects.py and
      opcodes.py, but the only other option is to introduce a lang.py for
      such very basic language items.
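
      A hedged sketch of what such a helper can look like (the real function in
      objects.py/opcodes.py may differ in detail):

        def _all_slots(obj):
          # Collect __slots__ from the class and all its ancestors.
          slots = []
          for cls in type(obj).__mro__:
            slots.extend(getattr(cls, "__slots__", ()))
          return slots

        class Base(object):
          __slots__ = ["name"]

        class Child(Base):
          __slots__ = ["size"]

        print(_all_slots(Child()))  # ['size', 'name'], following the MRO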
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>
  6. Feb 09, 2010
    • Add an early release lock/storage for disk replace · 7ea7bcf6
      Iustin Pop authored
      
      This patch adds an early_release parameter in the OpReplaceDisks and
      OpEvacuateNode opcodes, allowing earlier release of storage and more
      importantly of internal Ganeti locks.
      
      The behaviour of the early release is that any locks and storage on all
      secondary nodes are released early. This is valid for change secondary
      (where we remove the storage on the old secondary, and release the locks
      on the old and new secondary) and replace on secondary (where we remove
      the old storage and release the lock on the secondary node).
      
      Using this, on a three-node setup:
      
      - instance1 on nodes A:B
      - instance2 on nodes C:B
      
      It is possible to run in parallel a replace-disks -s (on secondary) for
      instances 1 and 2.
      
      Replace on primary will remove the storage, but not the locks, as we use
      the primary node later in the LU to check consistency.
      
      It is debatable whether to also remove the locks on the primary node,
      and thus making replace-disks keep zero locks during the sync. While
      this would allow greatly enhanced parallelism, let's first see how
      removal of secondary locks works.
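
      A hedged sketch of the ordering this enables for the replace-on-secondary
      case (the step names are illustrative, not the real LU helpers):

        def replace_secondary_steps(early_release):
          steps = ["create new storage", "wait for DRBD resync"]
          if early_release:
            # Storage and the secondary-node locks go away right after the
            # sync, so other jobs touching those nodes can start immediately.
            steps += ["remove old storage on secondary",
                      "release secondary node locks",
                      "finish LU (primary lock still held)"]
          else:
            steps += ["finish LU",
                      "remove old storage on secondary",
                      "release all locks at job end"]
          return steps

        print(replace_secondary_steps(early_release=True))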
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
  7. Jan 27, 2010
  8. Dec 16, 2009
  9. Nov 03, 2009
  10. Nov 02, 2009
    • Some improvements to gnt-node repair-storage · 7e9c6a78
      Iustin Pop authored
      
      Currently the repair-storage operation has two issues:
      
      - down instances are aborting the operation, even though they should be
        ignored (it's not technically possible to know their disk status
        unless we activate their disks)
      - if the VG is so broken that disks cannot be activated via gnt-instance
        activate-disks or gnt-instance startup, it's not possible to repair
        the VG at all
      
      The patch makes the opcode skip down instances and also introduces an
      ``--ignore-consistency`` flag for forcing the execution of the LU.
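
      A hedged sketch of the resulting prereq logic (hypothetical names, not
      the actual LU code):

        class PrereqError(Exception):
          pass

        def check_instances(instances, ignore_consistency):
          # instances: list of (name, is_up, disks_consistent) tuples
          for name, is_up, consistent in instances:
            if not is_up:
              continue  # down instances are skipped; their disk status is unknown
            if not consistent and not ignore_consistency:
              raise PrereqError("instance %s has inconsistent disks; use"
                                " --ignore-consistency to repair anyway" % name)

        check_instances([("inst1", False, False), ("inst2", True, True)],
                        ignore_consistency=False)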
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>
  11. Oct 13, 2009
  12. Oct 09, 2009
  13. Oct 05, 2009
  14. Sep 17, 2009
    • Add an error-simulation mode to cluster verify · a0c9776a
      Iustin Pop authored
      
      One of the issues we have in ganeti is that it's very hard to test the
      error-handling paths; QA and burnin only test the OK code-path, since
      it's hard to simulate errors.
      
      LUVerifyCluster is special amongst the LUs in that a) it has a
      lot of error paths and b) the error paths only log the error, they don't
      do any rollback or other similar actions. Thus, it's enough for this LU
      to separate the testing of the error condition from the logging of the
      error condition.
      
      This patch does this by replacing code blocks of the form:
      
        if x:
          log_error()
          [y]
      
      into:
      
        log_error_if(x)
        [if x:
          y
        ]
      
      After this change, it's simple enough to turn on logging of all errors
      by adding a special case inside log_error_if such that if the incoming
      opcode has a special ‘debug_simulate_errors’ attribute and it's true, it
      will log the error unconditionally.
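
      A hedged sketch of such a log_error_if-style helper (names illustrative;
      the real code keeps its state on the LU instance rather than in a plain
      list):

        def make_error_if(simulate_errors, errors):
          def _error_if(condition, message):
            # With debug_simulate_errors set, every call logs, so all error
            # paths get exercised regardless of the actual condition.
            if condition or simulate_errors:
              errors.append(message)
          return _error_if

        errors = []
        log_error_if = make_error_if(simulate_errors=True, errors=errors)
        log_error_if(False, "node node1: some check failed")
        print(errors)  # logged even though the condition was False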
      
      Surprisingly this also turns into an absolute code reduction, since some
      of the if blocks were simplified. The only downside to this patch is
      that the various _VerifyX() functions are now stateful (modifying an
      attribute on the LU instance) instead of returning a boolean result.
      
      Last note: yes, this discovered some error cases in the logging.
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>
    • Introduce parseable error codes in LUVerifyCluster · 7c874ee1
      Iustin Pop authored
      
      Currently the output of cluster verify can be parsed for 'ERROR'
      messages, but that is the only indication we get (error or no error). In
      order to allow monitoring tools to separate different error conditions,
      this patch introduces a new output format (“gnt-cluster verify
      --error-codes”) that changes the output from human-friendly to
      machine-friendly. In this mode, an error line changes from:
        ERROR: node node1: drbd minor 1 of instance inst1 is not active
      
      to:
        ERROR:ENODEDRBD:node:node1:drbd minor 1 of instance inst1 is not active
      
      i.e. the error line is a ‘:’-separated list of fields, with ERROR in the first
      place, the error code in the second, the object type (cluster, node,
      instance) in the third, the name of the object (for nodes/instances) in
      the fourth, and then the text message.
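
      A hedged example of consuming this format (field layout exactly as
      described above):

        line = ("ERROR:ENODEDRBD:node:node1:"
                "drbd minor 1 of instance inst1 is not active")
        tag, code, obj_type, obj_name, message = line.split(":", 4)
        assert tag == "ERROR"
        print(code, obj_type, obj_name, "->", message)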
      
      The patch also removes some of the verbosity of the operation
      (“Verifying instance X”, “Verifying node X”) since on big clusters these
      informational messages can quickly fill up an entire screen. The
      original behaviour can be restored via the ‘--verbose’ option.
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>
  15. Aug 24, 2009
  16. Aug 17, 2009
  17. Aug 14, 2009
  18. Aug 10, 2009
  19. Aug 04, 2009
  20. Aug 03, 2009
  21. Jul 31, 2009
  22. Jul 22, 2009
  23. Jul 17, 2009
  24. Jun 19, 2009
  25. Jun 08, 2009
  26. May 27, 2009
    • Add a node powercycle command · f5118ade
      Iustin Pop authored
      
      This (somewhat big) patch adds support for remotely rebooting the nodes
      via whatever support the hypervisor has for such a concept.
      
      For KVM/fake (and containers in the future) this just uses sysrq plus a
      ‘reboot’ call if the sysrq method fails. For Xen, it first tries the
      above, and then Xen-hypervisor reboot (we first try sysrq since that
      just requires opening a file handle, whereas xen reboot means launching
      an external utility).
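
      A hedged sketch of the sysrq-then-fallback idea for the KVM/fake case
      (the path and command are illustrative; calling this on a real host would
      reboot it):

        import subprocess
        import time

        def powercycle_node():
          # Give the RPC reply a chance to reach the caller before going down.
          time.sleep(5)
          try:
            with open("/proc/sysrq-trigger", "w") as trigger:
              trigger.write("b")  # magic sysrq: immediate reboot
          except EnvironmentError:
            # sysrq unavailable or not permitted: fall back to the utility.
            subprocess.call(["reboot", "-f"])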
      
      The user interface is:
      
          # gnt-node powercycle node5
          Are you sure you want to hard powercycle node node5?
          y/[n]/?: y
          Reboot scheduled in 5 seconds
      
      The node hopefully reboots after sending the reply. In case the clock is
      broken, “time.sleep(5)” might take ages (but then I suspect SSL
      negotiation wouldn't work).
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
  27. May 19, 2009
    • Add -H/-B startup parameters to gnt-instance · d04aaa2f
      Iustin Pop authored
      
      This patch modifies the start instance script, opcode and logical unit
      to support temporary startup parameters.
      
      Different from 1.2, where only the kernel arguments supported changes
      (and were thus xen-pvm specific), this version supports changing all
      hypervisor and backend parameters (with appropriate checks).
      
      This is much more flexible, and allows for example:
        - start with a different, temporary kernel
        - start with a different memory size
      
      Note: in later versions, this should be extended to cover disk
      parameters as well (e.g. start with drbd without flushes, start with
      drbd in async mode, etc.).
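
      A hedged illustration of what "temporary" means here (names and values
      are made up): the overrides are merged over the stored instance
      parameters for this startup only and are not written back to the config:

        stored_beparams = {"memory": 512, "vcpus": 1}
        override_beparams = {"memory": 2048}  # what -B would carry

        stored_hvparams = {"kernel_args": "ro"}
        override_hvparams = {"kernel_args": "ro single"}  # what -H would carry

        def fill(defaults, overrides):
          merged = dict(defaults)
          merged.update(overrides)
          return merged

        print(fill(stored_beparams, override_beparams))
        print(fill(stored_hvparams, override_hvparams))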
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
  28. Feb 24, 2009
    • Remove the extra_args parameter in instance start · 07813a9e
      Iustin Pop authored
      This patch removes the extra_args parameter and instead switches the
      instance to the HV_KERNEL_ARGS hypervisor option.
      
      This is a big change, but it's a needed cleanup: this extra parameter on
      all RPC calls is not generic, and we also need to have a persistent value
      here.
      
      Reviewed-by: imsnah
  29. Feb 10, 2009
  30. Feb 06, 2009
    • Fix rapi job listing · ee69c97f
      Iustin Pop authored
      This patch fixes a couple of issues with the job listing:
        - in case of a non-existing job, nicely raise 404 instead of 500
        - in the job detail listing, also list the job log, the job
          timestamps, etc.
        - the opcode migrate instance was missing its description field
      
      Reviewed-by: imsnah
  31. Feb 04, 2009
    • Implement lockless query operations · ec79568d
      Iustin Pop authored
      This patch adds the framework for, and enables, lockless OpQueryInstances. This
      means that instances will be shown in ERROR_up or ERROR_down state, even though
      this is not an error (but just an in-progress job).
      
      The framework is implemented as follows:
        - the OpQueryInstances, OpQueryNodes and OpQueryExports opcodes take
          an additional “use_locking” flag which will denote whether to lock
          or not; this patch only implements this for LUQueryInstances
        - the luxi query functions take an additional argument use_locking
          which is passed to the master daemon, and then passed to the above
          opcodes
        - cli.py exports a new SYNC_OPT command line option which implements
          setting this flag to true
        - except for gnt-instance list, which uses this option, and for
          name-only queries (e.g. QueryNodes(fields=["names"])), all other
          callers are setting this flag to True
        - RAPI also sets the flag to True
      
      The patch was tested with a continuous (0.2s sleep in-between)
      gnt-instance list during a burnin, and no problems were observed.
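
      A hedged sketch of the caller-side choice (the function below is a
      stand-in, not the real luxi client API):

        def query_instances(names, fields, use_locking):
          # With use_locking=False the master answers from its in-memory
          # configuration without taking instance locks, so a concurrent job
          # may make the reported state momentarily inaccurate (the ERROR_up /
          # ERROR_down case above).
          return [{"name": n, "fields": fields, "locked": use_locking}
                  for n in names]

        # gnt-instance list style usage: fast, lockless, possibly stale.
        print(query_instances(["inst1", "inst2"], ["name", "status"],
                              use_locking=False))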
      
      Reviewed-by: ultrotter
  32. Jan 20, 2009
  33. Jan 13, 2009
    • Forward port the live migration from 1.2 branch · 53c776b5
      Iustin Pop authored
      This is a forward port via copy (and not a cherry-pick of individual
      patches) of the latest code on the 1.2 branch related to the migration.
      
      The changes compared to 1.2 are that we don't need the
      IdentifyDisks step anymore (the drbd rpc calls are independent now), and
      the rpc module improvements.
      
      Reviewed-by: ultrotter