Commits · 4bba8e4cb44ca5b357cbcf895c06a9ea4b4677f5 · itminedu / snf-ganeti

Apr 06, 2011

Increase the lock timeouts before we block-acquire · d385a174

Iustin Pop authored 13 years ago


The previous limit of about one hour before falling back to a blocking
acquire has been observed to cause problems on real clusters via the
following mechanism:

- a long job (e.g. a replace-disks) is keeping an exclusive lock on an
  instance
- the watcher starts and submits its query instances opcode which
  wants shared locks for all instances
- after about an hour, the watcher job falls back to blocking acquire,
  after having acquired all other locks
- any instance opcode that wants an exclusive lock for an instance
  cannot start until the watcher has finished, even though there's no
  actual operation on that instance

To alleviate this problem, we increase the maximum time that lock
acquires keep retrying with timeouts before they fall back to either a
blocking acquire or a priority increase. The timeout is computed such
that we wait ~10 hours (instead of one) for this to happen, which should
be within the maximum lifetime of a reasonable opcode on a healthy
cluster. The timeout also means that priority increases will happen
every half hour.

We also increase the max wait interval to 15 seconds; otherwise we'd
have too many retries with the increased timeout.
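
The trade-off between the wait cap and the number of retries can be sketched as follows. This is a minimal illustration, not the actual mcpu code: it assumes per-attempt timeouts grow from a small minimum up to a cap until their sum reaches the total budget. The function name, the growth factor, the 0.1 s minimum and the 10 s "previous" cap are assumptions; only the ~10 hour budget and the 15 second cap come from the message above.

```python
def calculate_attempt_timeouts(total, min_wait, max_wait, growth=1.5):
    """Pre-compute per-attempt timeouts whose sum reaches `total` seconds.

    Hypothetical helper: all parameter names and the growth rule are
    assumptions made for illustration.
    """
    timeouts = [min_wait]
    elapsed = min_wait
    while elapsed < total:
        # Grow the previous timeout, but never beyond the cap.
        next_timeout = min(timeouts[-1] * growth, max_wait)
        timeouts.append(next_timeout)
        elapsed += next_timeout
    return timeouts


TEN_HOURS = 10 * 3600

# Assumed previous cap of 10 s versus the new cap of 15 s.
retries_with_10s_cap = len(calculate_attempt_timeouts(TEN_HOURS, 0.1, 10.0))
retries_with_15s_cap = len(calculate_attempt_timeouts(TEN_HOURS, 0.1, 15.0))

print(retries_with_10s_cap, retries_with_15s_cap)
```

With a larger cap each attempt waits longer, so the same total budget is covered by fewer retries.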

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Feb 28, 2011

Fix LU processor's GetECId · 3ae70d76

Michael Hanselmann authored 14 years ago


The exception was never actually raised.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Adeodato Simo <dato@google.com>

Jan 10, 2011

mcpu: Automatically build the DISPATCH_TABLE · a1a7bc78

Iustin Pop authored 14 years ago


While reviewing dato's interdiff for the OpAssignGroupNodes, I
realised that we can do better. This patch replaces the hand-built
DISPATCH_TABLE with one built from the opcode.OP_MAPPING dict.
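
A minimal sketch of deriving such a table from a mapping, under stated assumptions: the classes below are empty stand-ins, the mapping keys are hypothetical opcode IDs, and the `Op...` to `LU...` naming rule is an assumption made for illustration rather than the actual lookup used in mcpu.

```python
class OpQuery:
    """Stand-in opcode class."""


class OpQueryGroups:
    """Stand-in opcode class."""


class LUQuery:
    """Stand-in logical unit for OpQuery."""


class LUQueryGroups:
    """Stand-in logical unit for OpQueryGroups."""


# Stand-in for opcodes.OP_MAPPING; the string keys are hypothetical.
OP_MAPPING = {
    "OP_QUERY": OpQuery,
    "OP_QUERY_GROUPS": OpQueryGroups,
}

# Stand-in for looking up logical units by class name.
LU_CLASSES = {cls.__name__: cls for cls in (LUQuery, LUQueryGroups)}


def build_dispatch_table(op_mapping, lu_classes):
    """Map each opcode class to the LU class with the matching name."""
    table = {}
    for op_class in op_mapping.values():
        # Assumed convention: OpFoo is handled by LUFoo.
        lu_name = "LU" + op_class.__name__[len("Op"):]
        table[op_class] = lu_classes[lu_name]
    return table


DISPATCH_TABLE = build_dispatch_table(OP_MAPPING, LU_CLASSES)
```

Building the table this way means a newly registered opcode cannot be forgotten in a second, hand-maintained list; a missing logical unit fails immediately with a `KeyError`.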

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Dec 15, 2010

Rename (Op|LU)OutOfBand to (Op|LU)OobCommand · 792af3ad

René Nussbaumer authored 14 years ago


Signed-off-by: René Nussbaumer <rn@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Dec 13, 2010

Add modification of node groups (OpCode/LU/CLI) · 4da7909a

Adeodato Simo authored 14 years ago


With this commit, only modification of the "ndparams" attribute is
supported.

Signed-off-by: Adeodato Simo <dato@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

Dec 08, 2010

Group operations: OpCode and LU for renaming a group · 4fe5cf90

Adeodato Simo authored 14 years ago


Signed-off-by: Adeodato Simo <dato@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Group operations: OpCode and LU for removing a group · 94bd652a

Adeodato Simo authored 14 years ago


Signed-off-by: Adeodato Simo <dato@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Group operations: OpCode and LU for adding a group · b1ee5610

Adeodato Simo authored 14 years ago


Signed-off-by: Adeodato Simo <dato@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Dec 07, 2010

Adding new OpCode for OOB · eb64da59

René Nussbaumer authored 14 years ago


Register OpCode and Logical Unit in mcpu.py

Signed-off-by: René Nussbaumer <rn@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Dec 01, 2010

Querying node groups: LU/Opcode · 70a6a926

Adeodato Simo authored 14 years ago


This adds opcodes.OpQueryGroups and cmdlib.LUQueryGroups.

Signed-off-by: Adeodato Simo <dato@google.com>
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Nov 29, 2010

Add OpQuery opcode · 83f72637

Michael Hanselmann authored 14 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Nov 16, 2010

Move locking.RunningTimeout to utils · 557838c1

René Nussbaumer authored 14 years ago


As we need this functionality in places other than locking, it makes
sense to move it to utils rather than keeping it in locking.

Signed-off-by: René Nussbaumer <rn@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Oct 12, 2010

mcpu: Raise directly in _AcquireLocks · 900df6cd

Michael Hanselmann authored 14 years ago


Removes code duplication.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Sep 24, 2010

mcpu: Implement priority for lock acquiring · f879a9c7

Michael Hanselmann authored 14 years ago


Until now the priority for lock acquires couldn't be passed
when running opcodes.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

Sep 23, 2010

mcpu: Adjust lock acquire strategy · a7770f03

Michael Hanselmann authored 14 years ago


The changes to job queue processing require some changes on this class'
interface. LockAttemptTimeoutStrategy might move to another place, but that'll
be done in a later patch.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

mcpu.Processor: Raise exception on lock acquire timeout · 831bbbc1

Michael Hanselmann authored 14 years ago


Right now the timeout is not passed by any caller, making the code
effectively go back to blocking acquires. Since the timeout is always
None, no caller needs to be changed in this patch.

This change also means that any LUXI query handled by ganeti-masterd
will use blocking acquires if it needs locks (only the case for getting
tags).
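
The behaviour described above can be sketched as follows. This is an illustration only, using Python 3's `threading.Lock` in place of Ganeti's lock manager; `LockAcquireTimeout` and `acquire_or_raise` are stand-in names chosen for this example.

```python
import threading


class LockAcquireTimeout(Exception):
    """Stand-in exception raised when a timed acquire fails."""


def acquire_or_raise(lock, timeout=None):
    """Acquire `lock`; block forever if timeout is None, else raise on timeout."""
    if timeout is None:
        # No timeout passed by the caller: plain blocking acquire.
        lock.acquire()
        return
    if not lock.acquire(timeout=timeout):
        raise LockAcquireTimeout("lock not acquired within %.2fs" % timeout)
```

Because `timeout` defaults to `None`, existing callers that pass nothing keep their blocking behaviour, which matches the note that no caller needs to change.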

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

Sep 13, 2010

Remove mcpu's ReportLocks callback · acf931b7

Michael Hanselmann authored 14 years ago


This is no longer needed with the new lock monitor. One callback is kept to
check for cancelled jobs.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Jul 15, 2010

Add test for some aspects of job queue · e58f87a9

Michael Hanselmann authored 14 years ago


This new opcode and gnt-debug sub-command test some aspects of the
job queue, including the status of a job. The bug fixed in commit
2034c70d was identified using this test. A future patch will
run this test automatically from the QA scripts.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Jul 12, 2010

Provide feedback function for all LU methods · 7b4c1cb9

Michael Hanselmann authored 14 years ago


By exposing mcpu's _Feedback function (now renamed to “Log”) to LUs,
methods like ExpandNames can also write to the job execution log.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Jun 23, 2010

Remove the obsolete EvacuateNode OpCode/LU · 8de1f1ee

Iustin Pop authored 14 years ago


All code has been switched to the new-style LU… time for cleanup.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

May 18, 2010

Add opcode to prepare export · 1410fa8d

Michael Hanselmann authored 14 years ago


To prepare a remote export, the X509 key and certificate need to be generated.
A handshake value is also returned for an easier check whether both clusters
share the same cluster domain secret.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Feb 22, 2010

Add LUNodeEvacuationStrategy · f7e7689f

Iustin Pop authored 15 years ago


Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Jan 25, 2010

Fix an unsafe formatting bug · 62579388

Iustin Pop authored 15 years ago


This might fix issue 84; in any case, the current code contains
potentially unsafe formatting, which should be fixed.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Jan 13, 2010

mcpu: Log lock status with sorted names · 4776e022

Michael Hanselmann authored 15 years ago


Reading and comparing sorted lists is easier when debugging locking problems.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Jan 04, 2010

Add targeted pylint disables · 7260cfbe

Iustin Pop authored 15 years ago


This patch should have only:

- pylint disables
- docstring changes
- whitespace changes

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Olivier Tharan <olive@google.com>

Remove many 'Unused variable' warnings · 1122eb25

Iustin Pop authored 15 years ago


Note there are some cases left which need extra cleanup.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Olivier Tharan <olive@google.com>

Dec 28, 2009

Add targeted pylint disables · fe267188

Iustin Pop authored 15 years ago


This patch adds targeted pylint disables, where it makes sense (either
due to limitations in pylint or due to historical usage), and also a few
blanket ones in rapi where all the names are… “different”.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Olivier Tharan <olive@google.com>

Nov 06, 2009

Add config.DropECReservations · 73064714

Guido Trotter authored 15 years ago


For now this function does nothing, but it gets called by mcpu when the
execution of an LU is done, making sure any pending reservations are
dropped.

Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Processor: support a unique execution id · adfa97e3

Guido Trotter authored 15 years ago


When the processor is executing a job, it can export the execution id to
its callers. This is not supported for Queries, as they're not executed
in a job.

Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Nov 02, 2009

Convert the rest of the OpPrereqError users · debac808

Iustin Pop authored 15 years ago


This finishes the conversion of OpPrereqError creation to two-argument
style. Any remaining one-argument uses do not break anything; they just
lose information about the errors.
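
A small sketch of why one-argument leftovers are harmless but less informative. The class below is a stand-in, and `ECODE_INVAL` and `describe` are hypothetical names for this example; it assumes the second argument is an error code.

```python
ECODE_INVAL = "wrong_input"  # hypothetical error-code constant


class OpPrereqError(Exception):
    """Stand-in for the real exception class."""


def describe(err):
    """Return (message, code); code is None for one-argument errors."""
    message = err.args[0] if err.args else ""
    code = err.args[1] if len(err.args) >= 2 else None
    return message, code


new_style = OpPrereqError("Invalid instance name", ECODE_INVAL)
old_style = OpPrereqError("Invalid instance name")
```

A consumer such as `describe` still works for `old_style`; it simply has no error code to report.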

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Oct 15, 2009

mcpu: Use new timeout class for timeout · a6db1af2

Michael Hanselmann authored 15 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

locking, mcpu: Ensure timeout is always >= 0.0 · b6b87034

Michael Hanselmann authored 15 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Oct 13, 2009

mcpu: Make sure added locks are released on errors · 6f14fc27

Michael Hanselmann authored 15 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

Oct 12, 2009

mcpu: Change lock attempt timeout calculation · e3200b18

Michael Hanselmann authored 15 years ago


With this patch all timeouts are pre-calculated. The interface of
the _LockTimeoutStrategy class is also changed a bit; NextAttempt
now returns a new instance.
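
The interface change can be sketched like this. It is an illustration, not the class from the patch: the class name, method names and the timeout table are all assumptions.

```python
class LockTimeoutStrategy:
    """Immutable view of one lock attempt (illustrative stand-in)."""

    # Hypothetical pre-calculated table, fixed when the class is defined.
    _TIMEOUTS = (0.1, 0.5, 2.0, 10.0)

    def __init__(self, attempt=0):
        self._attempt = attempt

    def timeout(self):
        """Return this attempt's timeout, or None for a blocking acquire."""
        if self._attempt < len(self._TIMEOUTS):
            return self._TIMEOUTS[self._attempt]
        return None

    def next_attempt(self):
        """Return a new strategy for the next attempt; self is unchanged."""
        return LockTimeoutStrategy(self._attempt + 1)
```

Returning a new instance instead of mutating state means a caller holding an older strategy object always sees the same timeout for it.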

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

Code and docstring style fixes · 69b99987

Michael Hanselmann authored 15 years ago


Found using pylint and epydoc.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

mcpu: Improve lock reporting with timeouts · 211b6132

Michael Hanselmann authored 15 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

mcpu: Implement lock timeouts · 407339d0

Michael Hanselmann authored 15 years ago


The timeout is always between ~0.1 and ~10.0 seconds. A small
variation of ±5% is added to prevent different jobs from
fighting each other. After 10 attempts to acquire the locks with
a timeout, a blocking acquire is made.

Lock status reporting will be improved in a separate patch.
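
The scheme described above (timeouts from ~0.1 to ~10.0 seconds, ±5% variation, blocking acquire after 10 attempts) can be sketched as follows. This is not the patch's code: the geometric spacing between the bounds, the function names and the use of Python 3's `threading.Lock` are assumptions made for illustration.

```python
import random
import threading


def attempt_timeouts(attempts=10, min_wait=0.1, max_wait=10.0, rng=random):
    """Yield one jittered timeout per attempt, then None (blocking acquire)."""
    for i in range(attempts):
        # Assumed spacing: geometric steps from min_wait up to max_wait.
        base = min_wait * (max_wait / min_wait) ** (i / (attempts - 1))
        # +/-5% variation so competing jobs do not retry in lockstep.
        yield base * (1.0 + rng.uniform(-0.05, 0.05))
    yield None


def acquire_with_fallback(lock, timeouts):
    """Try timed acquires in order; a None timeout means block until acquired."""
    for timeout in timeouts:
        if timeout is None:
            lock.acquire()
            return True
        if lock.acquire(timeout=timeout):
            return True
    return False
```

For example, `acquire_with_fallback(threading.Lock(), attempt_timeouts())` succeeds on the first timed attempt because the lock is free.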

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

mcpu: Remove unused exclusive_BGL attribute · 6b95b76d

Michael Hanselmann authored 15 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

Sep 17, 2009

Remove RpcResult.RemoteFailMsg completely · 3cebe102

Michael Hanselmann authored 15 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Sep 15, 2009

Keep lock status with every job · ef2df7d3

Michael Hanselmann authored 15 years ago


This can be useful for debugging locking problems.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
