- Apr 06, 2011
-
-
Iustin Pop authored
This has been observed to cause problems on real clusters via the following mechanism: - a long job (e.g. a replace-disks) is keeping an exclusive lock on an instance - the watcher starts and submits its query instances opcode which wants shared locks for all instances - after about an hour, the watcher job falls back to blocking acquire, after having acquired all other locks - any instance opcode that wants an exclusive lock for an instance cannot start until the watcher has finished, even though there's no actual operation on that instance In order to alleviate this problem, we simply increase the max timeout until lock acquires are sent back to either blocking acquire or priority increase. The timeout is computed such that we wait ~10 hours (instead of one) for this to happen, which should be within the maximum lifetime of a reasonable opcode on a healthy cluster. The timeout also means that priority increases will happen every half hour. We also increase the max wait interval to 15 seconds, otherwise we'd have too many retries with the increased interval. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
-
- Feb 28, 2011
-
-
Michael Hanselmann authored
The exception was never actually raised. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Adeodato Simo <dato@google.com>
-
- Jan 10, 2011
-
-
Iustin Pop authored
While reviewing dato's interdiff for the OpAssignGroupNodes, I realised that we can do better. This patch replaces the hand-built DISPATCH_TABLE with one built from the opcode.OP_MAPPING dict. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
-
- Dec 15, 2010
-
-
René Nussbaumer authored
Signed-off-by:
René Nussbaumer <rn@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
- Dec 13, 2010
-
-
Adeodato Simo authored
With this commit, only modification of the "ndparams" attribute is supported. Signed-off-by:
Adeodato Simo <dato@google.com> Reviewed-by:
Guido Trotter <ultrotter@google.com>
-
- Dec 08, 2010
-
-
Adeodato Simo authored
Signed-off-by:
Adeodato Simo <dato@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
-
Adeodato Simo authored
Signed-off-by:
Adeodato Simo <dato@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
-
Adeodato Simo authored
Signed-off-by:
Adeodato Simo <dato@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
-
- Dec 07, 2010
-
-
René Nussbaumer authored
Register OpCode and Logical Unit in mcpu.py Signed-off-by:
René Nussbaumer <rn@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
-
- Dec 01, 2010
-
-
Adeodato Simo authored
This adds opcodes.OpQueryGroups and cmdlib.LUQueryGroups. Signed-off-by:
Adeodato Simo <dato@google.com> Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
- Nov 29, 2010
-
-
Michael Hanselmann authored
Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
- Nov 16, 2010
-
-
René Nussbaumer authored
As we need this functionality in other places than just locking it makes sense to move it to utils rather than keeping it in locking Signed-off-by:
René Nussbaumer <rn@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
-
- Oct 12, 2010
-
-
Michael Hanselmann authored
Removes code duplication. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
- Sep 24, 2010
-
-
Michael Hanselmann authored
Until now the priority for lock acquires couldn't be passed when running opcodes. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
René Nussbaumer <rn@google.com>
-
- Sep 23, 2010
-
-
Michael Hanselmann authored
The changes to job queue processing require some changes on this class' interface. LockAttemptTimeoutStrategy might move to another place, but that'll be done in a later patch. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
René Nussbaumer <rn@google.com>
-
Michael Hanselmann authored
Right now the timeout is not passed by any caller, making the code effectively go back to blocking acquires. Since the timeout is always None, no caller needs to be changed in this patch. This change also means that any LUXI query handled by ganeti-masterd will use blocking acquires if they need locks (only the case for getting tags). Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
René Nussbaumer <rn@google.com>
-
- Sep 13, 2010
-
-
Michael Hanselmann authored
This is no longer needed with the new lock monitor. One callback is kept to check for cancelled jobs. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
- Jul 15, 2010
-
-
Michael Hanselmann authored
This new opcode and gnt-debug sub-command test some aspects of the job queue, including the status of a job. The bug fixed in commit 2034c70d was identified using this test. A future patch will run this test automatically from the QA scripts. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
- Jul 12, 2010
-
-
Michael Hanselmann authored
By exposing mcpu's _Feedback function (now renamed to “Log”) to LU's, methods like ExpandNames can also write to the job execution log. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
- Jun 23, 2010
-
-
Iustin Pop authored
All code has been switched to the new-style LU… time for cleanup. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Guido Trotter <ultrotter@google.com>
-
- May 18, 2010
-
-
Michael Hanselmann authored
To prepare a remote export, the X509 key and certificate need to be generated. A handshake value is also returned for an easier check whether both clusters share the same cluster domain secret. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
- Feb 22, 2010
-
-
Iustin Pop authored
Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
-
- Jan 25, 2010
-
-
Iustin Pop authored
This might fix issue 84; in any case, the current situation is that we have a potentially unsafe formatting, which should be fixed. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
-
- Jan 13, 2010
-
-
Michael Hanselmann authored
Reading and comparing sorted lists is easier when debugging locking problems. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
- Jan 04, 2010
-
-
Iustin Pop authored
This patch should have only: - pylint disables - docstring changes - whitespace changes Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Olivier Tharan <olive@google.com>
-
Iustin Pop authored
Note there are some cases left which need extra cleanup. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Olivier Tharan <olive@google.com>
-
- Dec 28, 2009
-
-
Iustin Pop authored
This patch adds targeted pylint disables, where it makes sense (either due to limitations in pylint or due to historical usage), and also a few blanket ones in rapi where all the names are… “different”. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Olivier Tharan <olive@google.com>
-
- Nov 06, 2009
-
-
Guido Trotter authored
For now this function does nothing, but it gets called by mcpu when the execution of an LU is done, making sure any pending reservations are dropped. Signed-off-by:
Guido Trotter <ultrotter@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
Guido Trotter authored
When the processor is executing a job, it can export the execution id to its callers. This is not supported for Queries, as they're not executed in a job. Signed-off-by:
Guido Trotter <ultrotter@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
- Nov 02, 2009
-
-
Iustin Pop authored
This finishes the conversion of OpPrereqError creation to two-argument style. Any leftovers as one-argument are not breaking anything, just losing information about the errors. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
-
- Oct 15, 2009
-
-
Michael Hanselmann authored
Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
- Oct 13, 2009
-
-
Michael Hanselmann authored
Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Guido Trotter <ultrotter@google.com>
-
- Oct 12, 2009
-
-
Michael Hanselmann authored
With this patch all timeouts are pre-calculated. The interface of the _LockTimeoutStrategy class is also changed a bit; NextAttempt now returns a new instance. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Guido Trotter <ultrotter@google.com>
-
Michael Hanselmann authored
Found using pylint and epydoc. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Guido Trotter <ultrotter@google.com>
-
Michael Hanselmann authored
Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Guido Trotter <ultrotter@google.com>
-
Michael Hanselmann authored
The timeout is always between ~0.1 and ~10.0 seconds. A small variation of ±5% is added to prevent different jobs from fighting each other. After 10 attempts to acquire the locks with a timeout, a blocking acquire is made. Lock status reporting will be improved in a separate patch. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Guido Trotter <ultrotter@google.com>
-
Michael Hanselmann authored
Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Guido Trotter <ultrotter@google.com>
-
- Sep 17, 2009
-
-
Michael Hanselmann authored
Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
- Sep 15, 2009
-
-
Michael Hanselmann authored
This can be useful for debugging locking problems. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-