- Dec 29, 2010
-
-
Michael Hanselmann authored
Since the recent change to leave jobs in the “waitlock” status (commit 5fd6b694), cancelling a job while it's back in the queue would break. This patch handles these cases and adds a unittest. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
- Dec 15, 2010
-
-
Michael Hanselmann authored
Iustin Pop reported that a job's file is updated many times while it waits for locks held by other thread(s). After an investigation it was concluded that the reason was a design decision for job priorities to return jobs to the “queued” status if they couldn't acquire all locks. Changing a job's status or priority requires an update to permanent storage.

In a high-level view this is what happens:

1. Mark as waitlock
2. Write to disk as permanent storage (jobs left in this state by a crashing master daemon are resumed on restart)
3. Wait for lock (assume lock is held by another thread)
4. Mark as queued
5. Write to disk again
6. Return to workerpool

Another option originally discussed was to leave the job in the “waitlock” status. Ignoring priority changes, this is what would happen:

1. If not in waitlock
  1.1. Assert state == queued
  1.2. Mark as waitlock
  1.3. Set start_timestamp
  1.4. Write to disk as permanent storage
3. Wait for locks (assume lock is held by another thread)
4. Leave in waitlock
5. Return to workerpool

Now let's assume the lock is released by the other thread:

[…]
3. Wait for locks and get them
4. Assert state == waitlock
5. Set state to running
6. Set exec_timestamp
7. Write to disk

This change reduces the number of writes from two per lock acquire attempt to two per opcode and one per priority increase (which happens after 24 acquire attempts, see mcpu._CalculateLockAttemptTimeouts, until the highest priority is reached), so here's the patch to implement it. Unittests are updated. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
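
The difference in disk writes between the two flows described in the message above can be illustrated with a small simulation. This is only a sketch, not Ganeti code: `FakeJob`, `old_flow` and `new_flow` are hypothetical names, a counter stands in for writing the job file, and timestamps and priority changes are left out.

```python
class FakeJob:
    """Hypothetical stand-in for a queued job; counts simulated disk writes."""

    def __init__(self):
        self.status = "queued"
        self.writes = 0

    def set_status(self, status):
        self.status = status
        self.writes += 1  # every status change goes to permanent storage


def old_flow(job, failed_attempts):
    """Old behaviour: return the job to "queued" after each failed acquire."""
    for _ in range(failed_attempts):
        job.set_status("waitlock")
        job.set_status("queued")  # lock not acquired, back to the queue
    job.set_status("waitlock")
    job.set_status("running")  # locks finally acquired


def new_flow(job, failed_attempts):
    """New behaviour: leave the job in "waitlock" between attempts."""
    for _ in range(failed_attempts + 1):
        if job.status != "waitlock":
            assert job.status == "queued"
            job.set_status("waitlock")
        # Otherwise the job is already in waitlock and nothing is written.
    assert job.status == "waitlock"
    job.set_status("running")  # locks finally acquired
```

With five failed acquire attempts, the old flow in this sketch performs twelve simulated writes and the new flow two.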
-
- Oct 12, 2010
-
-
Michael Hanselmann authored
If a job was cancelled while it was waiting for locks, an assertion would've failed. This patch fixes the problem and provides a unit test to check for this situation. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
Commit 5ef699a0 had to roll back an earlier attempt at implementing this. With the improved job queue processor, this is finally possible. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
These fields can help with debugging. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
- Oct 07, 2010
-
-
Michael Hanselmann authored
This simplifies the code a bit--the status is only checked once. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
- Sep 24, 2010
-
-
Michael Hanselmann authored
epydoc complained: “File …/ganeti/jqueue.py, line 886, in ganeti.jqueue._JobProcessor._MarkWaitlock Warning: Redefinition of type for job” Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
René Nussbaumer <rn@google.com>
-
Michael Hanselmann authored
As already noted in the design document, an opcode's priority is increased when the lock(s) can't be acquired within a certain amount of time, except at the highest priority, where in such a case a blocking acquire is used. A unittest is provided. Priorities are not yet used for acquiring the lock(s)—this will need further changes on mcpu. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
René Nussbaumer <rn@google.com>
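
The escalation rule described above (raise the priority when the lock is not acquired in time, and block at the highest priority) might look roughly like the following. This is a simplified sketch under stated assumptions: the constant values, the function name `acquire_with_priority` and the use of `threading.Lock` are illustrative only, and the loop retries in place instead of returning the job to the queue between attempts.

```python
import threading

# Assumed values for this sketch: a smaller number means a higher priority.
PRIO_NORMAL = 0
PRIO_HIGHEST = -20


def acquire_with_priority(lock, priority, timeout=0.01):
    """Acquire `lock`, raising the priority after each timed-out attempt.

    Returns the priority at which the lock was finally acquired.
    """
    while True:
        if priority <= PRIO_HIGHEST:
            lock.acquire()  # blocking acquire at the highest priority
            return priority
        if lock.acquire(timeout=timeout):
            return priority
        priority -= 1  # lock not acquired in time: increase the priority
```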
-
- Sep 23, 2010
-
-
Michael Hanselmann authored
This is better to group per-opcode data. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
René Nussbaumer <rn@google.com>
-
Michael Hanselmann authored
Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
- Sep 20, 2010
-
-
Michael Hanselmann authored
In order to support priorities, the processing of jobs needs to be changed. Instead of processing jobs as a whole, the code is changed to process one opcode at a time and then return to the queue. See the Ganeti 2.3 design document for details. This patch does not yet use priorities for acquiring locks. The enclosed unittests increase the test coverage of jqueue.py from about 34% to 58%. Please note that they also test some parts not added by this patch, but testing them only became possible with some infrastructure added by this patch. For the first time, many implications and assumptions for the job queue are codified in these unittests. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
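
A rough illustration of why processing one opcode at a time helps with priorities: each time a job returns to the queue, a more urgent job gets a chance to run first. The sketch below is not the Ganeti implementation; `process_jobs`, the heap-based queue and the priority numbers are assumptions made for the example.

```python
import heapq
import itertools


def process_jobs(jobs):
    """Run jobs one opcode at a time, re-queueing a job after every opcode.

    `jobs` maps a job name to a list of (priority, opcode) tuples, where a
    smaller number means a higher priority. Returns the opcodes in the
    order they were executed.
    """
    counter = itertools.count()  # tie-breaker keeps queue order stable
    queue = []
    for name, opcodes in jobs.items():
        if opcodes:
            heapq.heappush(queue, (opcodes[0][0], next(counter), name, 0))

    executed = []
    while queue:
        _prio, _order, name, index = heapq.heappop(queue)
        opcodes = jobs[name]
        executed.append(opcodes[index][1])  # "execute" exactly one opcode
        if index + 1 < len(opcodes):
            # Return the job to the queue, keyed by its next opcode's priority.
            next_prio = opcodes[index + 1][0]
            heapq.heappush(queue, (next_prio, next(counter), name, index + 1))
    return executed
```

In this sketch, two jobs of equal priority alternate opcode by opcode, while a job with a higher priority runs all of its opcodes first.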
-
Michael Hanselmann authored
A small helper function is added to make this easier. Priorities are not yet used in all necessary places. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
This was forgotten in commit 099b2870. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
- Sep 16, 2010
-
-
Michael Hanselmann authored
Moving the internals of this function will allow it to be used from unittests in the future. Splitting this into a pure, side-effect free function and an impure one makes the pure function easily testable. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
- Sep 13, 2010
-
-
Michael Hanselmann authored
Quoting the design document: “Submitted opcodes can have one of the priorities listed below. Other priorities are reserved for internal use”. Submitting jobs at priority -20 should not be allowed. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
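
A check of this kind could be sketched as follows. The constant names and numeric values are assumptions for illustration (the message above only states that -20 must be rejected), and `check_submit_priority` is a hypothetical helper.

```python
# Assumed values for this sketch: a smaller number means a higher priority.
PRIO_LOW = 10
PRIO_NORMAL = 0
PRIO_HIGH = -10

# Priorities a client may use when submitting; anything else is reserved.
SUBMIT_PRIORITIES = frozenset([PRIO_LOW, PRIO_NORMAL, PRIO_HIGH])


def check_submit_priority(priority):
    """Return `priority` if clients may submit it, else raise ValueError."""
    if priority not in SUBMIT_PRIORITIES:
        raise ValueError("Priority %r is reserved for internal use" % (priority,))
    return priority
```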
-
Michael Hanselmann authored
This allows clients to submit opcodes with a priority. Except for being tracked by the job queue, it is not yet used by any code. Unittests for jqueue._QueuedOpCode and jqueue._QueuedJob are provided for the first time. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
This is no longer needed with the new lock monitor. One callback is kept to check for cancelled jobs. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
This reverts commit 4008c8ed. While it worked in my initial tests, I've now found cases where this doesn't work properly as it is. More work is needed and will be done as part of the Ganeti 2.3 job queue changes. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
- Sep 10, 2010
-
-
Michael Hanselmann authored
After an unclean restart of ganeti-masterd, jobs in the “waitlock” status can be safely restarted. They hadn't modified the cluster yet. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
René Nussbaumer <rn@google.com>
-
Michael Hanselmann authored
This makes the __init__ function a lot smaller while not changing functionality. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
René Nussbaumer <rn@google.com>
-
Michael Hanselmann authored
This reduced the number of updates to the job files. It's used in two places while processing a job and the file is updated just afterwards. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
- Sep 07, 2010
-
-
René Nussbaumer authored
Signed-off-by:
René Nussbaumer <rn@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
-
Michael Hanselmann authored
Comes with unittest. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
- Aug 24, 2010
-
-
Michael Hanselmann authored
With this patch, the task name is added to the thread name and will show up in logs. Log messages from jobs will look like “pid=578/JobQueue14/Job13 mcpu:289 DEBUG LU locks acquired/cluster/BGL/shared”. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Guido Trotter <ultrotter@google.com>
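
One way to get a task name into log output is to append it to the thread name while the task runs, so that a log format containing `%(threadName)s` picks it up. The helper below is a sketch of that idea in Python 3; `task_name` is a hypothetical name and not the function used in the patch.

```python
import contextlib
import threading


@contextlib.contextmanager
def task_name(name):
    """Temporarily append a task name to the current thread's name."""
    thread = threading.current_thread()
    original = thread.name
    thread.name = "%s/%s" % (original, name)
    try:
        yield
    finally:
        thread.name = original  # restore the name once the task is done
```

A worker thread named `JobQueue14` that runs its task inside `with task_name("Job13"):` would log as `JobQueue14/Job13` for the duration of the task.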
-
- Aug 19, 2010
-
-
Michael Hanselmann authored
With the job queue changes for Ganeti 2.2, watched and queried jobs are loaded directly from disk, rendering the in-memory “lock_status” field useless. Writing it to disk would be possible, but has a huge cost at runtime (when tested, processing 1'000 opcodes involved 4'000 additional writes to job files, even with replication turned off). Using an additional in-memory dictionary to just manage this field turned out to be a complicated task due to the necessary locking. The plan is to introduce a more generic lock debugging mechanism in the near future. Hence the decision is to remove this field now instead of spending a lot of time to make it work again. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
- Aug 18, 2010
-
-
Michael Hanselmann authored
When an opcode fails, the job queue would leave following opcodes as “queued”, which can be quite confusing. With this patch, they're all marked as failed and assertions are added to check this. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
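
The behaviour described above can be sketched as follows. The status strings and the helper name `mark_failed_from` are assumptions made for this example; the real job queue works on opcode objects, not on a plain list.

```python
# Assumed status names for this sketch.
STATUS_QUEUED = "queued"
STATUS_ERROR = "error"


def mark_failed_from(statuses, failed_index):
    """Return a copy of `statuses` where the failed opcode and every
    opcode after it are marked as failed."""
    result = list(statuses)
    for index in range(failed_index, len(result)):
        result[index] = STATUS_ERROR
    # No opcode after a failure may be left looking as if it could still run.
    assert STATUS_QUEUED not in result[failed_index:]
    return result
```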
-
Michael Hanselmann authored
This is a simplified version of a patch I sent earlier to make sure the job file is only written once with a finalized status. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
Manuel Franceschini authored
This patch enables IPv6 name resolution by using socket.getaddrinfo instead of socket.gethostbyname_ex. It renames the HostInfo class to Hostname and unifies its use throughout the code. This is achieved by using static calls where no object is needed and removes some obsolete code. For now, we just resolve to IPv4 addresses, but this will change once it is needed. Signed-off-by:
Manuel Franceschini <livewire@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
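
For illustration, resolving a name to IPv4 addresses with `socket.getaddrinfo` might look like this. The function `resolve_ipv4` is a hypothetical helper, not the `Hostname` class from the patch; restricting the lookup to `socket.AF_INET` mirrors the "IPv4 only for now" behaviour described above.

```python
import socket


def resolve_ipv4(hostname):
    """Return the IPv4 addresses for `hostname` using socket.getaddrinfo."""
    results = socket.getaddrinfo(hostname, None, socket.AF_INET,
                                 socket.SOCK_STREAM)
    addresses = []
    # Each entry is (family, type, proto, canonname, sockaddr); for AF_INET
    # the sockaddr is an (address, port) tuple.
    for _family, _type, _proto, _canonname, sockaddr in results:
        if sockaddr[0] not in addresses:
            addresses.append(sockaddr[0])
    return addresses
```

Passing a different address family to the same call is what would make a later switch to IPv6 results possible.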
-
- Aug 17, 2010
-
-
Michael Hanselmann authored
We can also check when the lock status is updated. This will improve job cancelling. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
- Jul 30, 2010
-
-
Iustin Pop authored
This patch fixes two issues with job archival. First, LoadJobFromDisk can return 'None' for no-such-job, and we shouldn't add None to the job list; we can't anyway, as this raises an exception:

  node1# gnt-job archive foo
  Unhandled protocol error while talking to the master daemon:
  Caught exception: cannot create weak reference to 'NoneType' object

Second, after fixing this, job archival of missing jobs will just continue silently, so we modify gnt-job archive to log jobs which were not archived and to return exit code 1 for any missing jobs. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
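
The failure mode and the fix can be reproduced in a few lines. In this sketch, `Job`, `collect_jobs` and the `load_job` callback are hypothetical stand-ins; the only real API used is `weakref.WeakValueDictionary`, which cannot hold `None` as a value.

```python
import weakref


class Job:
    """Hypothetical minimal job object."""

    def __init__(self, job_id):
        self.id = job_id


def collect_jobs(job_ids, load_job, cache):
    """Load jobs into a weak-valued cache, skipping IDs that don't exist.

    Returns (jobs, missing_ids) so the caller can report missing jobs and
    choose a non-zero exit code.
    """
    jobs = []
    missing = []
    for job_id in job_ids:
        job = load_job(job_id)
        if job is None:
            # Storing None in the cache would raise TypeError, so record
            # the missing ID instead.
            missing.append(job_id)
            continue
        cache[job_id] = job
        jobs.append(job)
    return jobs, missing
```

A caller in the spirit of the patch would log the missing IDs and exit with code 1 when `missing` is not empty.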
-
- Jul 29, 2010
-
-
Iustin Pop authored
Currently, if a job execution raises a Ganeti-specific error (i.e. subclass of GenericError), then we encode it as (error class, [error args]). This matches the RAPI documentation. However, if we get a non-Ganeti error, then we encode it as simply str(err), a single string. This means that the opresult field is not according to the RAPI docs, and thus it's hard to reliably parse the job results.

This patch changes the encoding of a failed job (via failure) to always be an OpExecError, so that we always encode it properly. For the command line interface, the behaviour is the same, as any non-Ganeti errors get re-encoded as OpExecError anyway. For the RAPI clients, it only means that we always present the same type for results. The actual error value is the same, since the err.args is either way str(original_error); compare the original (doesn't contain the ValueError):

  "opresult": [ "invalid literal for int(): aa" ],

with:

  "opresult": [ [ "OpExecError", [ "invalid literal for int(): aa" ] ] ],

Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
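
The uniform encoding described above can be sketched like this. The two exception classes are redefined locally so the example is self-contained, and `encode_failure` is a hypothetical name, not the function changed by the patch.

```python
class GenericError(Exception):
    """Local stand-in for the base class of Ganeti-specific errors."""


class OpExecError(GenericError):
    """Local stand-in for the error used for failed opcode execution."""


def encode_failure(err):
    """Encode an exception as (error class name, [error args]).

    Errors that are not GenericError subclasses are wrapped in OpExecError
    first, so the result always has the same shape.
    """
    if not isinstance(err, GenericError):
        err = OpExecError(str(err))
    return (err.__class__.__name__, list(err.args))
```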
-
Michael Hanselmann authored
By changing it to a normal parameter, which must be a sequence, we can start using keyword parameters. Before this patch all arguments to “AddTask(self, *args)” were passed as arguments to the worker's “RunTask” method. Priorities, which should be optional and will be implemented in a future patch, must be passed as a keyword parameter. This means “*args” can no longer be used as one can't combine *args and keyword parameters in a clean way:

  >>> def f(name=None, *args):
  ...   print "%r, %r" % (args, name)
  ...
  >>> f("p1", "p2", "p3", name="thename")
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  TypeError: f() got multiple values for keyword argument 'name'

Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
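
With the task arguments passed as one sequence, an optional keyword parameter no longer collides with them. The class below is a Python 3 sketch of that calling convention only; `WorkerPool`, `add_task` and `run_next` are hypothetical names, and the priority handling is purely illustrative since the message above defers it to a future patch.

```python
import heapq
import itertools


class WorkerPool:
    """Hypothetical minimal pool that stores tasks until they are run."""

    def __init__(self):
        self._tasks = []
        self._counter = itertools.count()  # keeps equal priorities in order

    def add_task(self, args, priority=0):
        """Queue a task; `args` must be a sequence passed on to the worker."""
        if not isinstance(args, (list, tuple)):
            raise TypeError("args must be a sequence")
        entry = (priority, next(self._counter), tuple(args))
        heapq.heappush(self._tasks, entry)

    def run_next(self, run_task):
        """Pop the most urgent task and call `run_task` with its arguments."""
        _priority, _order, args = heapq.heappop(self._tasks)
        return run_task(*args)
```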
-
- Jul 16, 2010
-
-
Iustin Pop authored
This patch adds lock names to SharedLocks and LockSets, that can be used later for displaying the actual locks being held/used in places where we only have the lock, and not the entire context of the locking operation. Since I realized that the production code doesn't call LockSet with the proper members= syntax, but directly as positional parameters, I've converted this (and the arguments to GlobalLockManager) into positional arguments. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
-
- Jul 15, 2010
-
-
Michael Hanselmann authored
By splitting the _WaitForJobChangesHelper class into multiple smaller classes, we gain in several places:

- Simpler code, less interaction between functions and variables
- Easy to unittest (close to 100% coverage)
- Waiting for job changes has no direct knowledge of queue anymore (it doesn't reference queue functions anymore, especially not private ones)
- Activate inotify only if there was no change at the beginning (and checking again right away to avoid race conditions)

Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Guido Trotter <ultrotter@google.com>
-
- Jul 12, 2010
-
-
Michael Hanselmann authored
Since the code waiting for job changes was modified to use inotify, a race condition between checking for changes the first time and setting up inotify occurs. If the job is modified after the check but before inotify is active, changes would only be noticed after the timeout (29 seconds in most cases) expired. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
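
The usual remedy for this kind of race is to check again once the watch is active. The sketch below shows the pattern with plain callbacks instead of real inotify calls; `wait_for_change` and its parameters are hypothetical names for this example.

```python
def wait_for_change(has_changed, setup_watch, wait, timeout):
    """Check, start watching, then check again before blocking.

    `has_changed` reports whether the job differs from what the caller
    already knows, `setup_watch` activates the file watch, and
    `wait(timeout)` blocks until the watch fires or the timeout expires.
    """
    if has_changed():
        return True
    setup_watch()
    # Re-check: the job may have been modified between the first check
    # and the moment the watch became active.
    if has_changed():
        return True
    return wait(timeout)
```

Without the second check, a modification that lands between the first check and `setup_watch()` would only be noticed when `wait` times out.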
-
- Jul 09, 2010
-
-
Manuel Franceschini authored
This patch moves network utility functions to a dedicated module. Signed-off-by:
Manuel Franceschini <livewire@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
- Jul 06, 2010
-
-
Iustin Pop authored
With the recent changes in the job queue, an old bug surfaced: we never serialized the status change when in NotifyStart, thus a crash of the master would have left the job queue oblivious to the fact that the job was actually running. In the previous implementation, queries against the job status were using the in-memory object, so they 'saw' and reported correctly the running status. But the new implementation just looks at the on-disk version, and thus didn't see this transition. The patch also moves NotifyStart to a decorator-based version (like the other functions), which generates a lot of churn in the diff, sorry. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
-