- Sep 13, 2010
Michael Hanselmann authored
This is no longer needed with the new lock monitor. One callback is kept to check for cancelled jobs.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
This reverts commit 4008c8ed. While it worked in my initial tests, I've now found cases where this doesn't work properly as it is. More work is needed and will be done as part of the Ganeti 2.3 job queue changes.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
- Sep 10, 2010
Michael Hanselmann authored
After an unclean restart of ganeti-masterd, jobs in the “waitlock” status can be safely restarted, as they hadn't modified the cluster yet.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
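As a hedged illustration of the restart pass described above, in Python with invented names (requeue_waitlock_jobs, add_task, the status strings); Ganeti's actual code differs:

  # Sketch: jobs still waiting for locks never touched the cluster,
  # so after an unclean masterd restart they can simply be re-queued.
  JOB_STATUS_WAITLOCK = "waitlock"
  JOB_STATUS_QUEUED = "queued"

  def requeue_waitlock_jobs(jobs, worker_pool):
    """Re-queue jobs that were only waiting for locks."""
    for job in jobs:
      if job.status == JOB_STATUS_WAITLOCK:
        job.status = JOB_STATUS_QUEUED  # safe: no cluster changes yet
        worker_pool.add_task(job)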
-
Michael Hanselmann authored
This makes the __init__ function a lot smaller while not changing functionality.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
-
Michael Hanselmann authored
This reduces the number of updates to the job files. It's used in two places while processing a job, and the file is updated just afterwards.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
- Sep 07, 2010
Michael Hanselmann authored
Comes with unittest.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
- Aug 24, 2010
Michael Hanselmann authored
With this patch, the task name is added to the thread name and will show up in logs. Log messages from jobs will look like “pid=578/JobQueue14/Job13 mcpu:289 DEBUG LU locks acquired/cluster/BGL/shared”.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
- Aug 19, 2010
Michael Hanselmann authored
With the job queue changes for Ganeti 2.2, watched and queried jobs are loaded directly from disk, rendering the in-memory “lock_status” field useless. Writing it to disk would be possible, but has a huge cost at runtime (when tested, processing 1'000 opcodes involved 4'000 additional writes to job files, even with replication turned off). Using an additional in-memory dictionary to manage just this field turned out to be a complicated task due to the necessary locking. The plan is to introduce a more generic lock debugging mechanism in the near future. Hence the decision is to remove this field now instead of spending a lot of time making it work again.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
- Aug 18, 2010
Michael Hanselmann authored
When an opcode fails, the job queue would leave the following opcodes as “queued”, which can be quite confusing. With this patch, they're all marked as failed, and assertions are added to check this.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
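A rough sketch of that finalization step, with hypothetical names (finalize_remaining_opcodes, the status constants); the real patch operates on Ganeti's own job and opcode objects:

  OP_STATUS_QUEUED = "queued"
  OP_STATUS_ERROR = "error"

  def finalize_remaining_opcodes(job, failed_idx):
    """Mark every opcode after a failed one as failed, not 'queued'."""
    for op in job.ops[failed_idx + 1:]:
      if op.status == OP_STATUS_QUEUED:
        op.status = OP_STATUS_ERROR
        op.result = "Not executed because of an earlier opcode failure"
    # The added assertions check that nothing stays in a pending state
    assert all(op.status != OP_STATUS_QUEUED for op in job.ops)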
-
Michael Hanselmann authored
This is a simplified version of a patch I sent earlier to make sure the job file is only written once with a finalized status.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
- Aug 17, 2010
Michael Hanselmann authored
We can also check when the lock status is updated. This will improve job cancelling.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
- Jul 30, 2010
Iustin Pop authored
This patch fixes two issues with job archival. First, LoadJobFromDisk can return 'None' for no-such-job, and we shouldn't add None to the job list; we can't anyway, as this raises an exception:

  node1# gnt-job archive foo
  Unhandled protocol error while talking to the master daemon:
  Caught exception: cannot create weak reference to 'NoneType' object

After fixing this, job archival of missing jobs will just continue silently, so we modify gnt-job archive to log jobs which were not archived and to return exit code 1 for any missing jobs.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
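The resulting client-side behaviour could look roughly like this sketch (archive_jobs, load_job and archive are invented stand-ins for the gnt-job/luxi calls):

  def archive_jobs(queue, job_ids):
    """Archive jobs, logging and reporting the ones that are missing."""
    missing = []
    for job_id in job_ids:
      job = queue.load_job(job_id)  # may return None for no-such-job
      if job is None:
        missing.append(job_id)      # never add None to the job list
        continue
      queue.archive(job)
    for job_id in missing:
      print("Job %s was not found, not archived" % job_id)
    return 1 if missing else 0      # exit code 1 for any missing job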
- Jul 29, 2010
Iustin Pop authored
Currently, if a job execution raises a Ganeti-specific error (i.e. a subclass of GenericError), then we encode it as (error class, [error args]). This matches the RAPI documentation. However, if we get a non-Ganeti error, then we encode it simply as str(err), a single string. This means the opresult field does not match the RAPI docs, and thus it's hard to reliably parse the job results. This patch changes the encoding of a failed job (via failure) to always be an OpExecError, so that we always encode it properly. For the command line interface, the behaviour is the same, as any non-Ganeti errors get re-encoded as OpExecError anyway. For RAPI clients, it only means that we always present the same type for results. The actual error value is the same, since err.args is either way str(original_error); compare the original (doesn't contain the ValueError):

  "opresult": [
    "invalid literal for int(): aa"
  ],

with:

  "opresult": [
    [
      "OpExecError",
      [
        "invalid literal for int(): aa"
      ]
    ]
  ],

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
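The encoding rule can be demonstrated with a small self-contained sketch; encode_failure is a made-up name, and the class stubs stand in for Ganeti's errors module:

  class GenericError(Exception):
    """Stand-in for the base class of all Ganeti errors."""

  class OpExecError(GenericError):
    """Stand-in for Ganeti's opcode execution error."""

  def encode_failure(err):
    """Encode any failure as (error class, [error args])."""
    if not isinstance(err, GenericError):
      # Re-encode foreign exceptions so RAPI clients always get the
      # same shape; err.args ends up as [str(original_error)].
      err = OpExecError(str(err))
    return (err.__class__.__name__, list(err.args))

  print(encode_failure(ValueError("invalid literal for int(): aa")))
  # ('OpExecError', ['invalid literal for int(): aa'])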
-
Michael Hanselmann authored
By changing it to a normal parameter, which must be a sequence, we can start using keyword parameters. Before this patch all arguments to “AddTask(self, *args)” were passed as arguments to the worker's “RunTask” method. Priorities, which should be optional and will be implemented in a future patch, must be passed as a keyword parameter. This means “*args” can no longer be used, as one can't combine *args and keyword parameters in a clean way:

  >>> def f(name=None, *args):
  ...   print "%r, %r" % (args, name)
  ...
  >>> f("p1", "p2", "p3", name="thename")
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  TypeError: f() got multiple values for keyword argument 'name'

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
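A minimal sketch of the resulting interface, assuming a simplified worker pool (this AddTask is a toy version, not Ganeti's):

  class WorkerPool(object):
    def __init__(self):
      self._tasks = []

    def AddTask(self, args, priority=0):
      """Task arguments are one sequence, so keyword parameters such
      as an optional priority no longer clash with them."""
      self._tasks.append((priority, tuple(args)))

  pool = WorkerPool()
  pool.AddTask(("p1", "p2", "p3"), priority=10)  # unambiguous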
- Jul 16, 2010
Iustin Pop authored
This patch adds lock names to SharedLocks and LockSets, which can later be used for displaying the actual locks being held/used in places where we only have the lock, and not the entire context of the locking operation. Since I realized that the production code doesn't call LockSet with the proper members= syntax, but directly with positional parameters, I've converted this (and the arguments to GlobalLockManager) into positional arguments.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
- Jul 15, 2010
Michael Hanselmann authored
By splitting the _WaitForJobChangesHelper class into multiple smaller classes, we gain in several places:
- Simpler code, less interaction between functions and variables
- Easy to unittest (close to 100% coverage)
- Waiting for job changes no longer has direct knowledge of the queue (it doesn't reference queue functions anymore, especially not private ones)
- Activate inotify only if there was no change at the beginning (and check again right away to avoid race conditions)
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
- Jul 12, 2010
Michael Hanselmann authored
Since the code waiting for job changes was modified to use inotify, a race condition between checking for changes the first time and setting up inotify occurs. If the job is modified after the check but before inotify is active, changes would only be noticed after the timeout (29 seconds in most cases) expired.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
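The fix amounts to the classic check/arm/re-check pattern; here is a hedged sketch with all collaborators passed in as callables (names invented):

  def wait_for_change(load_job, previous, setup_inotify, wait, timeout):
    """Check, arm inotify, then check again to close the race."""
    job = load_job()
    if job != previous:
      return job             # changed before we started watching
    notifier = setup_inotify()
    job = load_job()         # re-check: the job may have changed
    if job != previous:      # while inotify was not yet active
      return job
    wait(notifier, timeout)  # any later change now wakes us up
    return load_job()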
- Jul 09, 2010
Manuel Franceschini authored
This patch moves network utility functions to a dedicated module.
Signed-off-by: Manuel Franceschini <livewire@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
- Jul 06, 2010
Iustin Pop authored
With the recent changes in the job queue, an old bug surfaced: we never serialized the status change when in NotifyStart, thus a crash of the master would have left the job queue oblivious to the fact that the job was actually running. In the previous implementation, queries against the job status were using the in-memory object, so they 'saw' and correctly reported the running status. But the new implementation just looks at the on-disk version, and thus didn't see this transition. The patch also moves NotifyStart to a decorator-based version (like the other functions), which generates a lot of churn in the diff, sorry.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
- Jun 28, 2010
Guido Trotter authored
By using ssynchronized in the new way, we can remove the module-global _big_jqueue_lock and revert to an internal _lock inside the jqueue.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
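For illustration, a decorator in the spirit of ssynchronized (this sketch uses a plain threading.Lock and ignores shared acquisition, which the real SharedLock supports):

  import functools
  import threading

  def ssynchronized(lock, shared=False):
    """Hold the given lock around each call of the wrapped function."""
    def decorator(fn):
      @functools.wraps(fn)
      def wrapper(*args, **kwargs):
        lock.acquire()  # the real version would honour 'shared' here
        try:
          return fn(*args, **kwargs)
        finally:
          lock.release()
      return wrapper
    return decorator

  _lock = threading.Lock()

  @ssynchronized(_lock)
  def UpdateJob(job):
    pass  # runs with _lock held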
-
Guido Trotter authored
We can share the jqueue lock when we do per-job updates. These only conflict with updates/checks on the same job from another thread (e.g. CancelJob, ArchiveJob, which keep the lock unshared, since they are less frequent).
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Guido Trotter authored
Move some code to a decorated function rather than explicitly acquiring/releasing the lock in AppendFeedback.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Guido Trotter authored
Remove the jqueue _lock member and convert to a _big_jqueue_lock sharedlock. This allows smooth transition from the old single lock to a more granular approach.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Guido Trotter authored
Every time we call MarkUnfinishedOps we do it in a try/finally block that updates the job file. With this patch we move the try/finally inside. CancelJobUnlocked is removed, because it just becomes a wrapper over MarkUnfinishedOps with two constant values.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
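Sketched with invented names (mark_unfinished_ops, update_job_file), the shape of the change is:

  def mark_unfinished_ops(job, status, result, update_job_file):
    """Finalize unfinished opcodes; the job file is always rewritten."""
    try:
      for op in job.ops:
        if op.status in ("queued", "waitlock", "running"):
          op.status = status
          op.result = result
    finally:
      update_job_file(job)  # previously repeated in every caller

  # Cancelling is then just a call with two constant values, which is
  # why a separate CancelJobUnlocked wrapper is no longer needed.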
-
Guido Trotter authored
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
- Jun 23, 2010
Guido Trotter authored
We don't need it anymore, since nobody waits on it.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Guido Trotter authored
As with QueryJobs, we rely on file updates rather than condition notification to detect job changes. In order to do that we use the pyinotify module to watch files. This might make the client a bit slower (pending planned improvements, such as subscription-based WaitForJobChanges) but detaches it from the job execution.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Guido Trotter authored
This is needed to convert waitforjobchanges to use inotify and the on-disk version and decouple it from the job queue lock. No replication to remote nodes is done, to keep the operation fast.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Guido Trotter authored
We move from querying the in-memory version to loading all jobs from disk. Since jobs are written/deleted on disk atomically, we don't need to lock at all. Also, since we're just looking at the contents of a directory, we don't need to check that the job queue is "open". If some jobs are removed between listing and loading them, we need to be able to cope: if we were asked to load those jobs specifically, we must report the failure, but if we were just asked to "load all", we simply don't consider them part of the "all" set, since they were deleted.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
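A simplified sketch of the "load all" case (the directory layout, file-name prefix and helper names are assumptions):

  import os

  def query_all_jobs(queue_dir, load_job_from_disk):
    """Load every job file currently on disk, without locking."""
    jobs = []
    for name in sorted(os.listdir(queue_dir)):
      if not name.startswith("job-"):
        continue  # ignore lock files, archives, etc.
      job = load_job_from_disk(os.path.join(queue_dir, name))
      if job is not None:  # deleted between listing and loading
        jobs.append(job)
    return jobs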
-
Guido Trotter authored
This will be used to read a job file without having to deal with exceptions from _LoadJobFromDisk.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Guido Trotter authored
Currently _LoadJobFromDisk archives job files it finds corrupted. Since we want to use it to load files without holding locks, this could cause a conflict: we just move the feature to _LoadJobUnlocked, which is always called with the lock held.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
- Jun 17, 2010
Guido Trotter authored
Rather than adding the jobs to the worker pool one at a time, we add them all together, which is slightly faster, and ensures they don't get started while we loop.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
-
Guido Trotter authored
Sometimes it's useful to write to the local filesystem, but immediate replication to all master candidates is not needed. The _WriteAndReplicateFileUnlocked function gets renamed to _UpdateJobQueueFile, as calling "write and replicate, but don't replicate" seemed a bit strange.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
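A hedged sketch of a write helper with optional replication (os.replace for the atomic rename and node.upload_file are stand-ins for Ganeti's actual file and RPC layers):

  import os

  def update_job_queue_file(path, data, replicate, candidates=()):
    """Write a queue file locally, optionally pushing it to nodes."""
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
      fh.write(data)
    os.replace(tmp, path)  # atomic replacement on POSIX
    if replicate:
      for node in candidates:  # master candidates only
        node.upload_file(path, data)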
-
Guido Trotter authored
The job queue currently has a static _GetJobInfoUnlocked method. Change it to a normal method of _QueuedJob, which makes more sense.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
-
Guido Trotter authored
Move the work from _LoadJobUnlocked to _LoadJobFileFromDisk, which can then be used in other contexts as well. Also, if we fail to deserialize the job, archive it as well (before, we archived it only if we failed to create the related object, but kept it there if deserialization failed).
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
- Jun 15, 2010
Guido Trotter authored
Among all users, it turns out just one *may* need the output to be sorted; all the others can cope without it.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Guido Trotter authored
In some places we do try/del/except and in others just pop. Using pop everywhere saves lines of code.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
- Jun 11, 2010
Guido Trotter authored
Currently, each time we submit a job we check the job queue size and the drained file. With this change we keep these pieces of information in memory and don't read them from the filesystem each time. Significant changes include:
- The drained value can only be properly set by calling the appropriate cluster command ("gnt-cluster queue drain/undrain") and not by removing/creating the file in the job queue directory. Not that anybody would have done it in this undocumented way before.
- We get rid of the soft limit for the job queue, which we haven't ever used anyway.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
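In outline, the in-memory bookkeeping might look like this (class and method names invented; note the single hard limit, since the soft limit is gone):

  class JobQueueState(object):
    """Queue size and drained flag kept in memory instead of being
    re-read from the filesystem on every submission."""

    def __init__(self, size, drained, max_size):
      self._size = size          # counted once at startup
      self._drained = drained    # read once from the drain file
      self._max_size = max_size  # single hard limit, no soft limit

    def SetDrained(self, drained):
      # Flipped only via "gnt-cluster queue drain/undrain", which also
      # persists the flag for the next masterd start.
      self._drained = drained

    def CheckSubmit(self):
      if self._drained:
        raise RuntimeError("Job queue is drained")
      if self._size >= self._max_size:
        raise RuntimeError("Job queue is full")
      self._size += 1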
-
Guido Trotter authored
The name clarifies the difference between this and the internal lock. Also explain a bit better what it is.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>