- 01 Jun, 2010 1 commit
-
-
Iustin Pop authored
Since the current start_timestamp opcode attribute refers to the initial start time, before locks are acquired, it cannot be used to determine the actual execution order of two opcodes/jobs competing for the same lock. This patch adds a new field, exec_timestamp, that is updated when the opcode moves from OP_STATUS_WAITLOCK to OP_STATUS_RUNNING, thus giving a clear view of the execution history. The new field is visible in the job output via the 'opexec' field. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Guido Trotter <ultrotter@google.com>
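
A minimal sketch of the two timestamps, not the actual jqueue.py code. The class, its method names and the status string values are assumptions made for illustration; only the names start_timestamp, exec_timestamp, OP_STATUS_WAITLOCK and OP_STATUS_RUNNING come from the commit message.

```python
import time

# Illustrative stand-ins for the real status constants (values assumed).
OP_STATUS_QUEUED = "queued"
OP_STATUS_WAITLOCK = "waiting"
OP_STATUS_RUNNING = "running"


class QueuedOpCodeSketch(object):
    """Hypothetical, simplified opcode wrapper."""

    def __init__(self):
        self.status = OP_STATUS_QUEUED
        self.start_timestamp = None  # set before the locks are acquired
        self.exec_timestamp = None   # set once the opcode really runs

    def mark_waiting_for_locks(self, now=None):
        self.status = OP_STATUS_WAITLOCK
        self.start_timestamp = time.time() if now is None else now

    def mark_running(self, now=None):
        # Only the WAITLOCK -> RUNNING transition updates exec_timestamp.
        if self.status != OP_STATUS_WAITLOCK:
            raise ValueError("opcode is not waiting for locks")
        self.status = OP_STATUS_RUNNING
        self.exec_timestamp = time.time() if now is None else now
```

With two opcodes competing for one lock, the one that started first may still execute second; comparing exec_timestamp values shows the real order, while start_timestamp does not.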
-
- 08 Mar, 2010 2 commits
-
-
Iustin Pop authored
This should remove most of the remaining constructs which can be replaced by PathJoin. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
-
Iustin Pop authored
This passes a full burnin with lots of instances, and should be safe as we mostly join a known root (various constants) to a run-time variable. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
-
- 13 Jan, 2010 4 commits
-
-
Michael Hanselmann authored
When the queue was empty, the calculation for unchecked jobs while archiving would return -1: ``last_touched`` is set to 0 and the job ID list (``all_job_ids``) is empty, so ``len(all_job_ids) - last_touched - 1`` evaluates to -1. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
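
The arithmetic can be reproduced in a few lines. The commit message does not say how the fix was made, so clamping the result at zero is an assumption, and the function name is hypothetical.

```python
def count_unchecked_jobs(all_job_ids, last_touched):
    """Return how many job IDs were not examined while archiving.

    last_touched is the index of the last examined job; it stays 0 when
    the queue is empty.
    """
    # Formula quoted in the commit message; gives -1 for an empty queue.
    unchecked = len(all_job_ids) - last_touched - 1
    # Assumed fix: never report a negative count.
    return max(0, unchecked)
```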
-
Michael Hanselmann authored
Before it would log something like “starting task (<ganeti.http.client._HttpClientPendingRequest object at 0x2aaaad176790>,)”, which isn't really useful for debugging. Now it'll log “[…] <ganeti.http.client._HttpClientPendingRequest req=<ganeti.http.client.HttpClientRequest 172.24.x.y:1811 PUT /node_info at 0x2aaaaab7ed10> at 0x2aaaaab823d0>”. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
Having a proper name instead of just a number makes debugging easier. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
- 04 Jan, 2010 4 commits
-
-
Iustin Pop authored
Many of our functions have to follow a given API, and thus we have to keep a given signature, but pylint doesn't understand this. Therefore, we silence this warning. The patch does a few other cleanups. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Olivier Tharan <olive@google.com>
-
Iustin Pop authored
Currently only the rpc call, but not its description (which also shows the argument) is logged. We change this to log failmsg too, and this also silences a warning. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Olivier Tharan <olive@google.com>
-
Iustin Pop authored
Many methods are simple pure functions that do not depend on the object state. We convert these to staticmethods. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Olivier Tharan <olive@google.com>
-
Iustin Pop authored
This patch should have only:
  - pylint disables
  - docstring changes
  - whitespace changes

Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Olivier Tharan <olive@google.com>
-
- 28 Dec, 2009 1 commit
-
-
Iustin Pop authored
This cherry-picks the utils.FieldSet.Matches changes and the significant jqueue.py change. These are stable in the 2.1 branch and therefore make sense to backport to 2.0 (they are basically cleanups). Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Olivier Tharan <olive@google.com>
-
- 25 Nov, 2009 1 commit
-
-
Iustin Pop authored
This patch removes the quotes from CommaJoin and converts most of the callers (that I could find) to it. Since CommaJoin does str(i) for i in param, callers no longer need to convert the items themselves, which slightly simplifies a few calls. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
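
A sketch of a helper with the behaviour described above. It is a hypothetical stand-in written from the commit message alone, not the body of utils.CommaJoin.

```python
def comma_join(items):
    """Join items with ", ", converting each one to a string first."""
    return ", ".join(str(i) for i in items)
```

Because the conversion happens inside the helper, a caller can pass a list of integers or other objects directly.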
-
- 06 Nov, 2009 2 commits
-
-
Guido Trotter authored
When the processor is executing a job, it can export the execution id to its callers. This is not supported for Queries, as they're not executed in a job. Signed-off-by:
Guido Trotter <ultrotter@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
Iustin Pop authored
This patch adds some silences and tweaks the code slightly so that “pylint --rcfile pylintrc -e ganeti” doesn't give any errors. The biggest change is in jqueue.py, the move of _RequireOpenQueue out of the JobQueue class. Since that is actually a function and not a method (never used as such) this makes sense, and also silences two pylint errors. Another real code change is in utils.py, where FieldSet.Matches will return None instead of False for failure; this still works with the way this class/method is used, and makes more sense (it resembles more closely the re.match return values). Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Guido Trotter <ultrotter@google.com>
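
The FieldSet.Matches convention can be illustrated as follows. This is an assumed, simplified class (FieldSetSketch and its anchored-regex handling are not taken from utils.py); it only shows why returning None on failure mirrors re.match and still works in boolean tests.

```python
import re


class FieldSetSketch(object):
    """Hypothetical field matcher: returns a match object or None."""

    def __init__(self, *patterns):
        # Anchor each pattern so that it must match the whole field name.
        self._regexes = [re.compile("^%s$" % p) for p in patterns]

    def matches(self, field):
        for regex in self._regexes:
            m = regex.match(field)
            if m:
                return m
        # Like re.match, signal failure with None rather than False.
        return None
```

Callers that only test truthiness behave the same with None as with False, while callers that need captured groups can use the returned match object.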
-
- 03 Nov, 2009 1 commit
-
-
Michael Hanselmann authored
Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
- 12 Oct, 2009 1 commit
-
-
Michael Hanselmann authored
Found using pylint and epydoc. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Guido Trotter <ultrotter@google.com>
-
- 25 Sep, 2009 1 commit
-
-
Iustin Pop authored
Currently, the actual exception raised during an LU execution (one of OpPrereqError, OpExecError, HooksError, etc.) is lost because the jqueue.py code simply sets that to a str(err), and the code in cli.py simply passes that string to OpExecError. This patch moves to encoding the errors as per errors.EncodeError and changes the cli code to parse and raise that (if possible). Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com> (cherry picked from commit bcb66fca)
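
A rough sketch of the encode/decode round trip. The exception class names come from the commit message; the function names, the (class name, args) wire format and the fallback behaviour are assumptions for illustration and are not claimed to match errors.EncodeError or the cli.py code.

```python
class GenericError(Exception):
    """Base class for the illustrative exceptions below."""


class OpPrereqError(GenericError):
    pass


class OpExecError(GenericError):
    pass


class HooksError(GenericError):
    pass


_KNOWN_ERRORS = dict((cls.__name__, cls)
                     for cls in (OpPrereqError, OpExecError, HooksError))


def encode_error(err):
    """Turn an exception into a serializable (class name, args) pair."""
    return (err.__class__.__name__, list(err.args))


def decode_error(encoded):
    """Rebuild the original exception, or fall back to OpExecError."""
    try:
        name, args = encoded
        return _KNOWN_ERRORS[name](*args)
    except (TypeError, ValueError, KeyError):
        # Not in the expected format: keep the old string behaviour.
        return OpExecError(str(encoded))
```

The client side can then raise the decoded exception, so an OpPrereqError on the master is seen as an OpPrereqError by the command-line tool instead of a generic OpExecError carrying a string.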
-
- 17 Sep, 2009 1 commit
-
-
Michael Hanselmann authored
Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
- 15 Sep, 2009 4 commits
-
-
Michael Hanselmann authored
Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
This can be useful for debugging locking problems. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
There are two major arguments for this:
  - There will be more callbacks (e.g. for lock debugging) and extending the parameter list is a lot of work.
  - In the jqueue module this allows us to keep per-job or per-opcode variables in a separate class. Instead of having to clean up the worker class after processing one job, these references will automatically go out of scope.

Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
- 07 Sep, 2009 1 commit
-
-
Iustin Pop authored
Currently, on multi-job submits we simply iterate over the single-job-submit function. This means we grab a new serial, write and replicate (and wait for the remote nodes to ack) the serial file, and only then create the job file; this is repeated N times, once for each job. Since job identifiers are ‘cheap’, it's simpler to grab a block of new IDs at the start, write and replicate the serial count file a single time, and then proceed with the jobs as before. This is a cheap change that reduces I/O and slightly reduces the CPU consumption of the master daemon: submit time seems to be cut in half for big batches of jobs, and the masterd CPU time drops by between 15% and 50% (I can't get consistent numbers). Note that this doesn't change anything for single-job submits, and most probably not for submits of fewer than 5 jobs either. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
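
The block allocation described above can be sketched like this. The class and method names are hypothetical, and the counter stands in for the real write-and-replicate step on the serial file.

```python
class SerialAllocatorSketch(object):
    """Hypothetical job ID allocator that counts serial-file writes."""

    def __init__(self, last_serial=0):
        self._last_serial = last_serial
        self.writes = 0  # number of (simulated) write+replicate rounds

    def _write_and_replicate(self):
        # Stand-in for writing the serial file and waiting for node acks.
        self.writes += 1

    def new_serials(self, count):
        """Reserve a block of count IDs with a single serial-file update."""
        if count <= 0:
            raise ValueError("count must be positive")
        first = self._last_serial + 1
        self._last_serial += count
        self._write_and_replicate()
        return list(range(first, first + count))

    def new_serial(self):
        """Single-job path: unchanged, still one update per ID."""
        return self.new_serials(1)[0]
```

Submitting N jobs then costs one serial-file update instead of N, which is where the I/O saving comes from; the job files themselves are still written one by one.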
-
- 03 Sep, 2009 1 commit
-
-
Michael Hanselmann authored
This survived QA, burnin and unittests. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Luca Bigliardi <shammash@google.com>
-
- 27 Aug, 2009 1 commit
-
-
Iustin Pop authored
Currently, the actual exception raised during an LU execution (one of OpPrereqError, OpExecError, HooksError, etc.) is lost because the jqueue.py code simply sets that to a str(err), and the code in cli.py simply passes that string to OpExecError. This patch moves to encoding the errors as per errors.EncodeError and changes the cli code to parse and raise that (if possible). Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
-
- 03 Aug, 2009 2 commits
-
-
Michael Hanselmann authored
When JobQueue.WaitForJobChange gets an invalid or no longer existing job ID it tries to return job_info and log_entries, both of which aren't defined yet. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
- 19 Jul, 2009 4 commits
-
-
Iustin Pop authored
Currently, unclean master daemon shutdown overwrites all of a job's opcode status and result with error/None. This is incorrect, since any already finished opcode(s) should have their status and result preserved, and only not-yet-processed opcodes should be marked as ‘error’. Cancelling jobs between opcodes does the same (but this is not allowed currently by the code, so it's not as important as unclean shutdown). This patch adds a new _QueuedJob function that only overwrites the status and result of non-finalized opcodes, which is then used in job queue init and in the cancel job functions. The patch also adds some comments and a new set of constants in constants.py highlighting the finalized vs. non-finalized opcode statuses. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Guido Trotter <ultrotter@google.com>
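
A sketch of the intended behaviour, in which finished opcodes keep their status and result. The function name, the opcode class, the status string values and the exact contents of the finalized set are assumptions, not the code added to jqueue.py or constants.py.

```python
OP_STATUS_QUEUED = "queued"
OP_STATUS_RUNNING = "running"
OP_STATUS_CANCELED = "canceled"
OP_STATUS_SUCCESS = "success"
OP_STATUS_ERROR = "error"

# Assumed set of statuses that mean "this opcode is done".
OPS_FINALIZED = frozenset([OP_STATUS_CANCELED,
                           OP_STATUS_SUCCESS,
                           OP_STATUS_ERROR])


class OpSketch(object):
    def __init__(self, status, result=None):
        self.status = status
        self.result = result


def mark_unfinished_ops(ops, status, result):
    """Overwrite status/result only for opcodes that are not finalized."""
    for op in ops:
        if op.status in OPS_FINALIZED:
            continue  # keep what the opcode already achieved
        op.status = status
        op.result = result
```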
-
Iustin Pop authored
As a workaround for the job submit timeouts that we have, this patch adds a new luxi call for multi-job submit; the advantage is that all the jobs are added to the queue first, and only afterwards can the workers start processing them. This is definitely faster than per-job submit, where the submission of new jobs competes with the workers processing jobs.

On a pure no-op OpDelay opcode (not on master, not on nodes), we have:
  - 100 jobs:
    - individual: submit time ~21s, processing time ~21s
    - multiple: submit time 7-9s, processing time ~22s
  - 250 jobs:
    - individual: submit time ~56s, processing time ~57s (run 2: ~54s, ~55s)
    - multiple: submit time ~20s, processing time ~51s (run 2: ~17s, ~52s)

This shows that we indeed gain on the client side, and maybe even on the total processing time for a high number of jobs. For just 10 or so I expect the difference to be just noise.

This will probably require increasing the timeout a little when submitting many jobs: 250 jobs at ~20 seconds is close to the current rw timeout of 60s. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Guido Trotter <ultrotter@google.com> (cherry picked from commit 2971c913)
-
Iustin Pop authored
If a job with more than one opcode is being processed, and the master daemon crashes between two opcodes, we have the first N opcodes marked successful, and the rest marked as queued. This means that the overall job status is queued, and thus on master daemon restart it will be resent for completion. However, the RunTask() function in jqueue.py doesn't deal with partially-completed jobs. This patch makes it simply skip the already-completed opcodes. An alternative option would be to not mark partially-completed jobs as QUEUED but instead RUNNING, which would result in aborting of the job at restart time. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Guido Trotter <ultrotter@google.com>
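
The skipping logic might look roughly like the following. This is an illustrative loop over plain dictionaries with assumed status values, not the RunTask() implementation.

```python
OP_STATUS_QUEUED = "queued"
OP_STATUS_SUCCESS = "success"


def run_remaining_opcodes(ops, execute):
    """Run only the opcodes that have not already succeeded.

    Returns the indices of the opcodes that were actually executed.
    """
    executed = []
    for idx, op in enumerate(ops):
        if op["status"] == OP_STATUS_SUCCESS:
            # Completed before the crash: keep its result, do not rerun.
            continue
        op["result"] = execute(op["input"])
        op["status"] = OP_STATUS_SUCCESS
        executed.append(idx)
    return executed
```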
-
Iustin Pop authored
In case the job fails, we try to set the job's run_op_idx to -1. However, this is the wrong attribute name, which wasn't detected until the __slots__ addition. The correct attribute is run_op_index. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Guido Trotter <ultrotter@google.com>
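
Why __slots__ exposed the typo can be shown with two toy classes. The class names are made up; the attribute names are the ones from the commit message.

```python
class WithoutSlots(object):
    def __init__(self):
        self.run_op_index = 0


class WithSlots(object):
    # Only the listed attribute names may be assigned on instances.
    __slots__ = ["run_op_index"]

    def __init__(self):
        self.run_op_index = 0


loose = WithoutSlots()
loose.run_op_idx = -1  # typo silently creates a new, unused attribute

strict = WithSlots()
try:
    strict.run_op_idx = -1  # same typo is now rejected
    typo_detected = False
except AttributeError:
    typo_detected = True
```

Without __slots__ the misspelled assignment succeeds and run_op_index keeps its old value; with __slots__ the assignment raises AttributeError, which is how the bug surfaced.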
-
- 17 Jul, 2009 1 commit
-
-
Iustin Pop authored
Adding slots to _QueuedOpCode decreases memory usage (of these objects) by roughly four times. It is a lesser change for _QueuedJobs. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
-
- 07 Jul, 2009 1 commit
-
-
Michael Hanselmann authored
Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
- 15 Jun, 2009 1 commit
-
-
Iustin Pop authored
This patch converts the job queue rpc calls to the new style result. It's done in a single patch as there are helper functions (in both jqueue and backend) that are used by multiple rpcs and need to be changed in sync. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Guido Trotter <ultrotter@google.com>
-
- 21 May, 2009 1 commit
-
-
Iustin Pop authored
As a workaround for the job submit timeouts that we have, this patch adds a new luxi call for multi-job submit; the advantage is that all the jobs are added to the queue first, and only afterwards can the workers start processing them. This is definitely faster than per-job submit, where the submission of new jobs competes with the workers processing jobs.

On a pure no-op OpDelay opcode (not on master, not on nodes), we have:
  - 100 jobs:
    - individual: submit time ~21s, processing time ~21s
    - multiple: submit time 7-9s, processing time ~22s
  - 250 jobs:
    - individual: submit time ~56s, processing time ~57s (run 2: ~54s, ~55s)
    - multiple: submit time ~20s, processing time ~51s (run 2: ~17s, ~52s)

This shows that we indeed gain on the client side, and maybe even on the total processing time for a high number of jobs. For just 10 or so I expect the difference to be just noise.

This will probably require increasing the timeout a little when submitting many jobs: 250 jobs at ~20 seconds is close to the current rw timeout of 60s. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Guido Trotter <ultrotter@google.com>
-
- 12 Feb, 2009 1 commit
-
-
Iustin Pop authored
Currently we only log "Error in opcode ...", but we don't log the error itself. This is not good for debugging. Reviewed-by: ultrotter
-
- 28 Jan, 2009 1 commit
-
-
Iustin Pop authored
This patch fixes two issues with the cancel mechanism:
  - cancelled jobs show as such, and not in error state (we mark them as OP_STATUS_CANCELED and not OP_STATUS_ERROR)
  - queued jobs which are cancelled don't raise errors in the master (we now handle OP_STATUS_CANCELED)

Reviewed-by: imsnah
-
- 27 Jan, 2009 1 commit
-
-
Iustin Pop authored
This is a simple typo from the conversion to multi-job archiving. Reviewed-by: imsnah
-
- 20 Jan, 2009 1 commit
-
-
Iustin Pop authored
(this is related to the master daemon log) Currently it's not possible to follow (in the non-debug runs) the logical execution thread of jobs. This is due to the fact that we don't log the thread name (so we lose the association of log messages to jobs) and we don't log the start/stop of job and opcode execution. This patch adds a new parameter to utils.SetupLogging that enables thread name logging, and promotes some log entries from debug to info. With this applied, it's easier to understand which log messages relate to which jobs/opcodes. The patch also moves the "INFO client closed connection" entry to debug level, since it's not a very informative log entry. Reviewed-by: ultrotter
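
A sketch of thread-name logging using only the standard logging module. The setup_logging function, its with_thread_name parameter and the thread and logger names are hypothetical; they illustrate the idea and are not the signature of utils.SetupLogging.

```python
import io
import logging
import threading


def setup_logging(logger, stream, with_thread_name=False):
    """Attach a stream handler, optionally tagging records by thread."""
    if with_thread_name:
        fmt = "%(asctime)s %(threadName)s %(levelname)s %(message)s"
    else:
        fmt = "%(asctime)s %(levelname)s %(message)s"
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(fmt))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the example output in one place


stream = io.StringIO()
log = logging.getLogger("masterd-sketch")
setup_logging(log, stream, with_thread_name=True)

# Job start/stop is logged at INFO so it shows up in non-debug runs.
worker = threading.Thread(target=lambda: log.info("job 42: starting"),
                          name="JobQueue1")
worker.start()
worker.join()
output = stream.getvalue()
```

Each line in the output then carries the worker thread's name, so messages from jobs running in parallel can be told apart.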
-