- Oct 16, 2008
-
Iustin Pop authored
Currently, if loading a job fails, the job queue code raises an exception and prevents the proper processing of the jobs in the queue. We change this so that unparseable jobs are instead archived (if not already). Reviewed-by: imsnah
-
Iustin Pop authored
This adds the set/reset in the jqueue and luxi modules, a way to query it in OpQueryConfigValues, and the command line interface for it:

  $ gnt-cluster queue info
  The drain flag is unset
  $ gnt-cluster queue drain
  $ gnt-cluster queue info
  The drain flag is set
  $ gnt-cluster queue undrain
  $ gnt-cluster queue info
  The drain flag is unset

The setting is done via luxi rather than an opcode because opcodes can't be executed while the queue is drained; the querying, however, is not done via luxi, since in the future this might become a cluster property as opposed to a node one. Reviewed-by: imsnah
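In rough Python, the client side of this might look like the sketch below; the transport and the call names (SetDrainFlag, QueryConfigValues) are stand-ins following the description above, not the actual luxi API:

  # Minimal sketch, assuming a luxi-like call interface; FakeTransport
  # stands in for the real unix-socket RPC layer.
  class FakeTransport:
    def __init__(self):
      self.drained = False
    def call(self, method, args):
      if method == "SetDrainFlag":
        self.drained = args[0]
        return True
      if method == "QueryConfigValues":
        return {"drain_flag": self.drained}
      raise NotImplementedError(method)

  class QueueClient:
    def __init__(self, transport):
      self.transport = transport
    def SetQueueDrainFlag(self, drain):
      # setting goes through a dedicated call, not an opcode, since
      # opcodes cannot execute while the queue is drained
      return self.transport.call("SetDrainFlag", [bool(drain)])
    def QueryDrainFlag(self):
      # querying goes through a config-values style query
      return self.transport.call("QueryConfigValues", [["drain_flag"]])

  client = QueueClient(FakeTransport())
  client.SetQueueDrainFlag(True)
  print(client.QueryDrainFlag()["drain_flag"])  # True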
-
- Oct 15, 2008
-
Iustin Pop authored
We add a (per-node) queue drain flag that blocks new job submission. There is not yet an interface to add/remove the flag (that will come in the next patches). Reviewed-by: imsnah
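As a sketch of the intended behaviour (the flag file path, the exception and the AddJob helper are assumptions for illustration, not the real implementation), submission could check the flag like this:

  import os

  DRAIN_FILE = "/var/lib/queue/drain"   # hypothetical flag location

  class QueueDrainedError(Exception):
    pass

  def SubmitJob(queue, ops):
    # refuse new submissions while the per-node drain flag is set
    if os.path.exists(DRAIN_FILE):
      raise QueueDrainedError("queue is drained, not accepting jobs")
    return queue.AddJob(ops)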
-
- Oct 10, 2008
-
Iustin Pop authored
This big patch changes the call model used in inter-node RPC from standalone function calls in the rpc module to methods on an RpcRunner class that holds all of them. This can be used in the future to enable smarter processing in the RPC layer itself (a quick example: setting the DiskID once per RPC call instead of from cmdlib code). A few RPC calls are made outside of the LU code, and these are left as staticmethods, so they can be used without a class instance (constructing one requires a ConfigWriter instance). Reviewed-by: imsnah
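Schematically (names and payloads assumed for illustration), the change has this shape:

  # A stand-in for the real network call.
  def _do_call(node, method, payload):
    return (node, method, payload)

  class RpcRunner:
    def __init__(self, cfg):
      self._cfg = cfg  # e.g. a ConfigWriter instance

    def call_instance_start(self, node, instance):
      # per-call preparation (e.g. filling in disk IDs from the
      # config) can now happen once here, not in every cmdlib caller
      payload = {"instance": instance}
      return _do_call(node, "instance_start", payload)

    @staticmethod
    def call_node_ping(node):
      # static: usable without a class instance (and therefore
      # without a ConfigWriter), for the calls made outside LU code
      return _do_call(node, "node_ping", None)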
-
- Oct 07, 2008
-
Iustin Pop authored
Background: when we have multiple jobs in the queue (more than just a few), many of them (up to the number of threads) will be in state 'running', although many could actually be blocked waiting for some locks. This is not good, as one cannot easily see what is happening. The patch extends the possible opcode/job statuses with a new one, 'waiting', which shows that the LU is in the lock-acquisition phase. The mechanism is simple: the job queue initializes the opcode with OP_STATUS_WAITLOCK, and when the processor is ready to give control to the LU's Exec, it calls a notifier back into the _JobQueueWorker that sets the opcode status to OP_STATUS_RUNNING (with the proper queue locking). Because this mechanism does not save the job, all opcodes on disk will be in status WAITLOCK rather than RUNNING, so we also change the load sequence to treat WAITLOCK as RUNNING. With the patch applied, creating five instances in parallel (via burnin) on a five-node cluster shows that only two are executing while three are waiting for locks. Reviewed-by: imsnah
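A condensed sketch of the mechanism (the status constants mirror the description; the helper names are assumed):

  import threading

  OP_STATUS_WAITLOCK = "waiting"
  OP_STATUS_RUNNING = "running"

  class QueuedOpCode:
    def __init__(self):
      # the job queue initializes opcodes in the lock-waiting state
      self.status = OP_STATUS_WAITLOCK

  def MakeNotifier(queue_lock, op):
    # the processor calls the returned notifier right before handing
    # control to the LU's Exec
    def NotifyStart():
      with queue_lock:  # proper queue locking around the update
        op.status = OP_STATUS_RUNNING
    return NotifyStart

  def LoadStatus(status):
    # jobs are not re-saved on the transition, so on-disk WAITLOCK
    # must be read back as RUNNING
    return OP_STATUS_RUNNING if status == OP_STATUS_WAITLOCK else status

  lock = threading.Lock()
  op = QueuedOpCode()
  MakeNotifier(lock, op)()           # processor about to call Exec
  assert op.status == OP_STATUS_RUNNING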
-
- Oct 06, 2008
-
Iustin Pop authored
This patch adds a new luxi call that implements auto-archiving of jobs older than a certain age (or -1 for all completed jobs), and the gnt-job command that makes use of this (with 'all' for -1). Reviewed-by: imsnah
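The age check reduces to something like this (field and status names assumed):

  import time

  FINISHED = frozenset(["success", "error", "canceled"])

  def ShouldArchive(status, end_timestamp, age, now=None):
    # age == -1 means "all completed jobs"; otherwise archive jobs
    # that finished more than `age` seconds ago
    if status not in FINISHED:
      return False
    if age == -1:
      return True
    if now is None:
      now = time.time()
    return end_timestamp + age < now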
-
Iustin Pop authored
Since our locks are not acquired in one neat batch, we can have jobs that are actually blocked on locks (parallel burnin shows this), so at the least we need to increase the number of threads above the usual number of jobs we could have in such a case. Reviewed-by: imsnah
-
- Sep 30, 2008
-
Iustin Pop authored
This patch adds start, stop, and received timestamps for jobs (and allows querying them), and also allows querying the opcode timestamps. Reviewed-by: imsnah
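The new fields could be pictured like this (the (seconds, microseconds) pair format is an assumption for the sketch):

  import time

  def TimeStampNow():
    t = time.time()
    return (int(t), int((t - int(t)) * 1000000))

  class QueuedJob:
    def __init__(self):
      self.received_timestamp = TimeStampNow()
      self.start_timestamp = None  # set when the first opcode starts
      self.end_timestamp = None    # set when the job finishes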
-
- Sep 29, 2008
-
Iustin Pop authored
This patch adds the job execution log to “gnt-job info” and also allows selecting it in “gnt-job list” (though there it's not very useful, as it's not easy to parse). It does this by adding a new field to the query-job call, named ‘oplog’. With this, one can get a very clear picture of the job. What remains to be added are timestamps for the start/stop of the processing of the job itself and of its opcodes. Reviewed-by: imsnah
-
Iustin Pop authored
It is not currently possible to show a summary of a job in the output of “gnt-job list”. The closest is listing the whole opcode(s), but that is too verbose. Also, the default output (id, status) is not very useful unless one looks for (and knows about) an exact job ID. The patch adds a “summary” description of a job, composed of the list of OP_IDs of the individual opcodes. Moreover, if an opcode has a ‘logical’ target in a certain opcode field (e.g. start instance has the instance name as the target), it is included in the formatting as well. It's easier to explain via a sample output:

  gnt-job list
  ID Status  Summary
  1  error   NODE_QUERY
  2  success NODE_ADD(gnta2)
  3  success CLUSTER_QUERY
  4  success NODE_REMOVE(gnta2.example.com)
  5  error   NODE_QUERY
  6  success NODE_ADD(gnta2)
  7  success NODE_QUERY
  8  success OS_DIAGNOSE
  9  success INSTANCE_CREATE(instance1.example.com)
  10 success INSTANCE_REMOVE(instance1.example.com)
  11 error   INSTANCE_CREATE(instance1.example.com)
  12 success INSTANCE_CREATE(instance1.example.com)
  13 success INSTANCE_SHUTDOWN(instance1.example.com)
  14 success INSTANCE_ACTIVATE_DISKS(instance1.example.com)
  15 error   INSTANCE_CREATE(instance2.example.com)
  16 error   INSTANCE_CREATE(instance2.example.com)
  17 success INSTANCE_CREATE(instance2.example.com)
  18 success INSTANCE_ACTIVATE_DISKS(instance1.example.com)
  19 success INSTANCE_ACTIVATE_DISKS(instance2.example.com)
  20 success INSTANCE_SHUTDOWN(instance1.example.com)
  21 success INSTANCE_SHUTDOWN(instance2.example.com)

This is done via a simple change to the opcode classes, which allows an opcode to format itself. The additional function is small enough to live in opcodes.py, where it could also be used by a client if needed. Reviewed-by: imsnah
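The self-formatting idea boils down to a few lines (the attribute names below follow the description but are assumptions):

  class OpCode:
    OP_ID = "OP_ABSTRACT"
    OP_DSC_FIELD = None  # name of the 'logical target' field, if any

    def Summary(self):
      summary = self.OP_ID[3:]  # drop the "OP_" prefix
      if self.OP_DSC_FIELD:
        summary += "(%s)" % getattr(self, self.OP_DSC_FIELD)
      return summary

  class OpCreateInstance(OpCode):
    OP_ID = "OP_INSTANCE_CREATE"
    OP_DSC_FIELD = "instance_name"
    def __init__(self, instance_name):
      self.instance_name = instance_name

  print(OpCreateInstance("instance1.example.com").Summary())
  # INSTANCE_CREATE(instance1.example.com)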
-
Iustin Pop authored
Unless we decide to change the job identifiers to integers, we should at least sort the list returned by _GetJobIDsUnlocked. Reviewed-by: imsnah
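Since the identifiers are strings, a plain sort would put "10" before "9"; one way to order them correctly as long as they stay strings:

  def SortJobIDs(job_ids):
    # sort string job IDs by their numeric value
    return sorted(job_ids, key=int)

  print(SortJobIDs(["10", "2", "1"]))  # ['1', '2', '10']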
-
- Sep 10, 2008
-
Michael Hanselmann authored
We haven't yet decided what exactly it should do with failed nodes. Reviewed-by: ultrotter
-
- Aug 29, 2008
-
Iustin Pop authored
This patch alters the WaitForJobChanges luxi RPC call to have a configurable timeout, so that the call behaves nicely with long jobs that produce no updates. We do this by adding a timeout parameter to the RPC call and returning a special constant when the timeout is reached without an update. The luxi client will repeatedly call WaitForJobChanges until it gets a real change. The timeout is hardcoded as half the RWTO value. The patch also removes an unused variable (new_state) from the WaitForJobChanges method. Reviewed-by: imsnah,ultrotter
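Client-side, the loop has roughly this shape (the constant names are assumptions based on the description):

  WFJC_TIMEOUT = 10            # e.g. half the RWTO value
  JOB_NOTCHANGED = "nochange"  # assumed marker for "timeout, no update"

  def WaitForRealChange(client, job_id, fields, prev_state):
    # keep calling until the server reports an actual change rather
    # than the timeout marker
    while True:
      result = client.WaitForJobChanges(job_id, fields, prev_state,
                                        WFJC_TIMEOUT)
      if result != JOB_NOTCHANGED:
        return result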
-
- Aug 27, 2008
-
Michael Hanselmann authored
A job should only exist once in memory. After the cache is cleaned, there can still be references to a job somewhere else, and if there are multiple instances, one can get updated while a function is waiting for changes on another. By using weakref.WeakValueDictionary, which automatically removes instances as soon as there are no strong references to them anymore, we can solve this problem. Reviewed-by: iustinp
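In miniature, the effect of the weak-valued cache (the class names here are illustrative):

  import weakref

  class JobCache:
    def __init__(self):
      self._memcache = weakref.WeakValueDictionary()

    def Get(self, job_id, loader):
      job = self._memcache.get(job_id)
      if job is None:
        job = loader(job_id)          # (re)load from disk
        self._memcache[job_id] = job  # dropped once unreferenced
      return job

  class Job:
    def __init__(self, job_id):
      self.id = job_id

  cache = JobCache()
  a = cache.Get(1, Job)
  b = cache.Get(1, Job)
  assert a is b  # one in-memory instance per job while referenced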
-
Michael Hanselmann authored
Reviewed-by: ultrotter
-
Michael Hanselmann authored
It can be confusing otherwise. Reviewed-by: ultrotter
-
Michael Hanselmann authored
This is a large patch, but I can't figure out how to split it without breaking stuff. The old way of getting messages by always fetching the last one didn't bring all messages to the client if they were added too fast, making commands like “gnt-cluster verify” less than useful. These changes introduce a serial number per log entry to keep track of which messages a client has already received. They also remove the per-opcode log lock to make reading log entries thread-safe. Reviewed-by: ultrotter
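A stripped-down model of the per-entry serials (the structure is assumed for illustration):

  class OpCodeLog:
    def __init__(self):
      self._serial = 0
      self._entries = []  # append-only list of (serial, message)

    def Append(self, message):
      self._serial += 1
      self._entries.append((self._serial, message))

    def EntriesAfter(self, last_seen):
      # clients pass the highest serial they have seen, so no entry
      # is skipped even when messages arrive faster than they poll
      return [e for e in self._entries if e[0] > last_seen]

  log = OpCodeLog()
  log.Append("checking nodes")
  log.Append("checking instances")
  print(log.EntriesAfter(1))  # [(2, 'checking instances')]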
-
- Aug 11, 2008
-
Michael Hanselmann authored
This way clients can react faster to status or message changes and don't have to poll anymore. Reviewed-by: ultrotter
-
Michael Hanselmann authored
See the comment in the patch. Reviewed-by: ultrotter
-
- Aug 08, 2008
-
Michael Hanselmann authored
Otherwise archived jobs might show up in the list again after a master failover. Reviewed-by: iustinp
-
Michael Hanselmann authored
This way we can do locking when both noded and masterd are running on the same machine, the latter holding an exclusive lock on the queue. Reviewed-by: iustinp
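On Unix this kind of cross-daemon locking can be done with flock; a minimal sketch (the lock file path is hypothetical):

  import fcntl

  def LockQueue(path="/var/lib/queue/lock", exclusive=True):
    # masterd would take an exclusive lock, noded a shared one
    fd = open(path, "w")
    mode = fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH
    fcntl.flock(fd, mode)  # blocks until the lock is granted
    return fd              # keep the handle open to hold the lock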
-
Michael Hanselmann authored
Reviewed-by: iustinp
-
- Aug 06, 2008
-
Michael Hanselmann authored
These functions will be used to notify the queue about newly added or removed nodes. Reviewed-by: iustinp
-
Michael Hanselmann authored
The job queue now maintains its own node list and updates it when nodes are added to or removed from the cluster. Reviewed-by: iustinp
-
Michael Hanselmann authored
The code makes sure not to include the master in the list. Reviewed-by: iustinp
-
- Aug 05, 2008
-
Michael Hanselmann authored
Newly added nodes are not yet taken care of. Queue locking on non-master nodes is not yet correct. Reviewed-by: iustinp
-
- Aug 04, 2008
-
Michael Hanselmann authored
Reviewed-by: iustinp
-
- Jul 31, 2008
-
Michael Hanselmann authored
This reduces code duplication. A later patch will modify the job queue a bit more and will need a change to this assert. The assertion is also removed from all class-internal functions. Reviewed-by: iustinp
-
Michael Hanselmann authored
The job queue will need access to the configuration, which is provided through the context object, to get a list of nodes. Reviewed-by: iustinp
-
- Jul 30, 2008
-
Iustin Pop authored
This is mostly:
- whitespace fixes (space at EOL in some files, not all; broken indentation; etc.)
- variable names overriding others (one is a real bug in there)
- too-long lines
- cleanup of most unused imports (not all)
Reviewed-by: ultrotter
-
Michael Hanselmann authored
We found several issues in the old job queue implementation. It had race conditions, deadlocks and other deficiencies. Short summary:
- _QueuedOpCode and _QueuedJob are now more or less data structures with a few utility functions; __Setup is gone.
- The DiskJobStorage and JobQueue classes were merged into one to reduce code complexity.
- One lock in JobQueue for almost everything; there's also a lock per opcode for log messages.
Reviewed-by: iustinp
-
- Jul 29, 2008
-
Michael Hanselmann authored
The passed parameters were not correct. Reviewed-by: iustinp, ultrotter
-
- Jul 28, 2008
-
Michael Hanselmann authored
Locking is not completely right due to a deadlock when the job calls UpdateJob after changing its status. Reviewed-by: ultrotter
-
Michael Hanselmann authored
Reviewed-by: ultrotter
-
- Jul 25, 2008
-
Michael Hanselmann authored
It might come in handy at some point and makes the code a bit easier to read. Reviewed-by: iustinp
-
- Jul 24, 2008
-
Michael Hanselmann authored
So far no error reporting to the client is done. Clients are not notified if a job doesn't exist or couldn't be archived because of its current status. The internal cache is always cleaned when the preconditions don't fail, to make sure that the actual disk status will be reread next time. Reviewed-by: iustinp
-
Michael Hanselmann authored
Reviewed-by: iustinp
-
- Jul 23, 2008
-
Michael Hanselmann authored
A later patch will add a memory based job storage class, hence this code is going into a separate class. It also changes the number format to always use at least 10 digits, allowing up to 9'999'999'999 jobs to be sorted without using a custom function. Reviewed-by: iustinp
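The padding trick in two lines: with at least 10 digits, lexicographic order matches numeric order, so plain sorted() suffices:

  def FormatJobID(job_id):
    # zero-pad to 10 digits so string sort equals numeric sort
    return "%010d" % job_id

  ids = [FormatJobID(n) for n in (9, 10, 100)]
  assert sorted(ids) == ids  # '0000000009' < '0000000010' < '0000000100'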
-
Michael Hanselmann authored
Reviewed-by: iustinp
-
Michael Hanselmann authored
The job ID is now a string, hence logging must use %s instead of %d. Reviewed-by: iustinp
-