- Oct 16, 2008
-
Iustin Pop authored
Currently, if loading a job fails, the job queue code raises an exception and prevents the proper processing of the jobs in the queue. We change this so that unparseable jobs are instead archived (if not already). Reviewed-by: imsnah
-
Iustin Pop authored
This adds the set/reset in the jqueue and luxi modules, a way to query it in OpQueryConfigValues, and the command line interface for it:

  $ gnt-cluster queue info
  The drain flag is unset
  $ gnt-cluster queue drain
  $ gnt-cluster queue info
  The drain flag is set
  $ gnt-cluster queue undrain
  $ gnt-cluster queue info
  The drain flag is unset

The setting is done via luxi rather than an opcode because opcodes can't be executed while the queue is drained; the querying, however, is not done via luxi, since in the future this might become a cluster property as opposed to a node one. Reviewed-by: imsnah
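In rough Python, the client side of this might look like the sketch below; the transport and the call names (SetDrainFlag, QueryConfigValues) are stand-ins following the description above, not the actual luxi API:

  # Minimal sketch, assuming a luxi-like call interface; FakeTransport
  # stands in for the real unix-socket RPC layer.
  class FakeTransport:
    def __init__(self):
      self.drained = False
    def call(self, method, args):
      if method == "SetDrainFlag":
        self.drained = args[0]
        return True
      if method == "QueryConfigValues":
        return {"drain_flag": self.drained}
      raise NotImplementedError(method)

  class QueueClient:
    def __init__(self, transport):
      self.transport = transport
    def SetQueueDrainFlag(self, drain):
      # setting goes through a dedicated call, not an opcode, since
      # opcodes cannot execute while the queue is drained
      return self.transport.call("SetDrainFlag", [bool(drain)])
    def QueryDrainFlag(self):
      # querying goes through a config-values style query
      return self.transport.call("QueryConfigValues", [["drain_flag"]])

  client = QueueClient(FakeTransport())
  client.SetQueueDrainFlag(True)
  print(client.QueryDrainFlag()["drain_flag"])  # True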
-
- Oct 15, 2008
-
Iustin Pop authored
We add a (per-node) queue drain flag that blocks new job submission. There is not yet an interface to add/remove the flag (that will come in the next patches). Reviewed-by: imsnah
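As a sketch of the intended behaviour (the flag file path, the exception and the AddJob helper are assumptions for illustration, not the real implementation), submission could check the flag like this:

  import os

  DRAIN_FILE = "/var/lib/queue/drain"   # hypothetical flag location

  class QueueDrainedError(Exception):
    pass

  def SubmitJob(queue, ops):
    # refuse new submissions while the per-node drain flag is set
    if os.path.exists(DRAIN_FILE):
      raise QueueDrainedError("queue is drained, not accepting jobs")
    return queue.AddJob(ops)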
-
- Oct 10, 2008
-
Iustin Pop authored
This big patch changes the call model used in inter-node RPC from standalone function calls in the rpc module to methods on an RpcRunner class that holds all of them. This can be used in the future to enable smarter processing in the RPC layer itself (a quick example: setting the DiskID once per RPC call instead of from cmdlib code). A few RPC calls are made outside of the LU code, and these are left as staticmethods, so they can be used without a class instance (constructing one requires a ConfigWriter instance). Reviewed-by: imsnah
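Schematically (names and payloads assumed for illustration), the change has this shape:

  # A stand-in for the real network call.
  def _do_call(node, method, payload):
    return (node, method, payload)

  class RpcRunner:
    def __init__(self, cfg):
      self._cfg = cfg  # e.g. a ConfigWriter instance

    def call_instance_start(self, node, instance):
      # per-call preparation (e.g. filling in disk IDs from the
      # config) can now happen once here, not in every cmdlib caller
      payload = {"instance": instance}
      return _do_call(node, "instance_start", payload)

    @staticmethod
    def call_node_ping(node):
      # static: usable without a class instance (and therefore
      # without a ConfigWriter), for the calls made outside LU code
      return _do_call(node, "node_ping", None)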
-
- Oct 07, 2008
-
Iustin Pop authored
Background: when we have multiple jobs in the queue (more than just a few), many of them (up to the number of threads) will be in state 'running', although many could actually be blocked waiting for some locks. This is not good, as one cannot easily see what is happening. The patch extends the possible opcode/job statuses with a new one, 'waiting', which shows that the LU is in the lock-acquisition phase. The mechanism is simple: the job queue initializes the opcode with OP_STATUS_WAITLOCK, and when the processor is ready to give control to the LU's Exec, it calls a notifier back into the _JobQueueWorker that sets the opcode status to OP_STATUS_RUNNING (with the proper queue locking). Because this mechanism does not save the job, all opcodes on disk will be in status WAITLOCK rather than RUNNING, so we also change the load sequence to treat WAITLOCK as RUNNING. With the patch applied, creating five instances in parallel (via burnin) on a five-node cluster shows that only two are executing while three are waiting for locks. Reviewed-by: imsnah
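A condensed sketch of the mechanism (the status constants mirror the description; the helper names are assumed):

  import threading

  OP_STATUS_WAITLOCK = "waiting"
  OP_STATUS_RUNNING = "running"

  class QueuedOpCode:
    def __init__(self):
      # the job queue initializes opcodes in the lock-waiting state
      self.status = OP_STATUS_WAITLOCK

  def MakeNotifier(queue_lock, op):
    # the processor calls the returned notifier right before handing
    # control to the LU's Exec
    def NotifyStart():
      with queue_lock:  # proper queue locking around the update
        op.status = OP_STATUS_RUNNING
    return NotifyStart

  def LoadStatus(status):
    # jobs are not re-saved on the transition, so on-disk WAITLOCK
    # must be read back as RUNNING
    return OP_STATUS_RUNNING if status == OP_STATUS_WAITLOCK else status

  lock = threading.Lock()
  op = QueuedOpCode()
  MakeNotifier(lock, op)()           # processor about to call Exec
  assert op.status == OP_STATUS_RUNNING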
-
- Oct 06, 2008
-
Iustin Pop authored
This patch adds a new luxi call that implements auto-archiving of jobs older than a certain age (or -1 for all completed jobs), and the gnt-job command that makes use of this (with 'all' for -1). Reviewed-by: imsnah
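The age check reduces to something like this (field and status names assumed):

  import time

  FINISHED = frozenset(["success", "error", "canceled"])

  def ShouldArchive(status, end_timestamp, age, now=None):
    # age == -1 means "all completed jobs"; otherwise archive jobs
    # that finished more than `age` seconds ago
    if status not in FINISHED:
      return False
    if age == -1:
      return True
    if now is None:
      now = time.time()
    return end_timestamp + age < now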
-
Iustin Pop authored
Since our locks are not acquired in one neat batch, we can have jobs that are actually blocked on locks (parallel burnin shows this), so at the least we need to increase the number of threads above the usual number of jobs we could have in such a case. Reviewed-by: imsnah
-
- Sep 30, 2008
-
Iustin Pop authored
This patch adds start, stop, and received timestamps for jobs (and allows querying them), and also allows querying the opcode timestamps. Reviewed-by: imsnah
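The new fields could be pictured like this (the (seconds, microseconds) pair format is an assumption for the sketch):

  import time

  def TimeStampNow():
    t = time.time()
    return (int(t), int((t - int(t)) * 1000000))

  class QueuedJob:
    def __init__(self):
      self.received_timestamp = TimeStampNow()
      self.start_timestamp = None  # set when the first opcode starts
      self.end_timestamp = None    # set when the job finishes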
-
- Sep 29, 2008
-
Iustin Pop authored
This patch adds the job execution log to “gnt-job info” and also allows selecting it in “gnt-job list” (though there it's not very useful, as it's not easy to parse). It does this by adding a new field to the query-job call, named ‘oplog’. With this, one can get a very clear picture of the job. What remains to be added are timestamps for the start/stop of the processing of the job itself and of its opcodes. Reviewed-by: imsnah
-
Iustin Pop authored
It is not currently possible to show a summary of a job in the output of “gnt-job list”. The closest is listing the whole opcode(s), but that is too verbose. Also, the default output (id, status) is not very useful unless one looks for (and knows about) an exact job ID. The patch adds a “summary” description of a job, composed of the list of OP_IDs of the individual opcodes. Moreover, if an opcode has a ‘logical’ target in a certain opcode field (e.g. start instance has the instance name as the target), it is included in the formatting as well. It's easier to explain via a sample output:

  gnt-job list
  ID Status  Summary
  1  error   NODE_QUERY
  2  success NODE_ADD(gnta2)
  3  success CLUSTER_QUERY
  4  success NODE_REMOVE(gnta2.example.com)
  5  error   NODE_QUERY
  6  success NODE_ADD(gnta2)
  7  success NODE_QUERY
  8  success OS_DIAGNOSE
  9  success INSTANCE_CREATE(instance1.example.com)
  10 success INSTANCE_REMOVE(instance1.example.com)
  11 error   INSTANCE_CREATE(instance1.example.com)
  12 success INSTANCE_CREATE(instance1.example.com)
  13 success INSTANCE_SHUTDOWN(instance1.example.com)
  14 success INSTANCE_ACTIVATE_DISKS(instance1.example.com)
  15 error   INSTANCE_CREATE(instance2.example.com)
  16 error   INSTANCE_CREATE(instance2.example.com)
  17 success INSTANCE_CREATE(instance2.example.com)
  18 success INSTANCE_ACTIVATE_DISKS(instance1.example.com)
  19 success INSTANCE_ACTIVATE_DISKS(instance2.example.com)
  20 success INSTANCE_SHUTDOWN(instance1.example.com)
  21 success INSTANCE_SHUTDOWN(instance2.example.com)

This is done via a simple change to the opcode classes, which allows an opcode to format itself. The additional function is small enough to live in opcodes.py, where it could also be used by a client if needed. Reviewed-by: imsnah
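The self-formatting idea boils down to a few lines (the attribute names below follow the description but are assumptions):

  class OpCode:
    OP_ID = "OP_ABSTRACT"
    OP_DSC_FIELD = None  # name of the 'logical target' field, if any

    def Summary(self):
      summary = self.OP_ID[3:]  # drop the "OP_" prefix
      if self.OP_DSC_FIELD:
        summary += "(%s)" % getattr(self, self.OP_DSC_FIELD)
      return summary

  class OpCreateInstance(OpCode):
    OP_ID = "OP_INSTANCE_CREATE"
    OP_DSC_FIELD = "instance_name"
    def __init__(self, instance_name):
      self.instance_name = instance_name

  print(OpCreateInstance("instance1.example.com").Summary())
  # INSTANCE_CREATE(instance1.example.com)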
-
Iustin Pop authored
Unless we decide to change the job identifiers to integers, we should at least sort the list returned by _GetJobIDsUnlocked. Reviewed-by: imsnah
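Since the identifiers are strings, a plain sort would put "10" before "9"; one way to order them correctly as long as they stay strings:

  def SortJobIDs(job_ids):
    # sort string job IDs by their numeric value
    return sorted(job_ids, key=int)

  print(SortJobIDs(["10", "2", "1"]))  # ['1', '2', '10']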
-
- Sep 10, 2008
-
Michael Hanselmann authored
We haven't yet decided what exactly it should do with failed nodes. Reviewed-by: ultrotter
-
- Aug 29, 2008
-
Iustin Pop authored
This patch alters the WaitForJobChanges luxi RPC call to have a configurable timeout, so that the call behaves nicely with long jobs that produce no updates. We do this by adding a timeout parameter to the RPC call and returning a special constant when the timeout is reached without an update. The luxi client will repeatedly call WaitForJobChanges until it gets a real change. The timeout is hardcoded as half the RWTO value. The patch also removes an unused variable (new_state) from the WaitForJobChanges method. Reviewed-by: imsnah,ultrotter
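Client-side, the loop has roughly this shape (the constant names are assumptions based on the description):

  WFJC_TIMEOUT = 10            # e.g. half the RWTO value
  JOB_NOTCHANGED = "nochange"  # assumed marker for "timeout, no update"

  def WaitForRealChange(client, job_id, fields, prev_state):
    # keep calling until the server reports an actual change rather
    # than the timeout marker
    while True:
      result = client.WaitForJobChanges(job_id, fields, prev_state,
                                        WFJC_TIMEOUT)
      if result != JOB_NOTCHANGED:
        return result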
-
- Aug 27, 2008
-
Michael Hanselmann authored
A job should only exist once in memory. After the cache is cleaned, there can still be references to a job somewhere else, and if there are multiple instances, one can get updated while a function is waiting for changes on another. By using weakref.WeakValueDictionary, which automatically removes instances as soon as there are no strong references to them anymore, we can solve this problem. Reviewed-by: iustinp
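In miniature, the effect of the weak-valued cache (the class names here are illustrative):

  import weakref

  class JobCache:
    def __init__(self):
      self._memcache = weakref.WeakValueDictionary()

    def Get(self, job_id, loader):
      job = self._memcache.get(job_id)
      if job is None:
        job = loader(job_id)          # (re)load from disk
        self._memcache[job_id] = job  # dropped once unreferenced
      return job

  class Job:
    def __init__(self, job_id):
      self.id = job_id

  cache = JobCache()
  a = cache.Get(1, Job)
  b = cache.Get(1, Job)
  assert a is b  # one in-memory instance per job while referenced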
-
Michael Hanselmann authored
Reviewed-by: ultrotter
-
Michael Hanselmann authored
It can be confusing otherwise. Reviewed-by: ultrotter
-
Michael Hanselmann authored
This is a large patch, but I can't figure out how to split it without breaking stuff. The old way of getting messages by always fetching the last one didn't bring all messages to the client if they were added too fast, making commands like “gnt-cluster verify” less than useful. These changes introduce a serial number per log entry to keep track of which messages a client has already received. They also remove the per-opcode log lock to make reading log entries thread-safe. Reviewed-by: ultrotter
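A stripped-down model of the per-entry serials (the structure is assumed for illustration):

  class OpCodeLog:
    def __init__(self):
      self._serial = 0
      self._entries = []  # append-only list of (serial, message)

    def Append(self, message):
      self._serial += 1
      self._entries.append((self._serial, message))

    def EntriesAfter(self, last_seen):
      # clients pass the highest serial they have seen, so no entry
      # is skipped even when messages arrive faster than they poll
      return [e for e in self._entries if e[0] > last_seen]

  log = OpCodeLog()
  log.Append("checking nodes")
  log.Append("checking instances")
  print(log.EntriesAfter(1))  # [(2, 'checking instances')]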
-
- Aug 11, 2008
-
Michael Hanselmann authored
This way clients can react faster to status or message changes and don't have to poll anymore. Reviewed-by: ultrotter
-
Michael Hanselmann authored
See the comment in the patch. Reviewed-by: ultrotter
-
- Aug 08, 2008
-
Michael Hanselmann authored
Otherwise archived jobs might show up in the list again after a master failover. Reviewed-by: iustinp
-
Michael Hanselmann authored
This way we can do locking when both noded and masterd are running on the same machine, the latter holding an exclusive lock on the queue. Reviewed-by: iustinp
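On Unix this kind of cross-daemon locking can be done with flock; a minimal sketch (the lock file path is hypothetical):

  import fcntl

  def LockQueue(path="/var/lib/queue/lock", exclusive=True):
    # masterd would take an exclusive lock, noded a shared one
    fd = open(path, "w")
    mode = fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH
    fcntl.flock(fd, mode)  # blocks until the lock is granted
    return fd              # keep the handle open to hold the lock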
-
Michael Hanselmann authored
Reviewed-by: iustinp
-
- Aug 06, 2008
-
Michael Hanselmann authored
These functions will be used to notify the queue about newly added or removed nodes. Reviewed-by: iustinp
-
Michael Hanselmann authored
The job queue now maintains its own node list and updates it when nodes are added to or removed from the cluster. Reviewed-by: iustinp
-
Michael Hanselmann authored
The code makes sure not to include the master in the list. Reviewed-by: iustinp
-
- Aug 05, 2008
-
Michael Hanselmann authored
Newly added nodes are not yet taken care of. Queue locking on non-master nodes is not yet correct. Reviewed-by: iustinp
-
- Aug 04, 2008
-
Michael Hanselmann authored
Reviewed-by: iustinp
-
- Jul 31, 2008
-
Michael Hanselmann authored
This reduces code duplication. A later patch will modify the job queue a bit more and will need a change to this assert. The assertion is also removed from all class-internal functions. Reviewed-by: iustinp
-
Michael Hanselmann authored
The job queue will need access to the configuration, which is provided through the context object, to get a list of nodes. Reviewed-by: iustinp
-
- Jul 30, 2008
-
Iustin Pop authored
This is mostly:
- whitespace fixes (space at EOL in some files, not all; broken indentation; etc.)
- variable names overriding others (one is a real bug in there)
- too-long lines
- cleanup of most unused imports (not all)
Reviewed-by: ultrotter
-
Michael Hanselmann authored
We found several issues in the old job queue implementation. It had race conditions, deadlocks and other deficiencies. Short summary:
- _QueuedOpCode and _QueuedJob are now more or less data structures with a few utility functions; __Setup is gone.
- The DiskJobStorage and JobQueue classes were merged into one to reduce code complexity.
- One lock in JobQueue for almost everything; there's also a lock per opcode for log messages.
Reviewed-by: iustinp
-
- Jul 29, 2008
-
Michael Hanselmann authored
The passed parameters were not correct. Reviewed-by: iustinp, ultrotter
-
- Jul 28, 2008
-
Michael Hanselmann authored
Locking is not completely right due to a deadlock when the job calls UpdateJob after changing its status. Reviewed-by: ultrotter
-
Michael Hanselmann authored
Reviewed-by: ultrotter
-
- Jul 25, 2008
-
Michael Hanselmann authored
It might come in handy at some point and makes the code a bit easier to read. Reviewed-by: iustinp
-
- Jul 24, 2008
-
Michael Hanselmann authored
So far no error reporting to the client is done. Clients are not notified if a job doesn't exist or couldn't be archived because of its current status. The internal cache is always cleaned when the preconditions don't fail, to make sure that the actual disk status will be reread next time. Reviewed-by: iustinp
-
Michael Hanselmann authored
Reviewed-by: iustinp
-
- Jul 23, 2008
-
Michael Hanselmann authored
A later patch will add a memory based job storage class, hence this code is going into a separate class. It also changes the number format to always use at least 10 digits, allowing up to 9'999'999'999 jobs to be sorted without using a custom function. Reviewed-by: iustinp
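The padding trick in two lines: with at least 10 digits, lexicographic order matches numeric order, so plain sorted() suffices:

  def FormatJobID(job_id):
    # zero-pad to 10 digits so string sort equals numeric sort
    return "%010d" % job_id

  ids = [FormatJobID(n) for n in (9, 10, 100)]
  assert sorted(ids) == ids  # '0000000009' < '0000000010' < '0000000100'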
-
Michael Hanselmann authored
Reviewed-by: iustinp
-
Michael Hanselmann authored
The job ID is now a string, hence logging must use %s instead of %d. Reviewed-by: iustinp
-