Commits · a9d68e400b2577dfc111662af4a701077aff8dfe · itminedu / snf-ganeti

Dec 29, 2010

jqueue: Fix cancelling while in waitlock in queue · 30c945d0

Michael Hanselmann authored 14 years ago


Since the recent change to leave jobs in the “waitlock” status (commit
5fd6b694), cancelling a job while it's back in the queue would break.
This patch handles these cases and adds a unittest.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


Dec 15, 2010

jqueue: Keep jobs in “waitlock” while returning to queue · 5fd6b694

Michael Hanselmann authored 14 years ago


Iustin Pop reported that a job's file is updated many times while it
waits for locks held by other thread(s). After an investigation it was
concluded that the reason was a design decision for job priorities to
return jobs to the “queued” status if they couldn't acquire all locks.
Changing a job's status or priority requires an update to permanent
storage.

In a high-level view this is what happens:
1. Mark as waitlock
2. Write to disk as permanent storage (jobs left in this state by a
   crashing master daemon are resumed on restart)
3. Wait for lock (assume lock is held by another thread)
4. Mark as queued
5. Write to disk again
6. Return to workerpool

Another option originally discussed was to leave the job in the
“waitlock” status. Ignoring priority changes, this is what would happen:
1. If not in waitlock
1.1. Assert state == queued
1.2. Mark as waitlock
1.3. Set start_timestamp
1.4. Write to disk as permanent storage
3. Wait for locks (assume lock is held by another thread)
4. Leave in waitlock
5. Return to workerpool

Now let's assume the lock is released by the other thread:
[…]
3. Wait for locks and get them
4. Assert state == waitlock
5. Set state to running
6. Set exec_timestamp
7. Write to disk

As this change reduces the number of writes from two per lock acquire
attempt to two per opcode and one per priority increase (as happens
after 24 acquire attempts (see mcpu._CalculateLockAttemptTimeouts) until
the highest priority is reached), here's the patch to implement it.
Unittests are updated.
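
The difference in write counts can be illustrated with a small simulation. This is only a sketch, not the code from jqueue.py: the `FakeJob` class and the two `attempt_*` functions are hypothetical names, and "writing to disk" is reduced to incrementing a counter.

```python
class FakeJob:
    """Hypothetical stand-in for a queued job; counts simulated disk writes."""

    def __init__(self):
        self.status = "queued"
        self.writes = 0

    def write_to_disk(self):
        self.writes += 1


def attempt_old(job):
    """Old flow: mark waitlock, write, fail to lock, mark queued, write."""
    job.status = "waitlock"
    job.write_to_disk()
    # The lock is assumed to be held by another thread, so the acquire fails.
    job.status = "queued"
    job.write_to_disk()


def attempt_new(job):
    """New flow: only write when entering waitlock for the first time."""
    if job.status != "waitlock":
        assert job.status == "queued"
        job.status = "waitlock"
        job.write_to_disk()
    # The lock is assumed to be held by another thread; stay in waitlock.


old_job, new_job = FakeJob(), FakeJob()
for _ in range(5):  # five failed lock acquire attempts
    attempt_old(old_job)
    attempt_new(new_job)

print(old_job.writes, new_job.writes)  # 10 1
```

The simulation ignores priority increases, which would add one write each in the new flow.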

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


Oct 12, 2010

jqueue: Fix bug when cancelling jobs · 9e49dfc5

Michael Hanselmann authored 14 years ago


If a job was cancelled while it was waiting for locks, an assertion
would've failed. This patch fixes the problem and provides a unit
test to check for this situation.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


jqueue: Resume jobs from “waitlock” status (2nd try) · 320d1daf

Michael Hanselmann authored 14 years ago


Commit 5ef699a0 had to roll back an earlier attempt at implementing
this. With the improved job queue processor, this is finally possible.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


jqueue/gnt-job: Add job priority fields for display · b8802cc4

Michael Hanselmann authored 14 years ago


These fields can help with debugging.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


Oct 07, 2010

jqueue, CancelJob: Check status only once per call · 86b16e9d

Michael Hanselmann authored 14 years ago


This simplifies the code a bit--the status is only checked once.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


Sep 24, 2010

Fix docstring typo in jqueue._JobProcessor._MarkWaitlock · a38e8674

Michael Hanselmann authored 14 years ago


epydoc complained:
“File …/ganeti/jqueue.py, line 886, in
ganeti.jqueue._JobProcessor._MarkWaitlock
  Warning: Redefinition of type for job”

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


jqueue: Use priority for acquiring locks · f23db633

Michael Hanselmann authored 14 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>


jqueue: Use timeout when acquiring locks · 26d3fd2f

Michael Hanselmann authored 14 years ago


As already noted in the design document, an opcode's priority is
increased when the lock(s) can't be acquired within a certain amount of
time, except at the highest priority, where in such a case a blocking
acquire is used.

A unittest is provided. Priorities are not yet used for acquiring the
lock(s)—this will need further changes on mcpu.
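
The escalation rule described above could look roughly like the following. It is a minimal sketch using the standard `threading.Lock`, not Ganeti's own locking classes; the function name `acquire_with_priority` and the numeric value of `PRIO_HIGHEST` are assumptions made for illustration.

```python
import threading

# Assumed value for illustration: lower numbers mean higher priority.
PRIO_HIGHEST = -20


def acquire_with_priority(lock, priority, timeout):
    """Try to acquire `lock` once; return (acquired, new_priority).

    Below the highest priority a timed acquire is used, and a timeout
    raises the priority by one step. At the highest priority the acquire
    blocks until the lock is available.
    """
    if priority <= PRIO_HIGHEST:
        lock.acquire()  # blocking acquire, no timeout
        return True, priority
    if lock.acquire(timeout=timeout):
        return True, priority
    return False, priority - 1  # timed out: increase priority


lock = threading.Lock()
lock.acquire()  # simulate the lock being held elsewhere
first = acquire_with_priority(lock, 0, 0.01)
lock.release()
second = acquire_with_priority(lock, first[1], 0.01)
print(first, second)  # (False, -1) (True, -1)
```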

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>


Sep 23, 2010

jqueue: Introduce per-opcode context object · b80cc518

Michael Hanselmann authored 14 years ago


This is a better way to group per-opcode data.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>


jqueue: Rename current_op to better reflect what it actually is · 03b63608

Michael Hanselmann authored 14 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

jqueue: Separate function for in-memory variables · fa4aa6b4

Michael Hanselmann authored 14 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Sep 20, 2010

jqueue: Change model from per-job to per-opcode processing · be760ba8

Michael Hanselmann authored 14 years ago


In order to support priorities, the processing of jobs needs to be
changed. Instead of processing jobs as a whole, the code is changed to
process one opcode at a time and then return to the queue. See the
Ganeti 2.3 design document for details.

This patch does not yet use priorities for acquiring locks.
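
The per-opcode model can be sketched with a priority heap: a worker takes a job, runs exactly one opcode and puts the job back if opcodes remain. This is an illustrative sketch only; `run_queue` and the job layout are hypothetical and do not mirror the jqueue.py classes.

```python
import heapq
import itertools

_counter = itertools.count()  # tie-breaker keeps FIFO order within a priority


def run_queue(jobs):
    """Process jobs one opcode at a time, re-queueing after each opcode.

    `jobs` maps a job name to (priority, list of opcode names); lower
    numbers are processed first.
    """
    heap = []
    for name, (prio, ops) in jobs.items():
        heapq.heappush(heap, (prio, next(_counter), name, list(ops)))

    executed = []
    while heap:
        prio, _, name, ops = heapq.heappop(heap)
        executed.append((name, ops.pop(0)))  # run exactly one opcode
        if ops:  # job not finished: return it to the queue
            heapq.heappush(heap, (prio, next(_counter), name, ops))
    return executed


order = run_queue({"first": (0, ["a", "b"]), "second": (0, ["x", "y"])})
print(order)
# [('first', 'a'), ('second', 'x'), ('first', 'b'), ('second', 'y')]
```

With equal priorities the two jobs interleave, because each job goes to the back of the queue after every opcode.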

The enclosed unittests increase the test coverage of jqueue.py from
about 34% to 58%. Please note that they also test some parts not added
by this patch, but testing them only became possible with some
infrastructure added by this patch. For the first time, many
implications and assumptions for the job queue are codified in these
unittests.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


jqueue: Use priority for worker pool · 7b5c4a69

Michael Hanselmann authored 14 years ago


A small helper function is added to make this easier. Priorities are not
yet used in all necessary places.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


jqueue: Add missing docstring to _QueuedJob.Cancel · a0d2fe2c

Michael Hanselmann authored 14 years ago


This was forgotten in commit 099b2870.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


Sep 16, 2010

jqueue: Move CancelJob logic to separate function · 099b2870

Michael Hanselmann authored 14 years ago


Moving the internals of this function will allow it to be used from
unittests in the future. Splitting this into a pure, side-effect free
function and an impure one makes the pure function easily testable.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


Sep 13, 2010

jqueue: Ensure only accepted priorities are allowed for submitting jobs · e71c8147

Michael Hanselmann authored 14 years ago


Quoting the design document: “Submitted opcodes can have one of the priorities
listed below. Other priorities are reserved for internal use”. Submitting jobs
at priority -20 should not be allowed.
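
A check of this kind might be sketched as follows. The constant names and the numeric values other than -20 are assumptions for illustration, as is the helper `check_submit_priority`; the real definitions live in Ganeti's constants module.

```python
# Assumed priority values for illustration; lower numbers mean higher priority.
PRIO_LOW, PRIO_NORMAL, PRIO_HIGH, PRIO_HIGHEST = 10, 0, -10, -20

# Only these may be used when submitting; others are reserved for internal use.
SUBMIT_VALID = frozenset([PRIO_LOW, PRIO_NORMAL, PRIO_HIGH])


def check_submit_priority(priority):
    """Raise ValueError if `priority` is not allowed for submitted opcodes."""
    if priority not in SUBMIT_VALID:
        raise ValueError("Invalid submit priority: %r" % (priority,))
    return priority


print(check_submit_priority(PRIO_NORMAL))  # 0
```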

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


Add support for job priority to opcodes and job queue objects · 8f5c488d

Michael Hanselmann authored 14 years ago


This allows clients to submit opcodes with a priority. Except for being
tracked by the job queue, it is not yet used by any code.

Unittests for jqueue._QueuedOpCode and jqueue._QueuedJob are provided for
the first time.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


Remove mcpu's ReportLocks callback · acf931b7

Michael Hanselmann authored 14 years ago


This is no longer needed with the new lock monitor. One callback is kept to
check for cancelled jobs.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


Revert "jqueue: Resume jobs from “waitlock” status" · 5ef699a0

Michael Hanselmann authored 14 years ago


This reverts commit 4008c8ed.

While it worked in my initial tests, I've now found cases where this doesn't
work properly as it is. More work is needed and will be done as part of the
Ganeti 2.3 job queue changes.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


Sep 10, 2010

jqueue: Resume jobs from “waitlock” status · 4008c8ed

Michael Hanselmann authored 14 years ago


After an unclean restart of ganeti-masterd, jobs in the “waitlock” status can
be safely restarted. They hadn't modified the cluster yet.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>


jqueue: Move queue inspection into separate function · de9d02c7

Michael Hanselmann authored 14 years ago


This makes the __init__ function a lot smaller while not changing
functionality.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>


jqueue: Don't update file in MarkUnfinishedOps · 747f6113

Michael Hanselmann authored 14 years ago


This reduces the number of updates to the job files. It's used in two places
while processing a job and the file is updated just afterwards.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


Sep 07, 2010

Move job queue to new ganeti.runtime · 82b22e19

René Nussbaumer authored 14 years ago


Signed-off-by: René Nussbaumer <rn@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>


jqueue: Use separate function for encoding errors · 6760e4ed

Michael Hanselmann authored 14 years ago


Comes with unittest.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


Aug 24, 2010

workerpool: Allow setting task name · daba67c7

Michael Hanselmann authored 14 years ago

With this patch, the task name is added to the thread name and will show up in
logs. Log messages from jobs will look like “pid=578/JobQueue14/Job13 mcpu:289
DEBUG LU locks acquired/cluster/BGL/shared”.
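
The idea of appending a task name to the thread name can be sketched with the standard `threading` module. `run_named_task` is a hypothetical helper, not the workerpool API, and the exact name format is an assumption.

```python
import threading


def run_named_task(task_name, func):
    """Run `func` with the task name appended to the thread name."""
    thread = threading.current_thread()
    original = thread.name
    thread.name = "%s/%s" % (original, task_name)
    try:
        return func()
    finally:
        thread.name = original  # restore the name once the task is done


seen = run_named_task("Job13", lambda: threading.current_thread().name)
print(seen)
```

Any log formatter that includes the thread name then shows the task name without further changes.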

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>


Aug 19, 2010

jqueue: Remove lock status field · 9bdab621

Michael Hanselmann authored 14 years ago

With the job queue changes for Ganeti 2.2, watched and queried jobs are
loaded directly from disk, rendering the in-memory “lock_status” field
useless. Writing it to disk would be possible, but has a huge cost at
runtime (when tested, processing 1'000 opcodes involved 4'000 additional
writes to job files, even with replication turned off).

Using an additional in-memory dictionary to just manage this field turned
out to be a complicated task due to the necessary locking.

The plan is to introduce a more generic lock debugging mechanism in the
near future. Hence the decision is to remove this field now instead of
spending a lot of time to make it work again.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


Aug 18, 2010

jqueue: Mark opcodes following failed ones as failed, too · 963a068b

Michael Hanselmann authored 14 years ago


When an opcode fails, the job queue would leave following opcodes as “queued”,
which can be quite confusing. With this patch, they're all marked as failed and
assertions are added to check this.
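
As a rough sketch of the behaviour (the status strings and the helper name `mark_following_failed` are illustrative assumptions, not taken from jqueue.py):

```python
def mark_following_failed(statuses, failed_index):
    """Return a copy where the failed opcode and all later ones are 'error'."""
    result = list(statuses)
    for idx in range(failed_index, len(result)):
        result[idx] = "error"
    # Mirror the idea of asserting that nothing is left as "queued".
    assert all(s == "error" for s in result[failed_index:])
    return result


print(mark_following_failed(["success", "running", "queued", "queued"], 1))
# ['success', 'error', 'error', 'error']
```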

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


jqueue: Work around race condition between job processing and archival · 6ea72e43

Michael Hanselmann authored 14 years ago


This is a simplified version of a patch I sent earlier to make sure the job
file is only written once with a finalized status.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


Support for resolving hostnames to IPv6 addresses · b705c7a6

Manuel Franceschini authored 14 years ago


This patch enables IPv6 name resolution by using socket.getaddrinfo
instead of socket.gethostbyname_ex.

It renames the HostInfo class to Hostname and unifies its use throughout
the code. This is achieved by using static calls where no object is
needed and by removing some obsolete code.

For now, we just resolve to IPv4 addresses, but this will change once it
is needed.
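
A minimal sketch of resolving through `socket.getaddrinfo` while still restricting results to IPv4 is shown below. The helper name `resolve_ipv4` is hypothetical and is not the Hostname class from the patch.

```python
import socket


def resolve_ipv4(name):
    """Return the IPv4 addresses `name` resolves to, without duplicates."""
    infos = socket.getaddrinfo(name, None, socket.AF_INET, socket.SOCK_STREAM)
    addresses = []
    for _family, _type, _proto, _canon, sockaddr in infos:
        if sockaddr[0] not in addresses:
            addresses.append(sockaddr[0])
    return addresses


print(resolve_ipv4("127.0.0.1"))  # ['127.0.0.1']
```

Switching the family argument to `socket.AF_INET6` or `socket.AF_UNSPEC` is what would later allow IPv6 results.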

Signed-off-by: Manuel Franceschini <livewire@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


Aug 17, 2010

jqueue: More checks for cancelling queued job · dc1e2262

Michael Hanselmann authored 14 years ago


We can also check when the lock status is updated. This will
improve job cancelling.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


jqueue: Add more debug output · e35344b4

Michael Hanselmann authored 14 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


Jul 30, 2010

Fix a few job archival issues · aa9f8167

Iustin Pop authored 14 years ago


This patch fixes two issues with job archival. First, LoadJobFromDisk
can return 'None' for no-such-job, and we shouldn't add None to the job
list; we can't anyway, as this raises an exception:

  node1# gnt-job archive foo
  Unhandled protocol error while talking to the master daemon:
  Caught exception: cannot create weak reference to 'NoneType' object

After fixing this, job archival of missing jobs will just continue
silently, so we modify gnt-job archive to log jobs which were not
archived and to return exit code 1 for any missing jobs.
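
The handling of missing jobs can be sketched as follows. `collect_jobs` and the dictionary-based loader are hypothetical stand-ins for the queue code and for LoadJobFromDisk.

```python
def collect_jobs(job_ids, load_job):
    """Return (loaded jobs, missing ids); `load_job` may return None."""
    jobs, missing = [], []
    for job_id in job_ids:
        job = load_job(job_id)
        if job is None:
            missing.append(job_id)  # never add None to the job list
        else:
            jobs.append(job)
    return jobs, missing


stored = {"1": "job-1", "2": "job-2"}
jobs, missing = collect_jobs(["1", "foo", "2"], stored.get)
exit_code = 1 if missing else 0
print(jobs, missing, exit_code)  # ['job-1', 'job-2'] ['foo'] 1
```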

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>


Jul 29, 2010

Change handling of non-Ganeti errors in jqueue · 599ee321

Iustin Pop authored 14 years ago


Currently, if a job execution raises a Ganeti-specific error (i.e.
subclass of GenericError), then we encode it as (error class, [error
args]). This matches the RAPI documentation.

However, if we get a non-Ganeti error, then we encode it as simply
str(err), a single string. This means that the opresult field does not
conform to the RAPI docs, which makes it hard to reliably parse the
job results.

This patch changes the encoding of a failed job (via failure) to always
be an OpExecError, so that we always encode it properly. For the command
line interface, the behaviour is the same, as any non-Ganeti errors get
re-encoded as OpExecError anyway. For the RAPI clients, it only means
that we always present the same type for results. The actual error value
is the same, since the err.args is either way str(original_error);
compare the original (doesn't contain the ValueError):

  "opresult": [
    "invalid literal for int(): aa"
  ],

with:

  "opresult": [
    [
      "OpExecError",
      [
        "invalid literal for int(): aa"
      ]
    ]
  ],
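
The encoding rule can be sketched like this. The two exception classes are local stand-ins for the Ganeti error classes, and `encode_failure` is a hypothetical helper name.

```python
class GenericError(Exception):
    """Stand-in for the Ganeti base error class."""


class OpExecError(GenericError):
    """Stand-in for the error used to wrap non-Ganeti failures."""


def encode_failure(err):
    """Encode any exception as (class name, [args]), wrapping foreign errors."""
    if not isinstance(err, GenericError):
        err = OpExecError(str(err))
    return (err.__class__.__name__, list(err.args))


try:
    int("aa")
except ValueError as exc:
    print(encode_failure(exc))
```

Callers can then always expect a pair of class name and argument list, whatever the original exception type was.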

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>


workerpool: Change signature of AddTask function to not use *args · b2e8a4d9

Michael Hanselmann authored 14 years ago


By changing it to a normal parameter, which must be a sequence, we can
start using keyword parameters.

Before this patch all arguments to “AddTask(self, *args)” were passed as
arguments to the worker's “RunTask” method. Priorities, which should be
optional and will be implemented in a future patch, must be passed as a keyword
parameter. This means “*args” can no longer be used as one can't combine *args
and keyword parameters in a clean way:

>>> def f(name=None, *args):
...   print "%r, %r" % (args, name)
...
>>> f("p1", "p2", "p3", name="thename")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: f() got multiple values for keyword argument 'name'
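
For comparison, a signature that takes the arguments as one sequence leaves room for keyword parameters. The `Pool` class and `add_task` method below are a hypothetical sketch written for Python 3, not the workerpool module itself.

```python
class Pool:
    """Hypothetical minimal pool that records tasks instead of running them."""

    def __init__(self):
        self.tasks = []

    def add_task(self, args, priority=0):
        """Queue a task; `args` must be a sequence passed on to the worker."""
        if not isinstance(args, (tuple, list)):
            raise TypeError("args must be a sequence")
        self.tasks.append((priority, tuple(args)))


pool = Pool()
pool.add_task(("p1", "p2", "p3"))     # task arguments stay in one sequence
pool.add_task(("p4",), priority=-10)  # keyword parameter no longer clashes
print(pool.tasks)  # [(0, ('p1', 'p2', 'p3')), (-10, ('p4',))]
```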

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


Jul 16, 2010

Implement lock names for debugging purposes · 7f93570a

Iustin Pop authored 14 years ago


This patch adds lock names to SharedLocks and LockSets, that can be used
later for displaying the actual locks being held/used in places where we
only have the lock, and not the entire context of the locking operation.

Since I realized that the production code doesn't call LockSet with the
proper members= syntax, but directly as positional parameters, I've
converted this (and the arguments to GlobalLockManager) into positional
arguments.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>


Jul 15, 2010

jqueue: Factorize code waiting for job changes · 989a8bee

Michael Hanselmann authored 14 years ago


By splitting the _WaitForJobChangesHelper class into multiple smaller
classes, we gain in several places:

- Simpler code, less interaction between functions and variables
- Easy to unittest (close to 100% coverage)
- Waiting for job changes has no direct knowledge of the queue anymore (it
  doesn't reference queue functions anymore, especially not private ones)
- Activate inotify only if there was no change at the beginning (and
  check again right away to avoid race conditions)

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>


Jul 12, 2010

jqueue: Setup inotify before checking for any job changes · 2034c70d

Michael Hanselmann authored 14 years ago


Since the code waiting for job changes was modified to use inotify,
a race condition between checking for changes the first time and
setting up inotify occurs. If the job is modified after the check
but before inotify is active, changes would only be noticed after
the timeout (29 seconds in most cases) expired.
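
The ordering that avoids the race can be sketched without inotify itself. The three callables passed to `wait_for_job_change` are hypothetical stand-ins for setting up the watch, comparing the job file and blocking until an event or timeout.

```python
def wait_for_job_change(setup_watch, has_changed, wait_for_event):
    """Return True if a change was seen, setting up the watch first."""
    setup_watch()       # 1. start watching before the first check
    if has_changed():   # 2. a change made before this point is seen here
        return True
    wait_for_event()    # 3. a change made after step 1 wakes us up
    return has_changed()


calls = []
changed = wait_for_job_change(
    lambda: calls.append("watch"),
    # Pretend the job changes while we wait: the second check reports it.
    lambda: calls.append("check") or len(calls) > 3,
    lambda: calls.append("wait"),
)
print(calls, changed)  # ['watch', 'check', 'wait', 'check'] True
```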

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


Jul 09, 2010

Introduce lib/netutils.py · a744b676

Manuel Franceschini authored 14 years ago


This patch moves network utility functions to a dedicated module.

Signed-off-by: Manuel Franceschini <livewire@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


Jul 06, 2010

Fix opcode transition from WAITLOCK to RUNNING · 271daef8

Iustin Pop authored 14 years ago


With the recent changes in the job queue, an old bug surfaced: we never
serialized the status change when in NotifyStart, thus a crash of the
master would have left the job queue oblivious to the fact that the job
was actually running.

In the previous implementation, queries against the job status were
using the in-memory object, so they 'saw' and correctly reported the
running status. But the new implementation just looks at the on-disk
version, and thus didn't see this transition.

The patch also moves NotifyStart to a decorator-based version (like the
other functions), which generates a lot of churn in the diff, sorry.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
