- Sep 13, 2010
Michael Hanselmann authored
This is no longer needed with the new lock monitor. One callback is kept to check for cancelled jobs.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
This reverts commit 4008c8ed. While it worked in my initial tests, I've now found cases where this doesn't work properly as it is. More work is needed and will be done as part of the Ganeti 2.3 job queue changes.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
- Sep 10, 2010
Michael Hanselmann authored
After an unclean restart of ganeti-masterd, jobs in the “waitlock” status can be safely restarted, as they hadn't modified the cluster yet.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
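As a hedged illustration of the restart pass described above, in Python with invented names (requeue_waitlock_jobs, add_task, the status strings); Ganeti's actual code differs:

  # Sketch: jobs still waiting for locks never touched the cluster,
  # so after an unclean masterd restart they can simply be re-queued.
  JOB_STATUS_WAITLOCK = "waitlock"
  JOB_STATUS_QUEUED = "queued"

  def requeue_waitlock_jobs(jobs, worker_pool):
    """Re-queue jobs that were only waiting for locks."""
    for job in jobs:
      if job.status == JOB_STATUS_WAITLOCK:
        job.status = JOB_STATUS_QUEUED  # safe: no cluster changes yet
        worker_pool.add_task(job)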
-
Michael Hanselmann authored
This makes the __init__ function a lot smaller while not changing functionality.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
-
Michael Hanselmann authored
This reduces the number of updates to the job files. It's used in two places while processing a job, and the file is updated just afterwards.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
- Sep 07, 2010
Michael Hanselmann authored
Comes with unittest.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
- Aug 24, 2010
Michael Hanselmann authored
With this patch, the task name is added to the thread name and will show up in logs. Log messages from jobs will look like “pid=578/JobQueue14/Job13 mcpu:289 DEBUG LU locks acquired/cluster/BGL/shared”.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
- Aug 19, 2010
Michael Hanselmann authored
With the job queue changes for Ganeti 2.2, watched and queried jobs are loaded directly from disk, rendering the in-memory “lock_status” field useless. Writing it to disk would be possible, but has a huge cost at runtime (when tested, processing 1'000 opcodes involved 4'000 additional writes to job files, even with replication turned off). Using an additional in-memory dictionary to manage just this field turned out to be a complicated task due to the necessary locking. The plan is to introduce a more generic lock debugging mechanism in the near future. Hence the decision is to remove this field now instead of spending a lot of time making it work again.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
- Aug 18, 2010
Michael Hanselmann authored
When an opcode fails, the job queue would leave the following opcodes as “queued”, which can be quite confusing. With this patch, they're all marked as failed, and assertions are added to check this.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
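A rough sketch of that finalization step, with hypothetical names (finalize_remaining_opcodes, the status constants); the real patch operates on Ganeti's own job and opcode objects:

  OP_STATUS_QUEUED = "queued"
  OP_STATUS_ERROR = "error"

  def finalize_remaining_opcodes(job, failed_idx):
    """Mark every opcode after a failed one as failed, not 'queued'."""
    for op in job.ops[failed_idx + 1:]:
      if op.status == OP_STATUS_QUEUED:
        op.status = OP_STATUS_ERROR
        op.result = "Not executed because of an earlier opcode failure"
    # The added assertions check that nothing stays in a pending state
    assert all(op.status != OP_STATUS_QUEUED for op in job.ops)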
-
Michael Hanselmann authored
This is a simplified version of a patch I sent earlier to make sure the job file is only written once with a finalized status.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
- Aug 17, 2010
Michael Hanselmann authored
We can also check when the lock status is updated. This will improve job cancelling.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
- Jul 30, 2010
Iustin Pop authored
This patch fixes two issues with job archival. First, LoadJobFromDisk can return 'None' for no-such-job, and we shouldn't add None to the job list; we can't anyway, as this raises an exception:

  node1# gnt-job archive foo
  Unhandled protocol error while talking to the master daemon:
  Caught exception: cannot create weak reference to 'NoneType' object

After fixing this, job archival of missing jobs will just continue silently, so we modify gnt-job archive to log jobs which were not archived and to return exit code 1 for any missing jobs.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
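The resulting client-side behaviour could look roughly like this sketch (archive_jobs, load_job and archive are invented stand-ins for the gnt-job/luxi calls):

  def archive_jobs(queue, job_ids):
    """Archive jobs, logging and reporting the ones that are missing."""
    missing = []
    for job_id in job_ids:
      job = queue.load_job(job_id)  # may return None for no-such-job
      if job is None:
        missing.append(job_id)      # never add None to the job list
        continue
      queue.archive(job)
    for job_id in missing:
      print("Job %s was not found, not archived" % job_id)
    return 1 if missing else 0      # exit code 1 for any missing job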
- Jul 29, 2010
Iustin Pop authored
Currently, if a job execution raises a Ganeti-specific error (i.e. a subclass of GenericError), then we encode it as (error class, [error args]). This matches the RAPI documentation. However, if we get a non-Ganeti error, then we encode it simply as str(err), a single string. This means the opresult field does not match the RAPI docs, and thus it's hard to reliably parse the job results. This patch changes the encoding of a failed job (via failure) to always be an OpExecError, so that we always encode it properly. For the command line interface, the behaviour is the same, as any non-Ganeti errors get re-encoded as OpExecError anyway. For RAPI clients, it only means that we always present the same type for results. The actual error value is the same, since err.args is either way str(original_error); compare the original (doesn't contain the ValueError):

  "opresult": [
    "invalid literal for int(): aa"
  ],

with:

  "opresult": [
    [
      "OpExecError",
      [
        "invalid literal for int(): aa"
      ]
    ]
  ],

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
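The encoding rule can be demonstrated with a small self-contained sketch; encode_failure is a made-up name, and the class stubs stand in for Ganeti's errors module:

  class GenericError(Exception):
    """Stand-in for the base class of all Ganeti errors."""

  class OpExecError(GenericError):
    """Stand-in for Ganeti's opcode execution error."""

  def encode_failure(err):
    """Encode any failure as (error class, [error args])."""
    if not isinstance(err, GenericError):
      # Re-encode foreign exceptions so RAPI clients always get the
      # same shape; err.args ends up as [str(original_error)].
      err = OpExecError(str(err))
    return (err.__class__.__name__, list(err.args))

  print(encode_failure(ValueError("invalid literal for int(): aa")))
  # ('OpExecError', ['invalid literal for int(): aa'])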
-
Michael Hanselmann authored
By changing it to a normal parameter, which must be a sequence, we can start using keyword parameters. Before this patch all arguments to “AddTask(self, *args)” were passed as arguments to the worker's “RunTask” method. Priorities, which should be optional and will be implemented in a future patch, must be passed as a keyword parameter. This means “*args” can no longer be used, as one can't combine *args and keyword parameters in a clean way:

  >>> def f(name=None, *args):
  ...   print "%r, %r" % (args, name)
  ...
  >>> f("p1", "p2", "p3", name="thename")
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  TypeError: f() got multiple values for keyword argument 'name'

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
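A minimal sketch of the resulting interface, assuming a simplified worker pool (this AddTask is a toy version, not Ganeti's):

  class WorkerPool(object):
    def __init__(self):
      self._tasks = []

    def AddTask(self, args, priority=0):
      """Task arguments are one sequence, so keyword parameters such
      as an optional priority no longer clash with them."""
      self._tasks.append((priority, tuple(args)))

  pool = WorkerPool()
  pool.AddTask(("p1", "p2", "p3"), priority=10)  # unambiguous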
- Jul 16, 2010
Iustin Pop authored
This patch adds lock names to SharedLocks and LockSets, which can later be used for displaying the actual locks being held/used in places where we only have the lock, and not the entire context of the locking operation. Since I realized that the production code doesn't call LockSet with the proper members= syntax, but directly with positional parameters, I've converted this (and the arguments to GlobalLockManager) into positional arguments.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
- Jul 15, 2010
Michael Hanselmann authored
By splitting the _WaitForJobChangesHelper class into multiple smaller classes, we gain in several places:
- Simpler code, less interaction between functions and variables
- Easy to unittest (close to 100% coverage)
- Waiting for job changes no longer has direct knowledge of the queue (it doesn't reference queue functions anymore, especially not private ones)
- Activate inotify only if there was no change at the beginning (and check again right away to avoid race conditions)
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
- Jul 12, 2010
Michael Hanselmann authored
Since the code waiting for job changes was modified to use inotify, a race condition between checking for changes the first time and setting up inotify occurs. If the job is modified after the check but before inotify is active, changes would only be noticed after the timeout (29 seconds in most cases) expired.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
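The fix amounts to the classic check/arm/re-check pattern; here is a hedged sketch with all collaborators passed in as callables (names invented):

  def wait_for_change(load_job, previous, setup_inotify, wait, timeout):
    """Check, arm inotify, then check again to close the race."""
    job = load_job()
    if job != previous:
      return job             # changed before we started watching
    notifier = setup_inotify()
    job = load_job()         # re-check: the job may have changed
    if job != previous:      # while inotify was not yet active
      return job
    wait(notifier, timeout)  # any later change now wakes us up
    return load_job()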
- Jul 09, 2010
Manuel Franceschini authored
This patch moves network utility functions to a dedicated module.
Signed-off-by: Manuel Franceschini <livewire@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
- Jul 06, 2010
Iustin Pop authored
With the recent changes in the job queue, an old bug surfaced: we never serialized the status change when in NotifyStart, thus a crash of the master would have left the job queue oblivious to the fact that the job was actually running. In the previous implementation, queries against the job status were using the in-memory object, so they 'saw' and correctly reported the running status. But the new implementation just looks at the on-disk version, and thus didn't see this transition. The patch also moves NotifyStart to a decorator-based version (like the other functions), which generates a lot of churn in the diff, sorry.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
- Jun 28, 2010
Guido Trotter authored
By using ssynchronized in the new way, we can remove the module-global _big_jqueue_lock and revert to an internal _lock inside the jqueue.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
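For illustration, a decorator in the spirit of ssynchronized (this sketch uses a plain threading.Lock and ignores shared acquisition, which the real SharedLock supports):

  import functools
  import threading

  def ssynchronized(lock, shared=False):
    """Hold the given lock around each call of the wrapped function."""
    def decorator(fn):
      @functools.wraps(fn)
      def wrapper(*args, **kwargs):
        lock.acquire()  # the real version would honour 'shared' here
        try:
          return fn(*args, **kwargs)
        finally:
          lock.release()
      return wrapper
    return decorator

  _lock = threading.Lock()

  @ssynchronized(_lock)
  def UpdateJob(job):
    pass  # runs with _lock held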
-
Guido Trotter authored
We can share the jqueue lock when we do per-job updates. These only conflict with updates/checks on the same job from another thread (e.g. CancelJob, ArchiveJob, which keep the lock unshared, since they are less frequent).
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Guido Trotter authored
Move some code to a decorated function rather than explicitly acquiring/releasing the lock in AppendFeedback.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Guido Trotter authored
Remove the jqueue _lock member and convert to a _big_jqueue_lock sharedlock. This allows smooth transition from the old single lock to a more granular approach.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Guido Trotter authored
Every time we call MarkUnfinishedOps we do it in a try/finally block that updates the job file. With this patch we move the try/finally inside. CancelJobUnlocked is removed, because it just becomes a wrapper over MarkUnfinishedOps with two constant values.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
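Sketched with invented names (mark_unfinished_ops, update_job_file), the shape of the change is:

  def mark_unfinished_ops(job, status, result, update_job_file):
    """Finalize unfinished opcodes; the job file is always rewritten."""
    try:
      for op in job.ops:
        if op.status in ("queued", "waitlock", "running"):
          op.status = status
          op.result = result
    finally:
      update_job_file(job)  # previously repeated in every caller

  # Cancelling is then just a call with two constant values, which is
  # why a separate CancelJobUnlocked wrapper is no longer needed.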
-
Guido Trotter authored
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
- Jun 23, 2010
Guido Trotter authored
We don't need it anymore, since nobody waits on it.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Guido Trotter authored
As with QueryJobs, we rely on file updates rather than condition notification to detect job changes. In order to do that we use the pyinotify module to watch files. This might make the client a bit slower (pending planned improvements, such as subscription-based WaitForJobChanges) but detaches it from the job execution.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Guido Trotter authored
This is needed to convert waitforjobchanges to use inotify and the on-disk version and decouple it from the job queue lock. No replication to remote nodes is done, to keep the operation fast.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Guido Trotter authored
We move from querying the in-memory version to loading all jobs from disk. Since jobs are written/deleted on disk atomically, we don't need to lock at all. Also, since we're just looking at the contents of a directory, we don't need to check that the job queue is "open". If some jobs are removed between listing and loading them, we need to be able to cope: if we were asked to load those jobs specifically, we must report the failure, but if we were just asked to "load all", we simply don't consider them part of the "all" set, since they were deleted.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
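A simplified sketch of the "load all" case (the directory layout, file-name prefix and helper names are assumptions):

  import os

  def query_all_jobs(queue_dir, load_job_from_disk):
    """Load every job file currently on disk, without locking."""
    jobs = []
    for name in sorted(os.listdir(queue_dir)):
      if not name.startswith("job-"):
        continue  # ignore lock files, archives, etc.
      job = load_job_from_disk(os.path.join(queue_dir, name))
      if job is not None:  # deleted between listing and loading
        jobs.append(job)
    return jobs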
-
Guido Trotter authored
This will be used to read a job file without having to deal with exceptions from _LoadJobFromDisk.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Guido Trotter authored
Currently _LoadJobFromDisk archives job files it finds corrupted. Since we want to use it to load files without holding locks, this could cause a conflict: we just move the feature to _LoadJobUnlocked, which is always called with the lock held.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
- Jun 17, 2010
Guido Trotter authored
Rather than adding the jobs to the worker pool one at a time, we add them all together, which is slightly faster, and ensures they don't get started while we loop.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
-
Guido Trotter authored
Sometimes it's useful to write to the local filesystem, but immediate replication to all master candidates is not needed. The _WriteAndReplicateFileUnlocked function gets renamed to _UpdateJobQueueFile, as calling "write and replicate, but don't replicate" seemed a bit strange.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
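A hedged sketch of a write helper with optional replication (os.replace for the atomic rename and node.upload_file are stand-ins for Ganeti's actual file and RPC layers):

  import os

  def update_job_queue_file(path, data, replicate, candidates=()):
    """Write a queue file locally, optionally pushing it to nodes."""
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
      fh.write(data)
    os.replace(tmp, path)  # atomic replacement on POSIX
    if replicate:
      for node in candidates:  # master candidates only
        node.upload_file(path, data)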
-
Guido Trotter authored
The job queue currently has a static _GetJobInfoUnlocked method. Change it to a normal method of _QueuedJob, which makes more sense.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
-
Guido Trotter authored
Move the work from _LoadJobUnlocked to _LoadJobFileFromDisk, which can then be used in other contexts as well. Also, if we fail to deserialize the job, archive it as well (before, we archived it only if we failed to create the related object, but kept it there if deserialization failed).
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
- Jun 15, 2010
Guido Trotter authored
Among all users, it turns out just one *may* need the output to be sorted; all the others can cope without it.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Guido Trotter authored
In some places we do try/del/except and in others just pop. Using pop everywhere saves lines of code.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
- Jun 11, 2010
Guido Trotter authored
Currently, each time we submit a job we check the job queue size and the drained file. With this change we keep these pieces of information in memory and don't read them from the filesystem each time. Significant changes include:
- The drained value can only be properly set by calling the appropriate cluster command ("gnt-cluster queue drain/undrain") and not by removing/creating the file in the job queue directory. Not that anybody would have done it in this undocumented way before.
- We get rid of the soft limit for the job queue, which we haven't ever used anyway.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
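In outline, the in-memory bookkeeping might look like this (class and method names invented; note the single hard limit, since the soft limit is gone):

  class JobQueueState(object):
    """Queue size and drained flag kept in memory instead of being
    re-read from the filesystem on every submission."""

    def __init__(self, size, drained, max_size):
      self._size = size          # counted once at startup
      self._drained = drained    # read once from the drain file
      self._max_size = max_size  # single hard limit, no soft limit

    def SetDrained(self, drained):
      # Flipped only via "gnt-cluster queue drain/undrain", which also
      # persists the flag for the next masterd start.
      self._drained = drained

    def CheckSubmit(self):
      if self._drained:
        raise RuntimeError("Job queue is drained")
      if self._size >= self._max_size:
        raise RuntimeError("Job queue is full")
      self._size += 1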
-
Guido Trotter authored
The name clarifies the difference between this and the internal lock. Also explain a bit better what it is.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>