============
Chained jobs
============

.. contents:: :depth: 4

This is a design document about the innards of Ganeti's job processing.
Readers are advised to study previous design documents on the topic:

- :ref:`Original job queue <jqueue-original-design>`
- :ref:`Job priorities <jqueue-job-priority-design>`
- :doc:`LU-generated jobs <design-lu-generated-jobs>`


Current state and shortcomings
==============================

Ever since the introduction of the job queue with Ganeti 2.0 there have
been situations where we wanted to run several jobs in a specific order.
Due to the job queue's current design, such a guarantee cannot be
given. Jobs are run according to their priority, their ability to
acquire all necessary locks, and other factors.

One way to work around this limitation is to do some kind of job
grouping in the client code. Once all jobs of a group have finished, the
next group is submitted and waited for. There are different kinds of
clients for Ganeti, some of which don't share code (e.g. Python clients
vs. htools). This design proposes a solution which would be implemented
as part of the job queue in the master daemon.


Proposed changes
================

With the implementation of :ref:`job priorities
<jqueue-job-priority-design>` the processing code was re-architected
and became a lot more versatile. It now returns jobs to the queue when
the locks for an opcode can't be acquired, allowing other jobs/opcodes
to run in the meantime.

The proposal is to add a new, optional property to opcodes to define
dependencies on other jobs. Job X could define opcodes with a
dependency on the success of job Y; those opcodes would only be run
once job Y has finished. If there's a dependency on success and job Y
failed, job X would fail as well. Since such dependencies use job IDs,
the jobs still need to be submitted in the right order.

.. pyassert::

   # Update description below if the set of finalized job statuses changes
   constants.JOBS_FINALIZED == frozenset([
     constants.JOB_STATUS_CANCELED,
     constants.JOB_STATUS_SUCCESS,
     constants.JOB_STATUS_ERROR,
     ])

The new attribute's value would be a list of two-element tuples. Each
tuple contains a job ID and a list of requested statuses for the job
depended upon. Only finalized statuses are accepted
(:pyeval:`utils.CommaJoin(constants.JOBS_FINALIZED)`). An empty list is
equivalent to specifying all finalized statuses (except
:pyeval:`constants.JOB_STATUS_CANCELED`, which is treated specially).
An opcode runs only once all its dependency requirements have been
fulfilled.
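
The following minimal sketch shows how such a dependency list could be
normalized; ``_NormalizeDepends`` is a hypothetical name, the constants
are those asserted above:

.. code-block:: python

   from ganeti import constants

   def _NormalizeDepends(depends):
     """Expand empty status lists to their default (illustrative sketch).

     Each entry is a (job_id, requested_statuses) tuple; an empty status
     list stands for all finalized statuses except cancellation.

     """
     default = constants.JOBS_FINALIZED - frozenset([
       constants.JOB_STATUS_CANCELED,
       ])

     return [(job_id, frozenset(statuses) if statuses else default)
             for (job_id, statuses) in depends]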

Any job referring to a cancelled job is also cancelled unless it
explicitly lists :pyeval:`constants.JOB_STATUS_CANCELED` as a requested
status.

In case a referenced job cannot be found in the normal queue or the
archive, referring jobs fail, as the status of the referenced job
cannot be determined.
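
Taken together, the rules above amount to a per-dependency decision
roughly like the following sketch; the function and result names are
illustrative, not actual queue constants:

.. code-block:: python

   from ganeti import constants

   # Hypothetical results of the dependency check
   (DEP_WAIT, DEP_CONTINUE, DEP_CANCEL, DEP_ERROR) = range(4)

   def _EvaluateDependency(dep_status, requested):
     """Decide what to do about one dependency (illustrative sketch).

     @param dep_status: status of the referenced job, or C{None} if it
       was found neither in the queue nor in the archive
     @param requested: set of requested statuses

     """
     if dep_status is None:
       # Status of the referenced job can't be determined
       return DEP_ERROR

     if dep_status not in constants.JOBS_FINALIZED:
       # Referenced job hasn't finished yet
       return DEP_WAIT

     if (dep_status == constants.JOB_STATUS_CANCELED and
         constants.JOB_STATUS_CANCELED not in requested):
       # Cancellations propagate unless explicitly requested
       return DEP_CANCEL

     if dep_status in requested:
       return DEP_CONTINUE

     # Finalized, but not with one of the requested statuses
     return DEP_ERROR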

With this change, clients can submit all wanted jobs in the right order
and proceed to wait for changes on all these jobs (see
``cli.JobExecutor``). The master daemon will take care of executing them
in the right order, while still presenting the client with a simple
interface.

Clients using the ``SubmitManyJobs`` interface can use relative job IDs
(negative integers) to refer to jobs in the same submission.
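
A sketch of how relative IDs could be resolved at submission time; the
helper name is made up, and ``-1`` is assumed to refer to the job
submitted immediately before within the same call:

.. code-block:: python

   def _ResolveJobDependencies(deps, prev_job_ids):
     """Convert relative job IDs to absolute ones (illustrative sketch).

     @param deps: list of (job_id, requested_statuses) tuples; a
       negative job ID refers to an earlier job in the same submission
     @param prev_job_ids: IDs of the jobs created so far for this
       submission, in order

     """
     result = []

     for (job_id, statuses) in deps:
       if job_id < 0:
         # -1 is the job submitted last, -2 the one before it, etc.
         try:
           job_id = prev_job_ids[job_id]
         except IndexError:
           raise ValueError("Relative job ID %s out of range" % job_id)
       result.append((job_id, statuses))

     return result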

.. highlight:: javascript

Example data structures::

  # First job
  {
    "job_id": "6151",
    "ops": [
      { "OP_ID": "OP_INSTANCE_REPLACE_DISKS", ..., },
      { "OP_ID": "OP_INSTANCE_FAILOVER", ..., },
      ],
  }

  # Second job, runs in parallel with first job
  {
    "job_id": "7687",
    "ops": [
      { "OP_ID": "OP_INSTANCE_MIGRATE", ..., },
      ],
  }

  # Third job, depending on success of previous jobs
  {
    "job_id": "9218",
    "ops": [
      { "OP_ID": "OP_NODE_SET_PARAMS",
        "depend": [
          [6151, ["success"]],
          [7687, ["success"]],
          ],
        "offline": True, },
      ],
  }


Implementation details
----------------------

Status while waiting for dependencies
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Jobs waiting for dependencies are no longer in the queue proper and
therefore need to change their status from "queued". While waiting for
opcode locks a job is in the "waiting" status (the constant is named
``JOB_STATUS_WAITLOCK``, but the actual value is ``waiting``). There
are the following possibilities:

#. Introduce a new status, e.g. "waitdeps".

   Pro:

   - Clients know for sure a job is waiting for dependencies, not locks

   Con:

   - Code and tests would have to be updated/extended for the new status
   - List of possible state transitions certainly wouldn't get simpler
   - Breaks backwards compatibility, older clients might get confused

#. Use existing "waiting" status.

   Pro:

   - No client changes necessary, less code churn (note that there are
     clients which don't live in Ganeti core)
   - Clients don't need to know the difference between waiting for a job
     and waiting for a lock; it doesn't make a difference
   - Fewer state transitions (see commit ``5fd6b69479c0``, which removed
     many state transitions and disk writes)

   Con:

   - Not immediately visible what a job is waiting for, but it's the
     same issue with locks; this is the reason why the lock monitor
     (``gnt-debug locks``) was introduced; job dependencies can be shown
     as "locks" in the monitor

Based on these arguments, the proposal is to do the following:

- Rename the ``JOB_STATUS_WAITLOCK`` constant to ``JOB_STATUS_WAITING``
  to reflect its actual meaning: the job is waiting for something
- While waiting for dependencies and locks, jobs are in the "waiting"
  status
- Export dependency information in the lock monitor; example output::

    Name      Mode Owner Pending
    job/27491 -    -     success:job/34709,job/21459
    job/21459 -    -     success,error:job/14513
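
The "Pending" column could be produced by a small formatting helper
along these lines; the function name is made up for illustration:

.. code-block:: python

   def _FormatDependencies(deps):
     """Format job dependencies for the lock monitor (illustrative).

     Groups referenced jobs by their requested statuses, producing
     strings such as "success:job/34709,job/21459".

     """
     groups = {}

     for (job_id, statuses) in deps:
       key = ",".join(statuses)
       groups.setdefault(key, []).append("job/%s" % job_id)

     return " ".join("%s:%s" % (key, ",".join(jobs))
                     for (key, jobs) in sorted(groups.items()))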


Cost of deserialization
~~~~~~~~~~~~~~~~~~~~~~~

To determine the status of a dependency job the job queue must have
access to its data structure. Other queue operations already do this,
e.g. archiving, watching a job's progress and querying jobs.

Initially (Ganeti 2.0/2.1) the job queue shared the job objects
in memory and protected them using locks. Ganeti 2.2 (see :doc:`design
document <design-2.2>`) changed the queue to read and deserialize jobs
from disk. This significantly reduced locking and code complexity.
Nowadays inotify is used to wait for changes on job files when watching
a job's progress.

Reading from disk and deserializing certainly has some cost associated
with it, but it's a significantly simpler architecture than
synchronizing in memory with locks. At the stage where dependencies are
evaluated the queue lock is held in shared mode, so different workers
can read at the same time (deliberately ignoring CPython's interpreter
lock).

It is expected that the majority of executed jobs won't use
dependencies and therefore won't be affected.
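
For illustration only, determining a dependency's status could boil
down to something like the sketch below; the file layout and the
top-level ``status`` field are simplifying assumptions, not the actual
on-disk format:

.. code-block:: python

   import errno
   import os.path

   from ganeti import serializer

   def _GetDependencyJobStatus(queue_dir, job_id):
     """Load a job file and return its status (simplified sketch).

     Returns C{None} if the job exists neither in the queue nor in the
     archive, in which case the referring job must fail.

     """
     for dirname in [queue_dir, os.path.join(queue_dir, "archive")]:
       path = os.path.join(dirname, "job-%s" % job_id)
       try:
         fd = open(path)
       except IOError as err:
         if err.errno != errno.ENOENT:
           raise
         continue
       try:
         data = serializer.LoadJson(fd.read())
       finally:
         fd.close()
       # The real queue derives a job's status from its opcodes; a
       # top-level "status" field is assumed here for brevity
       return data.get("status")

     return None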


Other discussed solutions
=========================

Job-level attribute
-------------------

At first glance it might seem better to put dependencies on previous
jobs at the job level. However, it turns out that having the option of
defining only a single opcode in a job as having such a dependency can
be useful as well. The code complexity in the job queue is equivalent,
if not lower.

Since opcodes are guaranteed to run in order, clients can just define
the dependency on the first opcode.
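
For example, in a two-opcode job that must wait for job 6151, only the
first opcode needs to carry the dependency (other opcode fields elided
as in the examples above)::

  {
    "ops": [
      # Delays the whole job; the second opcode can't start before the
      # first has finished
      { "OP_ID": "OP_INSTANCE_SHUTDOWN",
        "depend": [[6151, ["success"]]], ..., },
      { "OP_ID": "OP_INSTANCE_REMOVE", ..., },
      ],
  }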

Another reason for the choice of an opcode-level attribute is that the
current LUXI interface for submitting jobs is a bit restricted and would
need to be changed to allow the addition of job-level attributes,
potentially requiring changes in all LUXI clients and/or breaking
backwards compatibility.


Client-side logic
-----------------

There is at least one implementation of a batched job executor
embedded in the ``burnin`` tool's code. While certainly possible, a
client-side solution should be avoided due to the variety of clients
already in use. For one, the :doc:`remote API <rapi>` client shouldn't
import non-standard modules. htools are written in Haskell and can't
use Python modules. A batched job executor contains a fair amount of
logic. Even if cleanly abstracted in a (Python) library, sharing code
between different clients is difficult if not impossible.


.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: