design-daemons.rst 24.6 KB
Newer Older
Michele Tartara's avatar
Michele Tartara committed
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
==========================
Ganeti daemons refactoring
==========================

.. contents:: :depth: 2

This is a design document detailing the plan for refactoring the internal
structure of Ganeti, and particularly the set of daemons it is divided into.


Current state and shortcomings
==============================

Ganeti is comprised of a growing number of daemons, each dealing with part of
the tasks the cluster has to face, and communicating with the other daemons
using a variety of protocols.

Specifically, as of Ganeti 2.8, the situation is as follows:

``Master daemon (MasterD)``
  It is responsible for managing the entire cluster, and it's written in Python.
  It is executed on a single node (the master node). It receives the commands
  given by the cluster administrator (through the remote API daemon or the
  command line tools) over the LUXI protocol.  The master daemon is responsible
  for creating and managing the jobs that will execute such commands, and for
  managing the locks that ensure the cluster will not incur in race conditions.

  Each job is managed by a separate Python thread, that interacts with the node
  daemons via RPC calls.

  The master daemon is also responsible for managing the configuration of the
  cluster, changing it when required by some job. It is also responsible for
  copying the configuration to the other master candidates after updating it.

``RAPI daemon (RapiD)``
  It is written in Python and runs on the master node only. It waits for
  requests issued remotely through the remote API protocol. Then, it forwards
  them, using the LUXI protocol, to the master daemon (if they are commands) or
  to the query daemon if they are queries about the configuration (including
  live status) of the cluster.

``Node daemon (NodeD)``
  It is written in Python. It runs on all the nodes. It is responsible for
  receiving the master requests over RPC and execute them, using the appropriate
  backend (hypervisors, DRBD, LVM, etc.). It also receives requests over RPC for
  the execution of queries gathering live data on behalf of the query daemon.

``Configuration daemon (ConfD)``
  It is written in Haskell. It runs on all the master candidates. Since the
  configuration is replicated only on the master node, this daemon exists in
  order to provide information about the configuration to nodes needing them.
  The requests are done through ConfD's own protocol, HMAC signed,
  implemented over UDP, and meant to be used by parallely querying all the
  master candidates (or a subset thereof) and getting the most up to date
  answer. This is meant as a way to provide a robust service even in case master
  is temporarily unavailable.

``Query daemon (QueryD)``
  It is written in Haskell. It runs on all the master candidates. It replies
  to Luxi queries about the current status of the system, including live data it
  obtains by querying the node daemons through RPCs.

``Monitoring daemon (MonD)``
  It is written in Haskell. It runs on all nodes, including the ones that are
  not vm-capable. It is meant to provide information on the status of the
  system. Such information is related only to the specific node the daemon is
  running on, and it is provided as JSON encoded data over HTTP, to be easily
  readable by external tools.
  The monitoring daemon communicates with ConfD to get information about the
  configuration of the cluster. The choice of communicating with ConfD instead
  of MasterD allows it to obtain configuration information even when the cluster
  is heavily degraded (e.g.: when master and some, but not all, of the master
  candidates are unreachable).

The current structure of the Ganeti daemons is inefficient because there are
many different protocols involved, and each daemon needs to be able to use
multiple ones, and has to deal with doing different things, thus making
sometimes unclear which daemon is responsible for performing a specific task.

Also, with the current configuration, jobs are managed by the master daemon
using python threads. This makes terminating a job after it has started a
difficult operation, and it is the main reason why this is not possible yet.

The master daemon currently has too many different tasks, that could be handled
better if split among different daemons.


Proposed changes
================

In order to improve on the current situation, a new daemon subdivision is
proposed, and presented hereafter.

.. digraph:: "new-daemons-structure"

  {rank=same; RConfD LuxiD;}
  {rank=same; Jobs rconfigdata;}
  node [shape=box]
  RapiD [label="RapiD [M]"]
  LuxiD [label="LuxiD [M]"]
  WConfD [label="WConfD [M]"]
  Jobs [label="Jobs [M]"]
  RConfD [label="RConfD [MC]"]
  MonD [label="MonD [All]"]
  NodeD [label="NodeD [All]"]
  Clients [label="gnt-*\nclients [M]"]
  p1 [shape=none, label=""]
  p2 [shape=none, label=""]
  p3 [shape=none, label=""]
  p4 [shape=none, label=""]
  configdata [shape=none, label="config.data"]
  rconfigdata [shape=none, label="config.data\n[MC copy]"]
  locksdata [shape=none, label="locks.data"]

  RapiD -> LuxiD [label="LUXI"]
  LuxiD -> WConfD [label="WConfD\nproto"]
  LuxiD -> Jobs [label="fork/exec"]
  Jobs -> WConfD [label="WConfD\nproto"]
  Jobs -> NodeD [label="RPC"]
  LuxiD -> NodeD [label="RPC"]
  rconfigdata -> RConfD
  configdata -> rconfigdata [label="sync via\nNodeD RPC"]
  WConfD -> NodeD [label="RPC"]
  WConfD -> configdata
  WConfD -> locksdata
  MonD -> RConfD [label="RConfD\nproto"]
  Clients -> LuxiD [label="LUXI"]
  p1 -> MonD [label="MonD proto"]
  p2 -> RapiD [label="RAPI"]
  p3 -> RConfD [label="RConfD\nproto"]
  p4 -> Clients [label="CLI"]

``LUXI daemon (LuxiD)``
  It will be written in Haskell. It will run on the master node and it will be
  the only LUXI server, replying to all the LUXI queries. These includes both
  the queries about the live configuration of the cluster, previously served by
  QueryD, and the commands actually changing the status of the cluster by
  submitting jobs. Therefore, this daemon will also be the one responsible with
  managing the job queue. When a job needs to be executed, the LuxiD will spawn
  a separate process tasked with the execution of that specific job, thus making
  it easier to terminate the job itself, if needeed.  When a job requires locks,
  LuxiD will request them from WConfD.
  In order to keep availability of the cluster in case of failure of the master
  node, LuxiD will replicate the job queue to the other master candidates, by
  RPCs to the NodeD running there (the choice of RPCs for this task might be
  reviewed at a second time, after implementing this design).

``Configuration management daemon (WConfD)``
  It will run on the master node and it will be responsible for the management
  of the authoritative copy of the cluster configuration (that is, it will be
  the daemon actually modifying the ``config.data`` file). All the requests of
  configuration changes will have to pass through this daemon, and will be
  performed using a LUXI-like protocol ("WConfD proto" in the graph. The exact
  protocol will be defined in the separate design document that will detail the
  WConfD separation).  Having a single point of configuration management will
  also allow Ganeti to get rid of possible race conditions due to concurrent
  modifications of the configuration.  When the configuration is updated, it
  will have to push the received changes to the other master candidates, via
  RPCs, so that RConfD daemons and (in case of a failure on the master node)
  the WConfD daemon on the new master can access an up-to-date version of it
  (the choice of RPCs for this task might be reviewed at a second time). This
  daemon will also be the one responsible for managing the locks, granting them
  to the jobs requesting them, and taking care of freeing them up if the jobs
  holding them crash or are terminated before releasing them.  In order to do
  this, each job, after being spawned by LuxiD, will open a local unix socket
  that will be used to communicate with it, and will be destroyed when the job
  terminates.  LuxiD will be able to check, after a timeout, whether the job is
  still running by connecting here, and to ask WConfD to forcefully remove the
  locks if the socket is closed.
  Also, WConfD should hold a serialized list of the locks and their owners in a
  file (``locks.data``), so that it can keep track of their status in case it
  crashes and needs to be restarted (by asking LuxiD which of them are still
  running).
  Interaction with this daemon will be performed using Unix sockets.

``Configuration query daemon (RConfD)``
  It is written in Haskell, and it corresponds to the old ConfD. It will run on
  all the master candidates and it will serve information about the the static
  configuration of the cluster (the one contained in ``config.data``). The
  provided information will be highly available (as in: a response will be
  available as long as a stable-enough connection between the client and at
  least one working master candidate is available) and its freshness will be
  best effort (the most recent reply from any of the master candidates will be
  returned, but it might still be older than the one available through WConfD).
  The information will be served through the ConfD protocol.

``Rapi daemon (RapiD)``
  It remains basically unchanged, with the only difference that all of its LUXI
  query are directed towards LuxiD instead of being split between MasterD and
  QueryD.

``Monitoring daemon (MonD)``
  It remains unaffected by the changes in this design document. It will just get
  some of the data it needs from RConfD instead of the old ConfD, but the
  interfaces of the two are identical.

``Node daemon (NodeD)``
  It remains unaffected by the changes proposed in the design document. The only
  difference being that it will receive its RPCs from LuxiD (for job queue
  replication), from WConfD (for configuration replication) and for the
  processes executing single jobs (for all the operations to be performed by
  nodes) instead of receiving them just from MasterD.

This restructuring will allow us to reorganize and improve the codebase,
introducing cleaner interfaces and giving well defined and more restricted tasks
to each daemon.

Furthermore, having more well-defined interfaces will allow us to have easier
upgrade procedures, and to work towards the possibility of upgrading single
components of a cluster one at a time, without the need for immediately
upgrading the entire cluster in a single step.


Implementation
==============

While performing this refactoring, we aim to increase the amount of
Haskell code, thus benefiting from the additional type safety provided by its
wide compile-time checks. In particular, all the job queue management and the
configuration management daemon will be written in Haskell, taking over the role
currently fulfilled by Python code executed as part of MasterD.

The changes describe by this design document are quite extensive, therefore they
will not be implemented all at the same time, but through a sequence of steps,
leaving the codebase in a consistent and usable state.

#. Rename QueryD to LuxiD.
   A part of LuxiD, the one replying to configuration
   queries including live information about the system, already exists in the
   form of QueryD. This is being renamed to LuxiD, and will form the first part
   of the new daemon. NB: this is happening starting from Ganeti 2.8. At the
   beginning, only the already existing queries will be replied to by LuxiD.
   More queries will be implemented in the next versions.

#. Let LuxiD be the interface for the queries and MasterD be their executor.
   Currently, MasterD is the only responsible for receiving and executing LUXI
   queries, and for managing the jobs they create.
   Receiving the queries and managing the job queue will be extracted from
   MasterD into LuxiD.
   Actually executing jobs will still be done by MasterD, that contains all the
   logic for doing that and for properly managing locks and the configuration.
242
243
   At this stage, scheduling will simply consist in starting jobs until a fixed
   maximum number of simultaneously running jobs is reached.
Michele Tartara's avatar
Michele Tartara committed
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263

#. Extract WConfD from MasterD.
   The logic for managing the configuration file is factored out to the
   dedicated WConfD daemon. All configuration changes, currently executed
   directly by MasterD, will be changed to be IPC requests sent to the new
   daemon.

#. Extract locking management from MasterD.
   The logic for managing and granting locks is extracted to WConfD as well.
   Locks will not be taken directly anymore, but asked via IPC to WConfD.
   This step can be executed on its own or at the same time as the previous one.

#. Jobs are executed as processes.
   The logic for running jobs is rewritten so that each job can be managed by an
   independent process. LuxiD will spawn a new (Python) process for every single
   job. The RPCs will remain unchanged, and the LU code will stay as is as much
   as possible.
   MasterD will cease to exist as a deamon on its own at this point, but not
   before.

264
265
266
267
268
#. Improve job scheduling algorithm.
   The simple algorithm for scheduling jobs will be replaced by a more
   intelligent one. Also, the implementation of :doc:`design-optables` can be
   started.

269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
Job death detection
-------------------

**Requirements:**

- It must be possible to reliably detect a death of a process even under
  uncommon conditions such as very heavy system load.
- A daemon must be able to detect a death of a process even if the
  daemon is restarted while the process is running.
- The solution must not rely on being able to communicate with
  a process.
- The solution must work for the current situation where multiple jobs
  run in a single process.
- It must be POSIX compliant.

These conditions rule out simple solutions like checking a process ID
(because the process might be eventually replaced by another process
with the same ID) or keeping an open connection to a process.

**Solution:** As a job process is spawned, before attempting to
communicate with any other process, it will create a designated empty
lock file, open it, acquire an *exclusive* lock on it, and keep it open.
When connecting to a daemon, the job process will provide it with the
path of the file. If the process dies unexpectedly, the operating system
kernel automatically cleans up the lock.

Therefore, daemons can check if a process is dead by trying to acquire
a *shared* lock on the lock file in a non-blocking mode:

- If the locking operation succeeds, it means that the exclusive lock is
  missing, therefore the process has died, but the lock
  file hasn't been cleaned up yet. The daemon should release the lock
  immediately. Optionally, the daemon may delete the lock file.
- If the file is missing, the process has died and the lock file has
  been cleaned up.
- If the locking operation fails due to a lock conflict, it means
  the process is alive.

Using shared locks for querying lock files ensures that the detection
works correctly even if multiple daemons query a file at the same time.

A job should close and remove its lock file when completely finishes.
The WConfD daemon will be responsible for removing stale lock files of
jobs that didn't remove its lock files themselves.

314
315
316
317
318
319
320
321
322
**Statelessness of the protocol:** To keep our protocols stateless,
the job id and the path the to lock file are sent as part of every
request that deals with resources, in particular the Ganeti Locks.
All resources are owned by the pair (job id, lock file). In this way,
several jobs can live in the same process (as it will be in the
transition period), but owner death detection still only depends on the
owner of the resource. In particular, no additional lookup table is
needed to obtain the lock file for a given owner.

323
324
325
326
327
328
329
**Considered alternatives:** An alternative to creating a separate lock
file would be to lock the job status file. However, file locks are kept
only as long as the file is open. Therefore any operation followed by
closing the file would cause the process to release the lock. In
particular, with jobs as threads, the master daemon wouldn't be able to
keep locks and operate on job files at the same time.

330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
WConfD details
--------------

WConfD will communicate with its clients through a Unix domain socket for both
configuration management and locking. Clients can issue multiple RPC calls
through one socket. For each such a call the client sends a JSON request
document with a remote function name and data for its arguments. The server
replies with a JSON response document containing either the result of
signalling a failure.

Any state associated with client processes will be mirrored on persistent
storage and linked to the identity of processes so that the WConfD daemon will
be able to resume its operation at any point after a restart or a crash. WConfD
will track each client's process start time along with its process ID to be
able detect if a process dies and it's process ID is reused.  WConfD will clear
all locks and other state associated with a client if it detects it's process
no longer exists.

Configuration management
++++++++++++++++++++++++

The new configuration management protocol will be implemented in the following
steps:

354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
Step 1:
  #. Implement the following functions in WConfD and export them through
     RPC:

     - Obtain a single internal lock, either in shared or
       exclusive mode. This lock will substitute the current lock
       ``_config_lock`` in config.py.
     - Release the lock.
     - Return the whole configuration data to a client.
     - Receive the whole configuration data from a client and replace the
       current configuration with it. Distribute it to master candidates
       and distribute the corresponding *ssconf*.

     WConfD must detect deaths of its clients (see `Job death
     detection`_) and release locks automatically.

  #. In config.py modify public methods that access configuration:

     - Instead of acquiring a local lock, obtain a lock from WConfD
       using the above functions
     - Fetch the current configuration from WConfD.
     - Use it to perform the method's task.
     - If the configuration was modified, send it to WConfD at the end.
     - Release the lock to WConfD.

  This will decouple the configuration management from the master daemon,
  even though the specific configuration tasks will still performed by
  individual jobs.

  After this step it'll be possible access the configuration from separate
  processes.

Step 2:
  #. Reimplement all current methods of ``ConfigWriter`` for reading and
     writing the configuration of a cluster in WConfD.
  #. Expose each of those functions in WConfD as a separate RPC function.
     This will allow easy future extensions or modifications.
  #. Replace ``ConfigWriter`` with a stub (preferably automatically
     generated from the Haskell code) that will contain the same methods
     as the current ``ConfigWriter`` and delegate all calls to its
     methods to WConfD.

Step 3:
  #. Remove WConfD's RPC functions for obtaining/releasing the single
     internal lock from Step 1.
  #. Remove WConfD's RPC functions for sending/receiving the whole
     configuration from Step 1.
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446

Future aims:

-  Optionally refactor the RPC calls to reduce their number or improve their
   efficiency (for example by obtaining a larger set of data instead of
   querying items one by one).

Locking
+++++++

The new locking protocol will be implemented as follows:

Re-implement the current locking mechanism in WConfD and expose it for RPC
calls. All current locks will be mapped into a data structure that will
uniquely identify them (storing lock's level together with it's name).

WConfD will impose a linear order on locks. The order will be compatible
with the current ordering of lock levels so that existing code will work
without changes.

WConfD will keep the set of currently held locks for each client. The
protocol will allow the following operations on the set:

*Update:*
  Update the current set of locks according to a given list. The list contains
  locks and their desired level (release / shared / exclusive). To prevent
  deadlocks, WConfD will check that all newly requested locks (or already held
  locks requested to be upgraded to *exclusive*) are greater in the sense of
  the linear order than all currently held locks, and fail the operation if
  not. Only the locks in the list will be updated, other locks already held
  will be left intact. If the operation fails, the client's lock set will be
  left intact.
*Opportunistic union:*
  Add as much as possible locks from a given set to the current set within a
  given timeout. WConfD will again check the proper order of locks and
  acquire only the ones that are allowed wrt. the current set.  Returns the
  set of acquired locks, possibly empty. Immediate. Never fails. (It would also
  be possible to extend the operation to try to wait until a given number of
  locks is available, or a given timeout elapses.)
*List:*
  List the current set of held locks. Immediate, never fails.
*Intersection:*
  Retain only a given set of locks in the current one. This function is
  provided for convenience, it's redundant wrt. *list* and *update*. Immediate,
  never fails.

447
448
449
450
451
452
453
454
455
456
457
Addidional restrictions due to lock implications:
  Ganeti supports locks that act as if a lock on a whole group (like all nodes)
  were held. To avoid dead locks caused by the additional blockage of those
  group locks, we impose certain restrictions. Whenever `A` is a group lock and
  `B` belongs to `A`, then the following holds.

  - `A` is in lock order before `B`.
  - All locks that are in the lock order between `A` and `B` also belong to `A`.
  - It is considered a lock-order violation to ask for an exclusive lock on `B`
    while holding a shared lock on `A`.

458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
After this step it'll be possible to use locks from jobs as separate processes.

The above set of operations allows the clients to use various work-flows. In particular:

Pessimistic strategy:
  Lock all potentially relevant resources (for example all nodes), determine
  which will be needed, and release all the others.
Optimistic strategy:
  Determine what locks need to be acquired without holding any. Lock the
  required set of locks. Determine the set of required locks again and check if
  they are all held. If not, release everything and restart.

.. COMMENTED OUT:
  Start with the smallest set of locks and when determining what more
  relevant resources will be needed, expand the set. If an *union* operation
  fails, release all locks, acquire the desired union and restart the
  operation so that all preconditions and possible concurrent changes are
  checked again.

Future aims:

-  Add more fine-grained locks to prevent unnecessary blocking of jobs. This
   could include locks on parameters of entities or locks on their states (so that
   a node remains online, but otherwise can change, etc.). In particular,
   adding, moving and removing instances currently blocks the whole node.
-  Add checks that all modified configuration parameters belong to entities
   the client has locked and log violations.
-  Make the above checks mandatory.
-  Automate optimistic locking and checking the locks in logical units.
   For example, this could be accomplished by allowing some of the initial
   phases of `LogicalUnit` (such as `ExpandNames` and `DeclareLocks`) to be run
   repeatedly, checking if the set of locks requested the second time is
   contained in the set acquired after the first pass.
-  Add the possibility for a job to reserve hardware resources such as disk
   space or memory on nodes. Most likely as a new, special kind of instances
   that would only block its resources and allow to be converted to a regular
   instance. This would allow long-running jobs such as instance creation or
   move to lock the corresponding nodes, acquire the resources and turn the
   locks into shared ones, keeping an exclusive lock only on the instance.
-  Use more sophisticated algorithm for preventing deadlocks such as a
   `wait-for graph`_. This would allow less *union* failures and allow more
   optimistic, scalable acquisition of locks.

.. _`wait-for graph`: http://en.wikipedia.org/wiki/Wait-for_graph


Michele Tartara's avatar
Michele Tartara committed
504
505
506
507
508
Further considerations
======================

There is a possibility that a job will finish performing its task while LuxiD
and/or WConfD will not be available.
509
510
511
512
513
514
515
In order to deal with this situation, each job will update its job file
in the queue. This is race free, as LuxiD will no longer touch the job file,
once the job is started; a corollary of this is that the job also has to
take care of replicating updates to the job file. LuxiD will watch job files for
changes to determine when a job as cleanly finished. To determine jobs
that died without having the chance of updating the job file, the `Job death
detection`_ mechanism will be used.
Michele Tartara's avatar
Michele Tartara committed
516
517
518
519
520
521

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: