Commit 6d2e1c12 authored by Michele Tartara

Add design document for the "reason trail"

This commit adds the design document for introducing "reason trails",
tracing the reason why opcodes are executed, step by step.
Signed-off-by: Michele Tartara <mtartara@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
parent f511082f
......@@ -398,6 +398,7 @@ docinput = \
doc/design-partitioned.rst \
doc/design-query-splitting.rst \
doc/design-query2.rst \
doc/design-reason-trail.rst \
doc/design-resource-model.rst \
doc/design-restricted-commands.rst \
doc/design-shared-storage.rst \
......
......@@ -17,6 +17,7 @@ Design document drafts
design-monitoring-agent.rst
design-hroller.rst
design-storagespace.rst
design-reason-trail.rst
.. vim: set textwidth=72 :
.. Local Variables:
......
===================
Ganeti reason trail
===================
.. contents:: :depth: 2
This is a design document detailing the implementation of a way for Ganeti to
track the origin and the reason of every executed command, from its starting
point (command line, remote API, some htool, etc.) to its actual execution
time.
Current state and shortcomings
==============================
There is currently no way to track why a job, and all the operations that are
part of it, were executed, or who or what triggered the execution.
This is inconvenient in general, and it also makes it impossible to obtain
certain information, such as the reason why an instance last changed its
status (i.e. why it was started, stopped, rebooted, etc.), or to distinguish
an admin request from scheduled maintenance or an automated tool's work.
Proposed changes
================
We propose to introduce a new piece of information, called the "reason trail",
to track the path from the issuing of a command to its execution.
The reason trail will be a list of 3-tuples ``(source, reason, timestamp)``,
with:
``source``
The entity deciding to perform (or forward) a command.
It is represented by an arbitrary string, but strings prefixed with "gnt:"
are reserved for Ganeti components, and the interfaces towards the external
world will refuse them.
``reason``
The reason why the entity decided to perform the operation.
It is represented by an arbitrary string. The string may be empty, because
certain components of the system might just "pass on" the operation (and
therefore want to be recorded in the trail) without having an explicit
reason.
``timestamp``
The time when the element was added to the reason trail. It has to be
expressed in nanoseconds since the Unix epoch (00:00:00 UTC, January 1, 1970).
If not enough precision is available (or needed) it can be padded with
zeroes.
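The three fields described above can be sketched in Python as follows. This is
only an illustration under stated assumptions: the helper names
(``make_entry``, ``check_external_source``, ``seconds_to_ns``) are hypothetical
and are not part of Ganeti. The sketch shows how an external interface could
refuse the reserved "gnt:" prefix and how a coarser timestamp can be padded
with zeroes.

```python
import time

# Prefix reserved for Ganeti components, as stated in this design.
RESERVED_PREFIX = "gnt:"


def seconds_to_ns(seconds):
    """Pad a whole-second Unix timestamp with zeroes to get nanoseconds."""
    return int(seconds) * 10**9


def make_entry(source, reason="", timestamp=None):
    """Build one (source, reason, timestamp) element of a reason trail.

    The reason may be empty; the timestamp defaults to the current time
    in nanoseconds since the Unix epoch.
    """
    if timestamp is None:
        timestamp = time.time_ns()
    return (source, reason, timestamp)


def check_external_source(source):
    """Hypothetical check for sources coming from the external world."""
    if source.startswith(RESERVED_PREFIX):
        raise ValueError("source %r uses the reserved prefix %r"
                         % (source, RESERVED_PREFIX))
    return source
```

For instance, ``seconds_to_ns(1363088484)`` yields ``1363088484000000000``,
the timestamp used for the "user" element in the examples of this document.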
The reason trail will be attached at the OpCode level. When it has to be
serialized externally (such as on the RAPI interface), it will be serialized in
JSON format. Specifically, it will be serialized as a list of elements.
Each element will be a list with two strings (for ``source`` and ``reason``)
and one integer number (the ``timestamp``).
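A minimal sketch of this serialization, using Python's standard ``json``
module. The function names ``trail_to_json`` and ``trail_from_json`` are
hypothetical and chosen only for illustration; the design does not prescribe
them.

```python
import json


def trail_to_json(trail):
    """Serialize a reason trail as a JSON list of [source, reason, timestamp]."""
    return json.dumps([[source, reason, timestamp]
                       for source, reason, timestamp in trail])


def trail_from_json(text):
    """Parse a serialized reason trail back into a list of 3-tuples."""
    trail = []
    for element in json.loads(text):
        # Unpacking raises ValueError if the element does not have 3 items.
        source, reason, timestamp = element
        if not (isinstance(source, str) and isinstance(reason, str)
                and isinstance(timestamp, int)):
            raise ValueError("malformed reason trail element: %r" % (element,))
        trail.append((source, reason, timestamp))
    return trail
```

Since JSON has no tuple type, each 3-tuple becomes a three-element list on the
wire and is converted back to a tuple when parsed.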
Any component the operation goes through is allowed (but not required) to append
its own reason to the list. Other than this, the list shouldn't be modified.
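The append-only rule can be illustrated with a small helper. This is a sketch,
not the actual implementation: ``append_reason`` is a hypothetical name, and
returning a new list (rather than mutating in place) is an assumption made here
to keep earlier elements visibly untouched.

```python
def append_reason(trail, source, reason, timestamp):
    """Return a new trail with one element appended at the end.

    Existing elements are copied as they are and never reordered or edited.
    """
    return list(trail) + [(source, reason, timestamp)]


# A component that merely passes the operation on records an empty reason.
trail = [("user", "Cleanup of unused instances", 1363088484000000000)]
trail = append_reason(trail, "gnt:client:rapi:shutdown", "",
                      1363088484020000000)
```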
As an example here is the reason trail for a shutdown operation invoked from
the command line through the gnt-instance tool::

  [("user", "Cleanup of unused instances", 1363088484000000000),
   ("gnt:client:gnt-instance", "stop", 1363088484020000000),
   ("gnt:opcode:shutdown", "job=1234;index=0", 1363088484026000000),
   ("gnt:daemon:noded:shutdown", "", 1363088484135000000)]

where the first 3-tuple is determined by a user-specified message, passed to
gnt-instance through a command line parameter.
The same operation, launched by an external GUI tool, and executed through the
remote API, would have a reason trail like::

  [("user", "Cleanup of unused instances", 1363088484000000000),
   ("other-app:tool-name", "gui:stop", 1363088484000300000),
   ("gnt:client:rapi:shutdown", "", 1363088484020000000),
   ("gnt:library:rlib2:shutdown", "", 1363088484023000000),
   ("gnt:opcode:shutdown", "job=1234;index=0", 1363088484026000000),
   ("gnt:daemon:noded:shutdown", "", 1363088484135000000)]

Implementation
==============
The OpCode base class will be modified to include a new field, OP_REASON.
This will receive the reason trail as built by all the previous steps.
When an OpCode is added to a job (in jqueue.py) the job number and the opcode
index will be recorded as the reason for the existence of that opcode.
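The element recorded at this step could look like the sketch below. The
``gnt:opcode:<name>`` source and the ``job=<id>;index=<n>`` reason format are
inferred from the examples in this document; the function name
``opcode_reason`` is hypothetical and does not refer to existing code in
jqueue.py.

```python
def opcode_reason(opcode_name, job_id, index, timestamp):
    """Build the trail element recording why an opcode exists in a job.

    The string formats mirror the examples given in this design document.
    """
    source = "gnt:opcode:%s" % opcode_name
    reason = "job=%d;index=%d" % (job_id, index)
    return (source, reason, timestamp)
```

With the values from the shutdown example, ``opcode_reason("shutdown", 1234,
0, 1363088484026000000)`` gives
``("gnt:opcode:shutdown", "job=1234;index=0", 1363088484026000000)``.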
The implementation of this design will start from the operations that affect the
instance status. They will be changed so that the "reason" is passed to them.
They will then export the new expected instance status, together with the
associated reason, for use by the monitoring daemon.
.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: