From 109e07c237d5a0c768fa8de43f985a71accc4511 Mon Sep 17 00:00:00 2001
From: Guido Trotter <ultrotter@google.com>
Date: Wed, 10 Oct 2012 12:07:15 +0200
Subject: [PATCH] Add cluster monitoring agent design document

This design addresses the lack of a uniform way to query ganeti nodes
for real time information that can be used by monitoring.

Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
---
 Makefile.am                     |   1 +
 doc/design-draft.rst            |   1 +
 doc/design-monitoring-agent.rst | 290 ++++++++++++++++++++++++++++++++
 3 files changed, 292 insertions(+)
 create mode 100644 doc/design-monitoring-agent.rst

diff --git a/Makefile.am b/Makefile.am
index fdecb35a3..85c38f27a 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -364,6 +364,7 @@ docrst = \
 	doc/design-resource-model.rst \
 	doc/design-shared-storage.rst \
 	doc/design-ssh-setup.rst \
+	doc/design-monitoring-agent.rst \
 	doc/design-virtual-clusters.rst \
 	doc/design-x509-ca.rst \
 	doc/devnotes.rst \
diff --git a/doc/design-draft.rst b/doc/design-draft.rst
index 36dba73fd..d08a6fac1 100644
--- a/doc/design-draft.rst
+++ b/doc/design-draft.rst
@@ -16,6 +16,7 @@ Design document drafts
    design-autorepair.rst
    design-partitioned.rst
    design-ssh-setup.rst
+   design-monitoring-agent.rst
    design-remote-commands.rst
 
 .. vim: set textwidth=72 :
diff --git a/doc/design-monitoring-agent.rst b/doc/design-monitoring-agent.rst
new file mode 100644
index 000000000..14fff7f45
--- /dev/null
+++ b/doc/design-monitoring-agent.rst
@@ -0,0 +1,290 @@
+=======================
+Ganeti monitoring agent
+=======================
+
+.. contents:: :depth: 4
+
+This is a design document detailing the implementation of a Ganeti
+monitoring agent reporting system, which can be queried by a
+monitoring system to calculate health information for a Ganeti
+cluster.
+
+Current state and shortcomings
+==============================
+
+There is currently no monitoring support in Ganeti. While we don't want
+to build something like Nagios or Pacemaker as part of Ganeti, it would
+be useful if such tools could easily extract information from a Ganeti
+machine in order to take actions (example actions include logging an
+outage for future reporting or alerting a person or system about it).
+
+Proposed changes
+================
+
+Each Ganeti node should export a status page that can be queried by a
+monitoring system. Such a status page will be exported on a network
+port and will be encoded in JSON (simple text) over HTTP.
+
+The choice of JSON is natural, as we already depend on it in Ganeti
+and thus don't need to add extra libraries to use it, as we would for
+XML or some other markup format.
+
+Location of agent report
+------------------------
+
+The report will be available from all nodes, and will cover all
+node-local resources. This allows more real-time information to be
+available, at the cost of querying all nodes.
+
+Information reported
+--------------------
+
+The monitoring agent system will report on the following basic
+information (an illustrative sketch of the overall report layout
+follows the list):
+
+- Instance status
+- Instance disk status
+- Status of storage for instances
+- Ganeti daemons status, CPU usage, memory footprint
+- Hypervisor resources report (memory, CPU, network interfaces)
+- Node OS resources report (memory, CPU, network interfaces)
+- Information from a plugin system
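+
+As a purely illustrative sketch (all key names here are hypothetical,
+and the actual layout will be defined in the "Format of the report"
+section), the overall report could group these categories roughly as
+follows:
+
+.. code-block:: json
+
+  {
+    "instances": [],
+    "disks": [],
+    "storage": [],
+    "daemons": [],
+    "hypervisor": {},
+    "node_os": {},
+    "plugins": []
+  }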
+
+Instance status
++++++++++++++++
+
+At the moment each node knows which instances are running on it and
+which instances it is primary for, but not why an instance might not
+be running. On the other hand we don't want to distribute full
+instance "admin" status information to all nodes, because of the
+performance impact this would have.
+
+As such we propose that:
+
+- Any operation that can affect instance status will have an optional
+  "reason" attached to it (at opcode level). This can be used, for
+  example, to distinguish an admin request from a scheduled
+  maintenance or an automated tool's work. If this reason is not
+  passed, Ganeti will just use the information it has about the
+  source of the request: for example a CLI shutdown operation will
+  have "cli:shutdown" as its reason, and a CLI failover operation
+  will have "cli:failover". Operations coming from the remote API
+  will use "rapi" instead of "cli". Of course setting a real
+  site-specific reason is still preferred (a sketch of how a reason
+  could be attached to an opcode is shown after this list).
+- RPCs that affect the instance status will be changed so that the
+  "reason" and the version of the config object they ran on are
+  passed to them. They will then export the new expected instance
+  status, together with the associated reason and object version, to
+  the status report system, which will in turn export them.
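+
+As a sketch of the first point, a shutdown opcode carrying a
+site-specific reason could look roughly as follows (the opcode
+serialization is simplified, and the exact attribute name and format
+of the reason are not decided yet):
+
+.. code-block:: json
+
+  {
+    "OP_ID": "OP_INSTANCE_SHUTDOWN",
+    "instance_name": "instance1.example.com",
+    "reason": "site: scheduled kernel upgrade"
+  }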
+
+Monitoring and auditing systems can then use the reason to understand
+the cause of an instance status, and they can use the object version
+to understand the freshness of their data even in the absence of
+atomic cross-node reporting: for example, if they see an instance
+"up" on a node after seeing it running on a previous one, they can
+compare these values to understand which data is fresher, and repoll
+the "older" node. Of course if this situation persists it represents
+an error (either an instance continuously "flapping" between nodes,
+or an instance constantly up on more than one node), which should be
+reported and acted upon.
+
+The instance status will be present on each node, for the instances
+the node is primary for, and will contain at least:
+
+- The instance name
+- The instance UUID (stable on name change)
+- The instance running status (up or down)
+- The timestamp of last known change
+- The timestamp of when the status was last checked (see caching, below)
+- The last known reason for change, if any
+
+More information about all the fields and their types will be
+available in the "Format of the report" section.
+
+Note that as soon as a node knows it is no longer the primary for an
+instance, it will stop reporting its status: this means the instance
+will either disappear, if it has been deleted, or appear on another
+node, if it has been moved.
+
+Instance Disk status
+++++++++++++++++++++
+
+As for the instance status, Ganeti currently has only partial
+information about its instance disks: in particular, each node is
+unaware of the disk-to-instance mapping, which exists only on the
+master.
+
+For this design doc we plan to fix this by changing all RPCs that
+create a backend storage or that put an already existing one in use,
+passing the relevant instance to the node. The node can then export
+this information to the status reporting tool.
+
+While we haven't implemented these RPC changes yet, we'll use confd to
+fetch this information in the data collector.
+
+Since Ganeti supports many types of disks for instances (drbd, rbd,
+plain, file) we will export both a "generic" status, which will work
+for any type of disk and will be very opaque (at minimum just a
+"healthy" or "error" state, plus perhaps some human-readable
+comment), and a "per-type" status, which will explain more about the
+internal details but will not be compatible between different storage
+types (it will, for example, export the drbd connection status, sync
+progress, and so on).
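+
+Purely as a sketch (none of these field names are final), the status
+of a drbd-backed disk could then be reported along these lines, with
+a generic part and a drbd-specific part:
+
+.. code-block:: json
+
+  {
+    "instance": "instance1.example.com",
+    "status": "healthy",
+    "comment": "",
+    "drbd": {
+      "connection": "Connected",
+      "role": "Primary",
+      "disk_state": "UpToDate"
+    }
+  }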
+
+Status of storage for instances
++++++++++++++++++++++++++++++++
+
+The node will also report on all storage types it knows about for the
+current node (this is right now hardcoded to the enabled storage
+types, and in the future will be tied to the enabled storage pools
+for the nodegroup). For this kind of information too we will report
+both a generic health status (healthy or error) for each type of
+storage, and some more general statistics (free space, used space,
+total visible space). In addition, type-specific information can be
+exported: for example, in case of error, its nature can be disclosed
+as type-specific information. Examples of this are "backend pv
+unavailable" for lvm storage, "unreachable" for network-based storage
+or "filesystem error" for filesystem-based implementations.
+
+Ganeti daemons status
++++++++++++++++++++++
+
+Ganeti will report what information it has about its own daemons:
+this includes memory usage, uptime and CPU usage. This should make it
+possible to identify problems with the Ganeti system itself: for
+example memory leaks, crashes and high resource utilization should be
+evident from this information.
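+
+Again purely as a sketch (the field names are hypothetical), a single
+daemon entry could report something like:
+
+.. code-block:: json
+
+  {
+    "name": "ganeti-noded",
+    "uptime": 86400,
+    "cpu_seconds": 12.5,
+    "memory_rss": 24576
+  }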
+
+Ganeti daemons will also be able to export extra internal information to
+the status reporting, through the plugin system (see below).
+
+Hypervisor resources report
++++++++++++++++++++++++++++
+
+Each hypervisor has a view of system resources that sometimes differs
+from the one the OS sees (for example in Xen the node OS, running as
+Dom0, has access to only part of those resources). In this section
+we'll report all the information we can in a "non hypervisor
+specific" way. Each hypervisor can then add extra specific
+information that is not generic enough to be abstracted.
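+
+A sketch of how this could look for a Xen node follows (all names are
+hypothetical, and the hypervisor-specific part would differ for each
+hypervisor):
+
+.. code-block:: json
+
+  {
+    "memory_total": 16384,
+    "memory_free": 4096,
+    "cpu_total": 8,
+    "xen": {
+      "dom0_memory": 1024
+    }
+  }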
+
+Node OS resources report
+++++++++++++++++++++++++
+
+Since Ganeti assumes it's running on Linux, it's useful to export
+some basic information as seen by the host system. This includes the
+number and status of CPUs, memory, filesystems and network interfaces
+as well as the versions of the components Ganeti interacts with
+(Linux, drbd, hypervisor, etc.).
+
+Note that we won't go into any hardware-specific details (e.g.
+querying a node's RAID controllers is outside the scope of this, and
+can be implemented as a plugin), but we can easily report the
+information above, since it's standard enough across all systems.
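+
+As an example only (none of these names or values are final), the
+node OS report could include something like:
+
+.. code-block:: json
+
+  {
+    "cpus": 8,
+    "memory": 16384,
+    "filesystems": ["/", "/var"],
+    "interfaces": ["eth0", "eth1"],
+    "versions": {
+      "linux": "3.2.0",
+      "drbd": "8.3.11"
+    }
+  }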
+
+Plugin system
++++++++++++++
+
+The monitoring system will be equipped with a plugin system through
+which specific local information can be exported. Plugins will take
+the form of either scripts whose output will be inserted in the
+report, plain text files which will be inserted into the report, or
+local unix or network sockets from which the information has to be
+read. This should allow the most flexibility for implementing an
+efficient system, while keeping it as simple as possible.
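+
+How plugin output will be embedded in the report is not defined yet;
+a minimal sketch could be a list of entries carrying the plugin name
+and its (opaque) output:
+
+.. code-block:: json
+
+  [
+    {
+      "name": "raid-status",
+      "output": "OK: all arrays optimal"
+    }
+  ]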
+
+The plugin system is expected to be used by local installations to
+export any installation specific information that they want to be
+monitored, about either hardware or software on their systems.
+
+
+Format of the query
+-------------------
+
+The query will be an HTTP GET request on a particular port. At the
+beginning it will only be possible to query the full status report.
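+
+As a sketch (the port is not yet assigned, and node1.example.com is
+just a placeholder name), the query would simply be::
+
+  GET / HTTP/1.1
+  Host: node1.example.com
+
+with the full JSON report returned in the response body.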
+
+
+Format of the report
+--------------------
+
+TBD (this part needs to be completed with the format of the JSON and the
+types of the various variables exported, as they get evaluated and
+decided)
+
+
+Data collectors
+---------------
+
+In order to ease testing, as well as to make it simple to reuse this
+subsystem, it will be possible to run just the "data collectors" on
+each node without passing through the agent daemon. Each data
+collector will report specific data about its subsystem and will be
+documented separately.
+
+
+Mode of operation
+-----------------
+
+In order to be able to report information quickly, the monitoring
+agent daemon will keep an in-memory or on-disk cache of the status,
+which will be returned when queries are made. The status system will
+then periodically check resources to make sure the status is up to
+date.
+
+Different parts of the report will be refreshed at different
+intervals, depending on:
+
+- how often they vary (or we expect them to vary)
+- how fast they are to query
+- how important their freshness is
+
+Of course the last parameter is installation specific, and while
+we'll try to have sensible defaults, it will be configurable. The
+first two, instead, can be used adaptively to query a certain
+resource more or less often.
+
+
+Implementation place
+--------------------
+
+The status daemon will be implemented as a standalone Haskell daemon. In
+the future it should be easy to merge multiple daemons into one with
+multiple entry points, should we find out it saves resources and doesn't
+impact functionality.
+
+The libekg library should be looked at as an easy way of providing
+metrics in JSON format.
+
+
+Implementation order
+--------------------
+
+We will implement the agent system in this order:
+
+- initial example data collectors (e.g. for drbd and instance status)
+- initial daemon for exporting data
+- RPC updates for instance status reasons and disk to instance mapping
+- more data collectors
+- cache layer for the daemon (if needed)
+
+
+Future work
+===========
+
+As a future step it can be useful to "centralize" all this reporting
+data in a single place. This could for example be just the master
+node, or all the master candidates. We will evaluate doing this after
+the first node-local version has been developed and tested.
+
+Another possible change is replacing the "read-only" RPCs with
+queries to the agent system, thus having only one way of collecting
+information from the nodes, both for a monitoring system and for
+Ganeti itself.
+
+One extra feature we may need is a way to query for only sub-parts of
+the report (e.g. instance status only). This can be done by passing
+arguments to the HTTP GET, which will be defined when we get to this
+functionality.
+
+Finally, the :doc:`autorepair system <design-autorepair>` can be
+expanded to use the monitoring agent system as a source of
+information to decide which repairs it can perform.
+
+.. vim: set textwidth=72 :
+.. Local Variables:
+.. mode: rst
+.. fill-column: 72
+.. End:
-- 
GitLab