From aa355c7954aa601d57be8d69b6d2582e155d373c Mon Sep 17 00:00:00 2001
From: Luca Bigliardi <shammash@google.com>
Date: Wed, 12 May 2010 16:02:36 +0100
Subject: [PATCH] Node daemon availability improvements proposal

Signed-off-by: Luca Bigliardi <shammash@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
---
 doc/design-2.1.rst | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)
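
Illustration only, not part of the patch: a minimal sketch of the
``mlockall(MCL_CURRENT | MCL_FUTURE)`` call described in the proposal,
made through Python ctypes. The function name ``lock_process_memory``
and its optional ``libc`` parameter are hypothetical, and the constant
values are assumed from ``<sys/mman.h>`` on Linux x86/ARM. Memory locks
are not inherited across ``fork()``, so under this sketch each child
process would have to make the call itself.

```python
import ctypes
import os

# Assumed values from <sys/mman.h> on Linux x86/ARM; some architectures
# (for example PowerPC and SPARC) define different values.
MCL_CURRENT = 1
MCL_FUTURE = 2


def lock_process_memory(libc=None):
    """Try to lock all current and future pages of this process in RAM.

    Returns True on success. Raises OSError if mlockall() fails, which
    typically happens without CAP_IPC_LOCK or with a low RLIMIT_MEMLOCK.
    The libc argument is a hypothetical hook for passing in a stand-in
    object; by default the already loaded C library is used.
    """
    if libc is None:
        # CDLL(None) is dlopen(NULL): symbols of the running process,
        # which include the C library's mlockall on Linux.
        libc = ctypes.CDLL(None, use_errno=True)
    if libc.mlockall(MCL_CURRENT | MCL_FUTURE) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return True
```

A daemon would call ``lock_process_memory()`` once early at startup,
and again in every forked child, treating ``OSError`` as a non-fatal
warning or a hard failure depending on policy.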

diff --git a/doc/design-2.1.rst b/doc/design-2.1.rst
index 12b01ce29..3be50f076 100644
--- a/doc/design-2.1.rst
+++ b/doc/design-2.1.rst
@@ -292,6 +292,35 @@ wasn't closed during the timeout, the waiting function returns to its
 caller nonetheless.
 
 
+Node daemon availability
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+Current state and shortcomings
+++++++++++++++++++++++++++++++
+
+Currently, when a Ganeti node suffers serious system disk damage, the
+migration/failover of an instance may not correctly shut down the
+virtual machine on the broken node, causing instance duplication. The
+``gnt-node powercycle`` command can force a node reboot and thus avoid
+duplicated instances. It relies on node daemon availability, though,
+and can fail if, for example, the node daemon has pages swapped out of
+RAM that cannot be read back from the damaged disk.
+
+
+Proposed changes
+++++++++++++++++
+
+The proposed solution forces the node daemon to run exclusively in RAM.
+It uses Python ctypes to call ``mlockall(MCL_CURRENT | MCL_FUTURE)`` on
+the node daemon process and all its children. In addition, another log
+handler has been implemented for the node daemon to redirect messages
+that cannot be written to the logfile to ``/dev/console``.
+
+With these changes the node daemon can successfully run basic tasks
+such as a powercycle request even when the system disk is heavily
+damaged and reading/writing to disk fails constantly.
+
+
 Feature changes
 ---------------
 
-- 
GitLab