diff --git a/Makefile.am b/Makefile.am
index 412e8033c459ee4eaa1ec4b5a450da62f90a7bda..98d8a4f9e7c565b87846e431f824fbd25ae0c705 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -310,6 +310,7 @@ docrst = \
 	doc/cluster-merge.rst \
 	doc/design-shared-storage.rst \
 	doc/design-node-state-cache.rst \
+	doc/design-virtual-clusters.rst \
 	doc/devnotes.rst \
 	doc/glossary.rst \
 	doc/hooks.rst \
diff --git a/doc/design-draft.rst b/doc/design-draft.rst
index 2f05103718347445acc93f7fc02fc57ffa0b91e3..c349092f975a84810f878f26469e4682c2babde4 100644
--- a/doc/design-draft.rst
+++ b/doc/design-draft.rst
@@ -13,6 +13,7 @@ Design document drafts
    design-network.rst
    design-node-state-cache.rst
    design-resource-model.rst
+   design-virtual-clusters.rst
 
 .. vim: set textwidth=72 :
 .. Local Variables:
diff --git a/doc/design-virtual-clusters.rst b/doc/design-virtual-clusters.rst
new file mode 100644
index 0000000000000000000000000000000000000000..2877c0e76b3b443cdcb5a560135d9b8406fcf438
--- /dev/null
+++ b/doc/design-virtual-clusters.rst
@@ -0,0 +1,245 @@
+==========================
+ Virtual clusters support
+==========================
+
+
+Introduction
+============
+
+Currently there are two ways to test the Ganeti (including HTools) code
+base:
+
+- unittests, which run using mocks as a normal user and test small bits
+  of the code
+- QA/burnin/live-test, which require actual hardware (either physical or
+  virtual) and will build an actual cluster, with one machine to one
+  node correspondence
+
+The difference in time between these two is significant:
+
+- the unittests run in about 1-2 minutes
+- a so-called “quick” QA (without burnin) runs in about an hour, and a
+  full QA could be double that time
+
+On the one hand, the unittests have clear advantages: they are quick to
+run and do not require many machines; on the other hand, QA is actually
+able to run end-to-end tests (including HTools, for example).
+
+Ideally, we would have an intermediate step between these two extremes:
+be able to test most, if not all, of Ganeti's functionality but without
+requiring actual hardware, full machine ownership or root access.
+
+
+Current situation
+=================
+
+Ganeti
+------
+
+It is possible, given a manually built ``config.data`` and
+``_autoconf.py``, to run the masterd under the current user as a
+single-node cluster master. However, the node daemon and related
+functionality (cluster initialisation, master failover, etc.) are not
+directly runnable in this model.
+
+Also, masterd only works as a master of a single node cluster, due to
+our current “hostname” method of identifying nodes, which results in a
+limit of at most one node daemon per machine, unless we use multiple
+name and IP aliases.
+
+HTools
+------
+
+In HTools the situation is better, since it doesn't have to deal with
+actual machine management: all tools can use a custom LUXI path, and can
+even load RAPI data from the filesystem (so the RAPI backend can be
+tested), and both the “text” backend for hbal/hspace and the input files
+for hail are text-based, loaded from the file-system.
+
+Proposed changes
+================
+
+The end-goal is to have full support for “virtual clusters”, i.e. be
+able to run a “big” cluster (hundreds of virtual nodes and towards
+thousands of virtual instances) on a reasonably powerful, but single
+machine, under a single user account and without any special privileges.
+
+This would have significant advantages:
+
+- being able to test certain changes end-to-end, without requiring a
+  complicated setup
+- being better able to estimate Ganeti's behaviour and performance as
+  the cluster size grows; this is something that we haven't been able
+  to test reliably yet, and as such we have not yet diagnosed scaling
+  problems
+- easier integration with external tools (and even with HTools)
+
+``masterd``
+-----------
+
+As described above, ``masterd`` already works reasonably well in a
+virtual setup, as it won't execute external programs and it shouldn't
+directly read files from the local filesystem (or at least not
+virtualisation-related, as the master node can be a non-vm_capable
+node).
+
+``noded``
+---------
+
+The node daemon executes many privileged operations, but they can be
+split into a few general categories:
+
++---------------+-----------------------+------------------------------------+
+|Category       |Description            |Solution                            |
++===============+=======================+====================================+
+|disk operations|Disk creation and      |Use only diskless or file-based     |
+|               |removal                |instances                           |
++---------------+-----------------------+------------------------------------+
+|disk query     |Node disk total/free,  |Not supported currently, could use  |
+|               |used in node listing   |file-based                          |
+|               |and htools             |                                    |
++---------------+-----------------------+------------------------------------+
+|hypervisor     |Instance start, stop   |Use the *fake* hypervisor           |
+|operations     |and query              |                                    |
++---------------+-----------------------+------------------------------------+
+|instance       |Bridge existence query |Unprivileged operation, can be used |
+|networking     |                       |with an existing bridge at system   |
+|               |                       |level or use NIC-less instances     |
++---------------+-----------------------+------------------------------------+
+|instance OS    |OS add, OS rename,     |Only used with non-diskless         |
+|operations     |export and import      |instances; could work with custom OS|
+|               |                       |scripts (that just ``dd`` without   |
+|               |                       |mounting filesystems)               |
++---------------+-----------------------+------------------------------------+
+|node networking|IP address management  |Not supported; Ganeti will need to  |
+|               |(master IP), IP query, |work without a master IP. For the IP|
+|               |etc.                   |query operations, the test machine  |
+|               |                       |would need externally-configured IPs|
++---------------+-----------------------+------------------------------------+
+|node setup     |ssh, /etc/hosts, etc.  |Can already be disabled from the    |
+|               |                       |cluster config                      |
++---------------+-----------------------+------------------------------------+
+|master failover|start/stop the master  |Doable (as long as we use a single  |
+|               |daemon                 |user), might get tricky w.r.t. paths|
+|               |                       |to executables                      |
++---------------+-----------------------+------------------------------------+
+|file upload    |Uploading of system    |The only issue could be with system |
+|               |files, job queue files |files, which are not owned by the   |
+|               |and ganeti config      |current user; internal ganeti files |
+|               |                       |should be working fine              |
++---------------+-----------------------+------------------------------------+
+|node oob       |Out-of-band commands   |Since these are user-defined, we can|
+|               |                       |mock them easily                    |
++---------------+-----------------------+------------------------------------+
+|node OS        |List the existing OSes |No special privileges needed, so    |
+|discovery      |and their properties   |works fine as-is                    |
++---------------+-----------------------+------------------------------------+
+|hooks          |Running hooks for given|No special privileges needed        |
+|               |operations             |                                    |
++---------------+-----------------------+------------------------------------+
+|iallocator     |Calling an iallocator  |No special privileges needed        |
+|               |script                 |                                    |
++---------------+-----------------------+------------------------------------+
+|export/import  |Exporting and importing|When exporting/importing file-based |
+|               |instances              |instances, this should work, as the |
+|               |                       |listening ports are dynamically     |
+|               |                       |chosen                              |
++---------------+-----------------------+------------------------------------+
+|hypervisor     |The validation of      |As long as the hypervisors don't    |
+|validation     |hypervisor parameters  |call privileged commands, it        |
+|               |                       |should work                         |
++---------------+-----------------------+------------------------------------+
+|node powercycle|The ability to power   |Privileged, so not supported, but   |
+|               |cycle a node remotely  |anyway not very interesting for     |
+|               |                       |testing                             |
++---------------+-----------------------+------------------------------------+
+
+It seems that much of the functionality works as is, or could work with
+small adjustments, even in a non-privileged setup. The bigger problem is
+the actual use of multiple node daemons per machine.
+
+Multiple ``noded`` per machine
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Currently Ganeti identifies nodes simply by their hostname. Since
+changing this method would imply significant changes to tracking the
+nodes, the proposal is to simply give the (single) machine used for
+tests as many IPs as there are nodes, and have each IP correspond to a
+different name, so that no changes are needed to the core RPC library.
+Unfortunately this has the downside of requiring root rights for setting
+up the extra IPs and hostnames.
+
+An alternative option is to implement per-node IP/port support in Ganeti
+(especially in the RPC layer), which would remove the root requirement.
+We expect this to be implemented as a second step of this design.
+
+The only remaining problem is with sharing the ``localstatedir``
+structure (lib, run, log) amongst the daemons, for which we propose to
+add a command line parameter which can override this path (via injection
+into ``_autoconf.py``). The rationale for this is two-fold:
+
+- having two or more node daemons writing to the same directory might
+  introduce artificial scenarios not present in real life; currently
+  noded either owns the entire ``/var/lib/ganeti`` directory or shares
+  it with masterd, but never with another noded
+- having separate directories allows cluster verify to correctly check
+  the consistency of file upload operations; otherwise, as long as one
+  node daemon wrote a file successfully, the results from all others
+  are “lost”
+
+
+``rapi``
+--------
+
+The RAPI daemon is not privileged and furthermore we only need one per
+cluster, so it presents no issues.
+
+``confd``
+---------
+
+``confd`` has somewhat the same issues as the node daemon regarding
+multiple daemons per machine, but the per-address binding still works.
+
+``ganeti-watcher``
+------------------
+
+Since the startup of daemons will be customised with per-IP binds, the
+watcher either has to be modified to not activate the daemons, or the
+start-stop tool has to take this into account. Due to the watcher's use
+of the hostname, it's recommended that the master node is set to the
+machine hostname (also a requirement for the master daemon).
+
+CLI scripts
+-----------
+
+As long as the master node is set to the machine hostname, these should
+work fine.
+
+Cluster initialisation
+----------------------
+
+The cluster initialisation procedure might be a bit more involved
+(this has not been tried yet). In any case, we can build a
+``config.data`` file manually, without having to actually run
+``gnt-cluster init``.
+
+Needed tools
+============
+
+With the above investigation results in mind, the only things we need
+are:
+
+- a tool to set up the per-virtual-node tree structure of
+  ``localstatedir`` and correctly set up the extra IPs/hostnames
+- changes to the startup daemon tools to correctly launch the daemons
+  per virtual node
+- changes to ``noded`` to override the ``localstatedir`` path
+- documentation for running such a virtual cluster
+- possibly, small fixes to the node daemon backend functionality, to
+  better separate privileged and non-privileged code
+
+.. vim: set textwidth=72 :
+.. Local Variables:
+.. mode: rst
+.. fill-column: 72
+.. End:
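
The first item in the "Needed tools" list (a tool that sets up a per-virtual-node ``localstatedir`` tree) could look roughly like the sketch below. It is only an illustration under stated assumptions, not part of this patch: the function names, the ``<root>/<node>/{lib,run,log}/ganeti`` layout and the ``node1.example.com`` style hostnames are all hypothetical. It covers only the unprivileged directory creation; setting up the extra IPs and hostnames needs root and is left out.

```python
import os
import tempfile

# Sub-directories of ``localstatedir`` named in the design (lib, run, log).
STATE_SUBDIRS = ("lib", "run", "log")


def node_state_dir(root, node_name):
    """Return the per-node state directory (hypothetical layout)."""
    return os.path.join(root, node_name)


def setup_virtual_nodes(root, node_names):
    """Create one private lib/run/log tree per virtual node.

    Returns a dict mapping each node name to its state directory, so
    that no two node daemons ever share a directory.
    """
    layout = {}
    for name in node_names:
        base = node_state_dir(root, name)
        for sub in STATE_SUBDIRS:
            # The "ganeti" leaf mirrors /var/lib/ganeti; this is an
            # assumption about the final layout, not a Ganeti rule.
            os.makedirs(os.path.join(base, sub, "ganeti"), exist_ok=True)
        layout[name] = base
    return layout


if __name__ == "__main__":
    # Use a throw-away directory owned by the current user.
    root = tempfile.mkdtemp(prefix="vcluster-")
    names = ["node%d.example.com" % i for i in range(1, 4)]
    for name, path in sorted(setup_virtual_nodes(root, names).items()):
        print(name, "->", path)
```

Each path returned here would then be handed to the matching node daemon through the proposed ``localstatedir`` override, whose final spelling the design leaves open.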