diff --git a/doc/Makefile.am b/doc/Makefile.am index 61db0873d87efb3b9606fe70ccb977fe3277d87a..e8419de09fd0ce4641044ba6a7f7a7c4635ff3d9 100644 --- a/doc/Makefile.am +++ b/doc/Makefile.am @@ -4,8 +4,10 @@ SUBDIRS = examples dist_doc_DATA = \ hooks.html hooks.pdf \ install.html install.pdf \ - admin.html admin.pdf -EXTRA_DIST = hooks.sgml install.sgml admin.sgml + admin.html admin.pdf \ + iallocator.html iallocator.pdf + +EXTRA_DIST = hooks.sgml install.sgml admin.sgml iallocator.sgml MAINTAINERCLEANFILES = *.html *.pdf %.sgmltmp: %.sgml diff --git a/doc/iallocator.sgml b/doc/iallocator.sgml new file mode 100644 index 0000000000000000000000000000000000000000..3d0e5af248f9be3ace55ff2e01b36aade28b4fba --- /dev/null +++ b/doc/iallocator.sgml @@ -0,0 +1,577 @@ +<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook V4.2//EN" [ +]> + <article class="specification"> + <articleinfo> + <title>Ganeti automatic instance allocation</title> + </articleinfo> + <para>Documents Ganeti version 1.2</para> + <sect1> + <title>Introduction</title> + + <para>Currently in Ganeti the admin has to specify the exact + locations for an instance's node(s). 
This prevents a completely + automatic node evacuation, and is in general a nuisance.</para> + + <para>The <acronym>iallocator</acronym> framework will enable + automatic placement via external scripts, which allows + customization of the cluster layout per the site's + requirements.</para> + + </sect1> + + <sect1> + <title>User-visible changes</title> + + <para>There are two parts of the Ganeti operation that are + impacted by the auto-allocation: how the cluster knows what the + allocator algorithms are and how the admin uses these in creating + instances.</para> + + <para>An allocation algorithm is identified by the filename of a + program installed in a defined list of directories.</para> + + <sect2> + <title>Cluster configuration</title> + + <para>At configure time, the list of the directories can be + selected via the + <option>--with-iallocator-search-path=LIST</option> option, + where <userinput>LIST</userinput> is a comma-separated list of + directories. If not given, this defaults to + <constant>$libdir/ganeti/iallocators</constant>, i.e. for an + installation under <filename class="directory">/usr</filename>, + this will be <filename + class="directory">/usr/lib/ganeti/iallocators</filename>.</para> + + <para>Ganeti will then search for the allocator script in the + configured list, using the first one whose filename matches the + one given by the user.</para> + + </sect2> + + <sect2> + <title>Command line interface changes</title> + + <para>The node selection options in instance add and instance + replace disks can be replaced by the new <option>--iallocator + <replaceable>NAME</replaceable></option> option, which will + trigger the automatic assignment. 
The selected node(s) will be shown as + part of the command output.</para> + + </sect2> + + </sect1> + + <sect1> + <title>IAllocator API</title> + + <para>The protocol for communication between Ganeti and an + allocator script will be the following:</para> + + <orderedlist> + <listitem> + <simpara>Ganeti launches the program with a single argument, the + name of a file that contains a JSON-encoded structure (the input + message)</simpara> + </listitem> + <listitem> + <simpara>if the script finishes with an exit code different from + zero, it is considered a general failure and the full output + will be reported to the users; this can be the case when the + allocator can't parse the input message;</simpara> + </listitem> + <listitem> + <simpara>if the allocator finishes with exit code zero, it is + expected to output (on its stdout) a JSON-encoded structure + (the response)</simpara> + </listitem> + </orderedlist> + + <sect2> + <title>Input message</title> + + <para>The input message will be the JSON encoding of a + dictionary containing the following:</para> + + <variablelist> + <varlistentry> + <term>version</term> + <listitem> + <simpara>the version of the protocol; this document + specifies version 1</simpara> + </listitem> + </varlistentry> + <varlistentry> + <term>cluster_name</term> + <listitem> + <simpara>the cluster name</simpara> + </listitem> + </varlistentry> + <varlistentry> + <term>cluster_tags</term> + <listitem> + <simpara>the list of cluster tags</simpara> + </listitem> + </varlistentry> + <varlistentry> + <term>request</term> + <listitem> + <simpara>a dictionary containing the request data:</simpara> + <variablelist> + <varlistentry> + <term>type</term> + <listitem> + <simpara>the request type; this can be either + <literal>allocate</literal> or + <literal>relocate</literal>; the + <literal>allocate</literal> request is used when a + new instance needs to be placed on the cluster, + while the <literal>relocate</literal> request is + used when an existing 
instance needs to be moved + within the cluster</simpara> + </listitem> + </varlistentry> + <varlistentry> + <term>name</term> + <listitem> + <simpara>the name of the instance; if the request is + a relocation, then this name will be found in the + list of instances (see below), otherwise it is the + <acronym>FQDN</acronym> of the new + instance</simpara> + </listitem> + </varlistentry> + <varlistentry> + <term>required_nodes</term> + <listitem> + <simpara>how many nodes the algorithm should return; + while this information can be deduced from the + instance's disk template, it's better if this + computation is left to Ganeti, as allocator + scripts are then less sensitive to changes to the disk + templates</simpara> + </listitem> + </varlistentry> + <varlistentry> + <term>disk_space_total</term> + <listitem> + <simpara>the total disk space that will be used by + this instance on the (new) nodes; again, this + information can be computed from the list of + instance disks and its template type, but Ganeti is + better suited to compute it</simpara> + </listitem> + </varlistentry> + </variablelist> + <simpara>If the request is an allocation, then there are + extra fields in the request dictionary:</simpara> + <variablelist> + <varlistentry> + <term>disks</term> + <listitem> + <simpara>list of dictionaries holding the disk + definitions for this instance (in the order they are + exported to the hypervisor):</simpara> + <variablelist> + <varlistentry> + <term>mode</term> + <listitem> + <simpara>either <literal>r</literal> or + <literal>w</literal> denoting if the disk is + read-only or writable; for Ganeti 1.2, this + will always be <literal>w</literal></simpara> + </listitem> + </varlistentry> + <varlistentry> + <term>size</term> + <listitem> + <simpara>the size of this disk in mebibytes</simpara> + </listitem> + </varlistentry> + </variablelist> + </listitem> + </varlistentry> + <varlistentry> + <term>nics</term> + <listitem> + <simpara>a list of dictionaries holding the 
network + interfaces for this instance, containing:</simpara> + <variablelist> + <varlistentry> + <term>ip</term> + <listitem> + <simpara>the IP address that Ganeti knows for + this instance, or null</simpara> + </listitem> + </varlistentry> + <varlistentry> + <term>mac</term> + <listitem> + <simpara>the MAC address for this interface</simpara> + </listitem> + </varlistentry> + <varlistentry> + <term>bridge</term> + <listitem> + <simpara>the bridge to which this interface + will be connected</simpara> + </listitem> + </varlistentry> + </variablelist> + </listitem> + </varlistentry> + <varlistentry> + <term>vcpus</term> + <listitem> + <simpara>the number of VCPUs for the instance</simpara> + </listitem> + </varlistentry> + <varlistentry> + <term>disk_template</term> + <listitem> + <simpara>the disk template for the instance</simpara> + </listitem> + </varlistentry> + <varlistentry> + <term>memory</term> + <listitem> + <simpara>the memory size for the instance</simpara> + </listitem> + </varlistentry> + <varlistentry> + <term>os</term> + <listitem> + <simpara>the OS type for the instance</simpara> + </listitem> + </varlistentry> + <varlistentry> + <term>tags</term> + <listitem> + <simpara>the list of the instance's tags</simpara> + </listitem> + </varlistentry> + </variablelist> + <simpara>If the request is of type relocate, then there is + one more entry in the request dictionary, named + <varname>relocate_from</varname>, and it contains a list + of nodes to move the instance away from; note that with + Ganeti 1.2, this list will always contain a single node, + the current secondary of the instance.</simpara> + </listitem> + </varlistentry> + <varlistentry> + <term>instances</term> + <listitem> + <simpara>a dictionary with the data for the currently + existing instances on the cluster, indexed by instance + name; the contents are similar to the instance definitions + for the allocate mode, with the addition of:</simpara> + <variablelist> + <varlistentry> + 
<term>should_run</term> + <listitem> + <simpara>whether this instance is set to run (but not the + actual status of the instance)</simpara> + </listitem> + </varlistentry> + <varlistentry> + <term>nodes</term> + <listitem> + <simpara>list of nodes on which this instance is + placed; the primary node of the instance is always + the first one</simpara> + </listitem> + </varlistentry> + </variablelist> + </listitem> + </varlistentry> + <varlistentry> + <term>nodes</term> + <listitem> + <simpara>dictionary with the data for the nodes in the + cluster, indexed by the node name; the dict + contains:</simpara> + <variablelist> + <varlistentry> + <term>total_disk</term> + <listitem> + <simpara>the total disk size of this node + (mebibytes)</simpara> + </listitem> + </varlistentry> + <varlistentry> + <term>free_disk</term> + <listitem> + <simpara>the free disk space on the node</simpara> + </listitem> + </varlistentry> + <varlistentry> + <term>total_memory</term> + <listitem> + <simpara>the total memory size</simpara> + </listitem> + </varlistentry> + <varlistentry> + <term>free_memory</term> + <listitem> + <simpara>free memory on the node; note that + currently this does not take into account the + instances which are down on the node</simpara> + </listitem> + </varlistentry> + <varlistentry> + <term>primary_ip</term> + <listitem> + <simpara>the primary IP address of the + node</simpara> + </listitem> + </varlistentry> + <varlistentry> + <term>secondary_ip</term> + <listitem> + <simpara>the secondary IP address of the node (the + one used for the DRBD replication); note that this + can be the same as the primary one</simpara> + </listitem> + </varlistentry> + <varlistentry> + <term>tags</term> + <listitem> + <simpara>list with the tags of the node</simpara> + </listitem> + </varlistentry> + </variablelist> + </listitem> + </varlistentry> + </variablelist> + + </sect2> + + <sect2> + <title>Response message</title> + + <para>The response message is much simpler than the input 
+ one. It is also a dict having three keys:</para> + <variablelist> + <varlistentry> + <term>success</term> + <listitem> + <simpara>a boolean value denoting if the allocation was + successful or not</simpara> + </listitem> + </varlistentry> + <varlistentry> + <term>info</term> + <listitem> + <simpara>a string with information from the script; if + the allocation fails, this will be shown to the + user</simpara> + </listitem> + </varlistentry> + <varlistentry> + <term>nodes</term> + <listitem> + <simpara>the list of nodes computed by the algorithm; even + if the algorithm failed (i.e. success is false), this must + be returned as an empty list; also note that the length of + this list must equal the + <varname>required_nodes</varname> entry in the input + message, otherwise Ganeti will consider the result as + failed</simpara> + </listitem> + </varlistentry> + </variablelist> + </sect2> + </sect1> + + <sect1> + <title>Examples</title> + <sect2> + <title>Input messages to scripts</title> + <simpara>Input message, new instance allocation</simpara> + <screen> +{ + "cluster_tags": [], + "request": { + "required_nodes": 2, + "name": "instance3.example.com", + "tags": [ + "type:test", + "owner:foo" + ], + "type": "allocate", + "disks": [ + { + "mode": "w", + "size": 1024 + }, + { + "mode": "w", + "size": 2048 + } + ], + "nics": [ + { + "ip": null, + "mac": "00:11:22:33:44:55", + "bridge": null + } + ], + "vcpus": 1, + "disk_template": "drbd", + "memory": 2048, + "disk_space_total": 3328, + "os": "etch-image" + }, + "cluster_name": "cluster1.example.com", + "instances": { + "instance1.example.com": { + "tags": [], + "should_run": false, + "disks": [ + { + "mode": "w", + "size": 64 + }, + { + "mode": "w", + "size": 512 + } + ], + "nics": [ + { + "ip": null, + "mac": "aa:00:00:00:60:bf", + "bridge": "xen-br0" + } + ], + "vcpus": 1, + "disk_template": "plain", + "memory": 128, + "nodes": [ + "node1.example.com" + ], + "os": "etch-image" + }, + "instance2.example.com": { + "tags": 
[], + "should_run": false, + "disks": [ + { + "mode": "w", + "size": 512 + }, + { + "mode": "w", + "size": 256 + } + ], + "nics": [ + { + "ip": null, + "mac": "aa:00:00:55:f8:38", + "bridge": "xen-br0" + } + ], + "vcpus": 1, + "disk_template": "drbd", + "memory": 512, + "nodes": [ + "node2.example.com", + "node3.example.com" + ], + "os": "etch-image" + } + }, + "version": 1, + "nodes": { + "node1.example.com": { + "total_disk": 858276, + "primary_ip": "192.168.1.1", + "secondary_ip": "192.168.2.1", + "tags": [], + "free_memory": 3505, + "free_disk": 856740, + "total_memory": 4095 + }, + "node2.example.com": { + "total_disk": 858240, + "primary_ip": "192.168.1.3", + "secondary_ip": "192.168.2.3", + "tags": ["test"], + "free_memory": 3505, + "free_disk": 848320, + "total_memory": 4095 + }, + "node3.example.com": { + "total_disk": 572184, + "primary_ip": "192.168.1.3", + "secondary_ip": "192.168.2.3", + "tags": [], + "free_memory": 3505, + "free_disk": 570648, + "total_memory": 4095 + } + } +} +</screen> + <simpara>Input message, relocation. 
Since only the request + entry in the input message is changed, the following shows only + this entry:</simpara> + <screen> + "request": { + "relocate_from": [ + "node3.example.com" + ], + "required_nodes": 1, + "type": "relocate", + "name": "instance2.example.com", + "disk_space_total": 832 + }, +</screen> + + </sect2> + <sect2> + <title>Response messages</title> + <simpara>Successful response message:</simpara> + <screen> +{ + "info": "Allocation successful", + "nodes": [ + "node2.example.com", + "node1.example.com" + ], + "success": true +} +</screen> + <simpara>Failed response message:</simpara> + <screen> +{ + "info": "Can't find a suitable node for position 2 (already selected: node2.example.com)", + "nodes": [], + "success": false +} +</screen> + </sect2> + + <sect2> + <title>Command line messages</title> + <screen> +# gnt-instance add -t plain -m 2g --os-size 1g --swap-size 512m --iallocator dumb-allocator -o etch-image instance3 +Selected nodes for the instance: node1.example.com +* creating instance disks... +[...] + +# gnt-instance add -t plain -m 3400m --os-size 1g --swap-size 512m --iallocator dumb-allocator -o etch-image instance4 +Failure: prerequisites not met for this operation: +Can't compute nodes using iallocator 'dumb-allocator': Can't find a suitable node for position 1 (already selected: ) + +# gnt-instance add -t drbd -m 1400m --os-size 1g --swap-size 512m --iallocator dumb-allocator -o etch-image instance5 +Failure: prerequisites not met for this operation: +Can't compute nodes using iallocator 'dumb-allocator': Can't find a suitable node for position 2 (already selected: node1.example.com) + +</screen> + </sect2> + </sect1> + + </article>
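The protocol described in iallocator.sgml can be illustrated with a minimal allocator sketch in Python. It is not part of this patch and is not the allocator Ganeti ships; the function name choose_nodes and the "most free memory first" policy are assumptions made purely for illustration. It also assumes that disk_space_total must fit on every selected node, and that for a relocate request the memory requirement is read from the instances entry, since the request itself only carries memory for allocations.

```python
import json
import sys


def choose_nodes(message):
    """Return (success, info, nodes) for a version 1 input message.

    Hypothetical policy: keep nodes with enough free memory and disk,
    then prefer the ones with the most free memory.
    """
    request = message["request"]
    wanted = request["required_nodes"]
    disk_needed = request["disk_space_total"]
    excluded = set()
    if request["type"] == "relocate":
        # The instance already exists, so its data is in "instances";
        # never pick a node the instance currently lives on.
        instance = message["instances"][request["name"]]
        memory_needed = instance["memory"]
        excluded.update(instance["nodes"])
    else:
        memory_needed = request["memory"]

    candidates = [
        (data["free_memory"], name)
        for name, data in message["nodes"].items()
        if name not in excluded
        and data["free_memory"] >= memory_needed
        and data["free_disk"] >= disk_needed
    ]
    candidates.sort(reverse=True)

    if len(candidates) < wanted:
        info = "Only %d suitable node(s), need %d" % (len(candidates), wanted)
        # On failure the node list must still be present, and empty.
        return False, info, []
    return True, "Allocation successful", [name for _, name in candidates[:wanted]]


def main(argv):
    # Ganeti passes a single argument: the file holding the input message.
    with open(argv[1]) as handle:
        message = json.load(handle)
    success, info, nodes = choose_nodes(message)
    # The response is a JSON-encoded dict written to stdout.
    json.dump({"success": success, "info": info, "nodes": nodes}, sys.stdout)
    return 0


if __name__ == "__main__" and len(sys.argv) == 2:
    sys.exit(main(sys.argv))
```

Saved as an executable file in one of the configured search directories, such a script would be selected by passing its filename to --iallocator. An unparsable input file makes it exit with a traceback and a non-zero status, which matches the general failure case of the protocol.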