From 282f38e32aed147c4debe6163761749cf462214a Mon Sep 17 00:00:00 2001
From: Guido Trotter <ultrotter@google.com>
Date: Wed, 25 Aug 2010 12:13:47 +0100
Subject: [PATCH] Node groups design doc

For the first version we should be able to implement node groups
without any backend API changes (i.e. Iallocator).

Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
---
 doc/design-2.3.rst | 131 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 131 insertions(+)

diff --git a/doc/design-2.3.rst b/doc/design-2.3.rst
index 01282fcc1..e45a0772d 100644
--- a/doc/design-2.3.rst
+++ b/doc/design-2.3.rst
@@ -17,6 +17,137 @@ As for 2.1 and 2.2 we divide the 2.3 design into three areas:
 Core changes
 ============
 
+Node Groups
+-----------
+
+Current state and shortcomings
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Currently all nodes of a Ganeti cluster are considered part of the same
+pool for allocation purposes: DRBD instances, for example, can be
+allocated on any two nodes.
+
+This causes problems in cases where nodes are not all equally connected
+to each other. For example, if a cluster is created over two sets of
+machines, each connected to its own switch, the internal bandwidth
+between machines connected to the same switch might be higher than the
+bandwidth of inter-switch connections.
+
+Moreover, some operations inside a cluster require all nodes to be
+locked together for inter-node consistency, and won't scale if we
+increase the number of nodes to a few hundred.
+
+Proposed changes
+~~~~~~~~~~~~~~~~
+
+With this change we'll divide Ganeti nodes into groups. Nothing will
+change for clusters with only one node group, the default one. Bigger
+clusters will instead be able to have more than one group, and each
+node will belong to exactly one.
+
+Node group management
++++++++++++++++++++++
+
+To manage node groups and the nodes belonging to them, the following
+new commands/flags will be introduced::
+
+  gnt-node group-add <group>                # add a new node group
+  gnt-node group-del <group>                # delete an empty group
+  gnt-node group-list                       # list node groups
+  gnt-node group-rename <oldname> <newname> # rename a group
+  gnt-node list/info -g <group>             # list only nodes belonging to a group
+  gnt-node add -g <group>                   # add a node to a certain group
+  gnt-node modify -g <group>                # move a node to a new group
+
+Instance level changes
+++++++++++++++++++++++
+
+Instances will be able to live in only one group at a time. This is
+mostly important for DRBD instances, in which case both their primary
+and secondary nodes will need to be in the same group. To support this
+we envision the following changes:
+
+ - The cluster will have a default group, which will initially be the
+   only one.
+ - Instance allocation will happen in the cluster's default group
+   (which will be changeable via gnt-cluster modify or RAPI) unless a
+   group is explicitly specified in the creation job (with -g or via
+   RAPI). Iallocator will only be passed the nodes belonging to that
+   group.
+ - Moving an instance between groups can only happen via an explicit
+   operation, which for example in the case of DRBD will work by
+   internally performing a replace-disks, a migration, and a second
+   replace-disks (see the sketch after this list). It will be possible
+   to clean up an interrupted group-move operation.
+ - Cluster verify will signal an error if an instance has been left
+   mid-transition between groups.
+ - Inter-group instance migration/failover will check that the target
+   group is able to accept the instance network/storage wise, and fail
+   otherwise. In the future we may be able to allow some parameters to
+   be changed during the move, but in the first version we expect an
+   import/export if this is not possible.
+ - From an allocation point of view, inter-group movements will be
+   shown to an iallocator as a new allocation over the target group.
+   Only in a future version may we add allocator extensions to decide
+   which group the instance should be in. In the meantime we expect
+   Ganeti administrators to either spread instances over groups by
+   filling all groups first, or to have their own strategy based on the
+   instance needs.
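+
+As an illustration only, the inter-group move sketched above is roughly
+what could be done by hand with today's commands. The Python helper
+below is a sketch of the intended mechanics, not the final
+implementation: its name, arguments and node choices are made up for
+the example::
+
+  import subprocess
+
+  def move_drbd_instance(instance, target_node1, target_node2):
+    """Move a DRBD instance onto two nodes of the target group."""
+    # 1. Make a target-group node the new secondary.
+    subprocess.check_call(["gnt-instance", "replace-disks",
+                           "-n", target_node1, instance])
+    # 2. Migrate, so the target-group node becomes the primary.
+    subprocess.check_call(["gnt-instance", "migrate", "-f", instance])
+    # 3. Replace the remaining old-group node as well.
+    subprocess.check_call(["gnt-instance", "replace-disks",
+                           "-n", target_node2, instance])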
+
+Cluster/Internal/Config level changes
++++++++++++++++++++++++++++++++++++++
+
+We expect the following changes for cluster management:
+
+ - Frequent multinode operations, such as os-diagnose or
+   cluster-verify, will act on one group at a time. The default group
+   will be used if none is passed. Command line tools will have a way
+   to easily target all groups, by generating one job per group.
+ - Groups will have a human-readable name, but will internally always
+   be referenced by a UUID, which will be immutable. For example the
+   cluster object will contain the UUID of the default group, each node
+   will contain the UUID of the group it belongs to, etc. This is done
+   to simplify referencing while keeping it easy to handle renames and
+   movements (see the sketch below). If we see that this works well,
+   we'll transition other config objects (instances, nodes) to the same
+   model.
+ - The addition of a new per-group lock will be evaluated, to see
+   whether some operations that currently require the BGL can be
+   transitioned to it.
+ - Master candidate status will be allowed to be spread among groups.
+   For the first version we won't add any restriction over how this is
+   done, although in the future we may have a minimum number of master
+   candidates which Ganeti will try to keep in each group, for example.
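+
+As a side note, a minimal sketch of the UUID-based referencing model
+(class and attribute names here are assumptions for illustration, not
+the actual Ganeti config objects) could look like::
+
+  import uuid
+
+  class NodeGroup:
+    def __init__(self, name):
+      self.uuid = str(uuid.uuid4())  # immutable identifier
+      self.name = name               # human-readable, renamable
+
+  class Node:
+    def __init__(self, name, group):
+      self.name = name
+      self.group_uuid = group.uuid   # refer to the group by UUID
+
+  default = NodeGroup("default")
+  node = Node("node1.example.com", default)
+  default.name = "rack1"  # rename: node.group_uuid is still valid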
+
+Other work and future changes
++++++++++++++++++++++++++++++
+
+Commands like gnt-cluster command/copyfile will continue to work on the
+whole cluster, but it will be possible to target one group only by
+specifying it.
+
+Commands which allow selection of sets of resources (for example
+gnt-instance start/stop) will be able to select them by node group as
+well.
+
+Initially node groups won't be taggable objects, to simplify the first
+implementation, but we expect this to be easy to add in a future
+version should we see it's useful.
+
+We envision groups as a good place to enhance cluster scalability. In
+the future we may want to use them as units for configuration
+diffusion, to allow better master scalability. For example it could be
+possible to change some all-nodes RPCs to contact each group once, from
+the master, and make one node in the group perform internal diffusion.
+We won't implement this in the first version, but we'll evaluate it for
+the future, if we see scalability problems on big multi-group clusters.
+
+When Ganeti supports more storage models (e.g. SANs, sheepdog, ceph) we
+expect groups to be the basis for this, allowing for example a
+different sheepdog/ceph cluster, or a different SAN, to be connected to
+each group. In some cases this will mean that inter-group moves will
+necessarily be performed with instance downtime, unless the hypervisor
+has block-migrate functionality and we implement support for it (this
+would be theoretically possible today with KVM, for example).
+
+
 Job priorities
 --------------
-- 
GitLab