Commit cdea7fa8 authored by Santi Raffa, committed by Thomas Thrainer

Gluster: Update design document

Anticipate and explain the choices made in the Gluster patch series.
Remove parts about a possible way of supporting userspace access as
it has been implemented otherwise.
Signed-off-by: Santi Raffa <rsanti@google.com>
Reviewed-by: Thomas Thrainer <thomasth@google.com>
parent 543937ce
This document describes the plan for adding GlusterFS support inside Ganeti.
.. contents:: :depth: 4
.. highlight:: shell-example
Objective
=========
The aim is to let Ganeti support GlusterFS as one of its storage backends.
This involves three aspects:

- Add Gluster as a storage backend.

- Make sure Ganeti VMs can use GlusterFS backends in userspace mode (for
  newer QEMU/KVM versions which have this support) and otherwise, if possible,
  through a kernel-exported block device.

- Make sure Ganeti can configure GlusterFS by itself, by simply joining the
  storage space on new nodes to a GlusterFS node pool. Note that this may
  require another design document that explains how it interacts with storage
  pools, and that the node might or might not host VMs as well.
Background
==========
There are two possible ways to implement "GlusterFS Ganeti support". One is to
treat GlusterFS as an external storage backend; the other is to implement
GlusterFS inside Ganeti, that is, as a new disk template. The benefit of the
latter is that it would not be opaque but fully supported and integrated in
Ganeti, and would not require additional infrastructure for testing/QAing and
such. Having it internal also means we can provide a monitoring agent for it
and more visibility into what is going on. For these reasons, GlusterFS
support will be added directly inside Ganeti.
Implementation Plan
===================
Ganeti Side
-----------
To implement an internal storage backend for Ganeti, one has to implement the
BlockDev interface defined in `ganeti/lib/storage/base.py`, that is, a
concrete class providing operations such as create and remove. These classes
live in `ganeti/lib/storage/bdev.py`. In practice, the difference between
implementing a backend inside Ganeti and outside of it (externally) lies in
how these BlockDev methods are written and how they are combined with Ganeti
itself: the internal implementation is not based on external scripts and
integrates with Ganeti more tightly. The RBD patches are a good reference
here. The steps for adding a new storage backend are as follows (a rough
sketch of such a class is given after the list):
- Implement the BlockDev interface in bdev.py.

- Add the logic in cmdlib (e.g. migration, verify).

- Add the new storage type name to constants.

- Modify objects.Disk to support the GlusterFS storage type.

- The implementation will be performed similarly to the RBD one (see
  commit 7181fba).
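
To make the shape of this work concrete, below is a minimal, hypothetical
sketch of what such a class could look like. The stub base class and the
method names (``Create``, ``Remove``, ``Attach``, ``Assemble``, ``Shutdown``)
are merely modelled on the existing storage classes; the actual interface in
`ganeti/lib/storage/base.py` takes precedence and may differ.

.. code-block:: python

   # Illustrative sketch only; the real base class lives in
   # ganeti/lib/storage/base.py and its signatures take precedence.
   import os


   class BlockDev(object):  # stand-in for ganeti.storage.base.BlockDev
     def __init__(self, unique_id, children, size, params):
       self.unique_id = unique_id
       self.dev_path = None


   class GlusterStorage(BlockDev):
     """File-on-Gluster block device (hypothetical sketch)."""

     def __init__(self, unique_id, children, size, params):
       super(GlusterStorage, self).__init__(unique_id, children, size, params)
       # unique_id is assumed to be (driver, path inside the mounted volume).
       self.dev_path = unique_id[1]

     def Assemble(self):
       """Make sure the backing file exists (mounting is handled elsewhere)."""
       return os.path.exists(self.dev_path)

     def Attach(self):
       """Attach to an existing file; return False if it is missing."""
       return os.path.exists(self.dev_path)

     def Shutdown(self):
       """Nothing to tear down for a plain file."""
       return True

     def Remove(self):
       """Delete the backing file."""
       if os.path.exists(self.dev_path):
         os.remove(self.dev_path)
       return True

     @classmethod
     def Create(cls, unique_id, children, size, params):
       """Allocate a sparse file of the requested size (in mebibytes)."""
       path = unique_id[1]
       with open(path, "wb") as f:
         f.truncate(size * 1024 * 1024)
       return cls(unique_id, children, size, params)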
GlusterFS side
--------------
GlusterFS is a distributed file system implemented in user space.
Apart from NFS and CIFS, the usual way to access a GlusterFS namespace is the
FUSE-based Gluster native client. This is less efficient because data has to
travel through kernel space before reaching user space. Two recent
enhancements address this:

- A new library called libgfapi is now available as part of GlusterFS that
  provides POSIX-like C APIs for accessing Gluster volumes. libgfapi support
  is available from the GlusterFS 3.4 release onwards.

- QEMU/KVM (starting from QEMU 1.3) has a GlusterFS block driver that uses
  libgfapi, so there is no FUSE overhead any longer when QEMU/KVM works with
  VM images on Gluster volumes.
Proposed implementation
-----------------------
QEMU/KVM includes support for GlusterFS, so Ganeti could support GlusterFS
through QEMU/KVM alone. However, that would only give GlusterFS-backed storage
to QEMU/KVM instances, not to other hypervisors such as Xen. Two parts
therefore need to be implemented for supporting GlusterFS inside Ganeti, so
that it covers not only QEMU/KVM VMs but also Xen and others. The first part
is GlusterFS for Xen VMs, which is similar to the sharedfile disk template.
The second part is GlusterFS for QEMU/KVM VMs, which is supported by the
GlusterFS driver for QEMU/KVM. After a ``gnt-instance add -t gluster
instance.example.com`` command is executed, the added instance should be
checked: if it is a Xen VM, it uses the GlusterFS sharedfile path; if it is a
QEMU/KVM VM, it uses the QEMU/KVM + GlusterFS path. For the first part
(GlusterFS for Xen VMs), the sharedfile disk template is a good reference; for
the second part (GlusterFS for QEMU/KVM VMs), the RBD disk template is a good
reference. The first part will be finished first, and the second part, which
builds on it, will follow. A sketch of the userspace access URI used by the
QEMU driver is given below.
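
For the userspace part, QEMU's GlusterFS block driver addresses images with a
URI of the form ``gluster[+transport]://server[:port]/volname/image``. The
helper below is a hypothetical sketch (the function name and defaults are
illustrative, not Ganeti code) of how such a URI could be assembled:

.. code-block:: python

   def gluster_uri(volume, image_path, server="127.0.0.1", port=24007,
                   transport="tcp"):
     """Build a QEMU GlusterFS block driver URI.

     QEMU (>= 1.3) accepts disks specified as
     gluster[+transport]://server[:port]/volname/path/inside/volume.
     """
     return "gluster+%s://%s:%d/%s%s" % (transport, server, port, volume,
                                         image_path)

   # Example: gluster+tcp://127.0.0.1:24007/gv0/ganeti/instance-uuid.0
   print(gluster_uri("gv0", "/ganeti/instance-uuid.0"))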
Gluster overview
================
Gluster is a "brick" "translation" service that can turn a number of LVM logical
volume or disks (so-called "bricks") into an unified "volume" that can be
mounted over the network through FUSE or NFS.
This is a simplified view of what components are at play and how they
interconnect as data flows from the actual disks to the instances. The parts in
grey are available for Ganeti to use and included for completeness but not
targeted for implementation at this stage.
.. digraph:: "gluster-ganeti-overview"

   graph [ spline=ortho ]
   node [ shape=rect ]

   {
     node [ shape=none ]
     _volume [ label=volume ]

     bricks -> translators -> _volume
     _volume -> network [label=transport]
     network -> instances
   }

   { rank=same; brick1 [ shape=oval ]
                brick2 [ shape=oval ]
                brick3 [ shape=oval ]
                bricks }
   { rank=same; translators distribute }
   { rank=same; volume [ shape=oval ]
                _volume }
   { rank=same; instances instanceA instanceB instanceC instanceD }
   { rank=same; network FUSE NFS QEMUC QEMUD }

   {
     node [ shape=oval ]
     brick1 [ label=brick ]
     brick2 [ label=brick ]
     brick3 [ label=brick ]
   }

   {
     node [ shape=oval ]
     volume
   }

   brick1 -> distribute
   brick2 -> distribute
   brick3 -> distribute -> volume

   volume -> FUSE [ label=<TCP<br/><font color="grey">UDP</font>>
                    color="black:grey" ]

   NFS [ color=grey fontcolor=grey ]
   volume -> NFS [ label="TCP" color=grey fontcolor=grey ]
   NFS -> mountpoint [ color=grey fontcolor=grey ]

   mountpoint [ shape=oval ]

   FUSE -> mountpoint

   instanceA [ label=instances ]
   instanceB [ label=instances ]

   mountpoint -> instanceA
   mountpoint -> instanceB

   QEMUC [ label=QEMU ]
   QEMUD [ label=QEMU ]

   {
     instanceC [ label=instances ]
     instanceD [ label=instances ]
   }

   volume -> QEMUC [ label=<TCP<br/><font color="grey">UDP</font>>
                     color="black:grey" ]
   volume -> QEMUD [ label=<TCP<br/><font color="grey">UDP</font>>
                     color="black:grey" ]

   QEMUC -> instanceC
   QEMUD -> instanceD
brick:
  The unit of storage in Gluster. Typically a drive or LVM logical volume
  formatted using, for example, XFS.

distribute:
  One of the translators in Gluster; it assigns files to bricks based on the
  hash of their full path inside the volume.

volume:
  A filesystem you can mount on multiple machines; all machines see the same
  directory tree and files.

FUSE/NFS:
  Gluster offers two ways to mount volumes: through FUSE or through a custom
  NFS server that is incompatible with other NFS servers. FUSE is more
  compatible with other services running on the storage nodes; NFS gives
  better performance. For now, FUSE is a priority.

QEMU:
  QEMU 1.3 has the ability to use Gluster volumes directly in userspace,
  without the need for mounting anything. Ganeti still needs kernelspace
  access at disk creation and OS install time.

transport:
  FUSE and QEMU allow you to connect using TCP and UDP, whereas NFS only
  supports TCP. These protocols are called transports in Gluster. For now,
  TCP is a priority.
It is the administrator's duty to set up the bricks, the translators and thus
the volume as they see fit. Ganeti will take care of connecting the instances to
a given volume.
.. note::

  The gluster mountpoint must be whitelisted by the administrator in
  ``/etc/ganeti/file-storage-paths`` for security reasons, in order to allow
  Ganeti to modify the filesystem.
Why not use a ``sharedfile`` disk template?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Gluster volumes `can` be used by Ganeti through the generic shared file disk
template. There are a number of reasons why that is probably not a good idea,
however:

* Shared file, being a generic solution, cannot offer userspace access
  support.

* Even with userspace support, Ganeti still needs kernelspace access in order
  to create disks and install OSes on them. Ganeti can manage the mounting for
  you so that the Gluster servers only have as many connections as necessary.

* Experiments showed that you can't trust ``mount.glusterfs`` to give useful
  return codes or error messages. Ganeti can work around its oddities so
  administrators don't have to.

* The shared file folder scheme (``../{instance.name}/disk{disk.id}``) does
  not work well with Gluster. The ``distribute`` translator distributes files
  across bricks, but directories need to be replicated on `all` bricks. As a
  result, with over a thousand instances we would have over a thousand folders
  replicated on every brick. This does not scale well.

* This frees up the shared file disk template to use a different, unsupported
  replication scheme together with Gluster. (Storage pools are the long term
  solution for this, however.)

So, while Gluster `is` essentially a shared file disk template, Ganeti can
provide better support for it than the generic one.
Implementation strategy
=======================
Working with GlusterFS in kernel space essentially boils down to:

1. Ask FUSE to mount the Gluster volume.

2. Check that the mount succeeded.

3. Use files stored in the volume as instance disks, just like sharedfile
   does.

4. When the instances are spun down, attempt to unmount the volume. If the
   Gluster connection is still required, the mountpoint is allowed to remain.

However, since Gluster does not strictly need to be mounted if all that is
required is userspace access, it would be inappropriate for the Gluster
storage class to inherit from FileStorage. The implementation should therefore
use composition rather than inheritance:
1. Extract the ``FileStorage`` disk-facing logic into a ``FileDeviceHelper``
   class.

   * In order not to further inflate bdev.py, ``FileStorage`` should join its
     helper functions in filestorage.py (thus reducing their visibility), and
     Gluster should be added in its own file, gluster.py. Moving the other
     classes to their own files (like it has been done in ``lib/hypervisor/``)
     is not addressed as part of this design.

2. Use the ``FileDeviceHelper`` class to implement a ``GlusterStorage`` class
   in much the same way.

3. Add Gluster as a disk template that behaves like SharedFile in every way.

4. Provide Ganeti knowledge about what a ``GlusterVolume`` is and how to
   mount, unmount and reference one (see the sketch after this list).

   * Before attempting a mount, we should check whether the volume is already
     mounted. Linux allows mounting partitions multiple times, but then you
     also have to unmount them as many times as you mounted them to actually
     free the resources; this also makes the output of commands such as
     ``mount`` less useful.

   * Every time the device could be released (after instance shutdown, OS
     installation scripts or file creation), a single unmount is attempted.
     If the device is still busy (e.g. from other instances, jobs or open
     administrator shells), the failure is ignored.

5. Modify ``GlusterStorage`` and customize the disk template behavior to fit
   Gluster's needs.
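
The following is a minimal, hypothetical sketch of the mount/unmount behaviour
described in point 4 above. Class and method names are illustrative rather
than the final Ganeti API, and a real implementation would use Ganeti's own
command execution wrappers instead of calling ``subprocess`` directly:

.. code-block:: python

   # Hypothetical sketch of the mount/unmount behaviour described above.
   import subprocess


   class GlusterVolume(object):
     def __init__(self, server, port, volume, mount_point):
       self.server = server
       self.port = port
       self.volume = volume
       self.mount_point = mount_point

     def _IsMounted(self):
       """Check /proc/mounts instead of trusting mount.glusterfs output."""
       with open("/proc/mounts") as mounts:
         return any(line.split()[1] == self.mount_point for line in mounts)

     def Mount(self):
       """Mount the volume unless it is already mounted (no stacked mounts)."""
       if self._IsMounted():
         return True
       cmd = ["mount", "-t", "glusterfs",
              "%s:/%s" % (self.server, self.volume), self.mount_point]
       subprocess.check_call(cmd)
       # Do not trust the exit code alone; verify the mount actually appeared.
       return self._IsMounted()

     def Unmount(self):
       """Try a single unmount; if the volume is still busy, leave it alone."""
       if not self._IsMounted():
         return True
       result = subprocess.call(["umount", self.mount_point])
       return result == 0  # a failure (volume busy) is ignored by callers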
Directory structure
~~~~~~~~~~~~~~~~~~~
In order to address the shortcomings of the generic shared file handling of
the instance disk directory structure, Gluster uses a different scheme for
determining a disk's logical id and therefore its path on the file system.

The naming scheme is::

  /ganeti/{instance.uuid}.{disk.id}

...bringing the actual path on a node's file system to::

  /var/run/ganeti/gluster/ganeti/{instance.uuid}.{disk.id}

This means Ganeti only uses one folder on the Gluster volume (allowing other
uses of the Gluster volume in the meantime) and works better with how Gluster
distributes storage over its bricks.
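
As a quick illustration of the naming scheme (the helper below is ours, not
part of the Ganeti code base):

.. code-block:: python

   import os

   def gluster_disk_path(mount_dir, instance_uuid, disk_id):
     """Compute the on-disk path for an instance disk under the Gluster mount.

     Only one folder ("ganeti") is used inside the volume, so the distribute
     translator does not have to replicate one directory per instance.
     """
     return os.path.join(mount_dir, "ganeti",
                         "%s.%d" % (instance_uuid, disk_id))

   # Prints: /var/run/ganeti/gluster/ganeti/4c1c1c2b-8d40-4e58-9181-1e2ad66f5c4d.0
   print(gluster_disk_path("/var/run/ganeti/gluster",
                           "4c1c1c2b-8d40-4e58-9181-1e2ad66f5c4d", 0))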
Changes to the storage types system
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Ganeti has a number of storage types that abstract over disk templates. This
matters mainly in terms of disk space reporting. Gluster support is improved by
a rethinking of how disk templates are assigned to storage types in Ganeti.
This is the summary of the changes:
+--------------+---------+---------+-------------------------------------------+
| Disk         | Current | New     | Does it report storage information to...  |
| template     | storage | storage +-------------+----------------+------------+
|              | type    | type    | ``gnt-node  | ``gnt-node     | iallocator |
|              |         |         | list``      | list-storage`` |            |
+==============+=========+=========+=============+================+============+
| File         | File    | File    | Yes.        | Yes.           | Yes.       |
+--------------+---------+---------+-------------+----------------+------------+
| Shared file  | File    | Shared  | No.         | Yes.           | No.        |
+--------------+---------+ file    |             |                |            |
| Gluster (new)| N/A     | (new)   |             |                |            |
+--------------+---------+---------+-------------+----------------+------------+
| RBD (for     | RBD               | No.         | No.            | No.        |
| reference)   |                   |             |                |            |
+--------------+-------------------+-------------+----------------+------------+
Like RBD, Gluster and Shared File should not report storage information to
``gnt-node list`` or to IAllocators. Regrettably, the simplest way to achieve
this right now is to claim that storage reporting for the relevant storage
type is not implemented. An effort was made to claim that the shared storage
type supported disk reporting while refusing to provide any value, but it was
not successful (``hail`` does not support this combination).
To do so without breaking the File disk template, a new storage type must be
added. Like RBD, it does not claim to support disk reporting. However, we can
still make an effort to report stats to ``gnt-node list-storage``.

The rationale is simple. For shared file and Gluster storage, disk space is
not a function of any one node. If storage types with disk space reporting are
used, ``hail`` expects them to give useful numbers for allocation purposes,
but a shared storage system means disk balancing is no longer affected by
node-instance allocation. Moreover, it would be wasteful to mount a Gluster
volume on each node just to run statvfs() if no machine on that node was
actually running Gluster VMs.

As a result, Gluster support for ``gnt-node list-storage`` is necessarily
limited and nodes on which Gluster is available but not in use will report
failures.
Additionally, running ``gnt-node list`` will give an output like this::

  Node              DTotal DFree MTotal MNode MFree Pinst Sinst
  node1.example.com      ?     ?   744M  273M  477M     0     0
  node2.example.com      ?     ?   744M  273M  477M     0     0

This is expected and consistent with the behaviour of RBD.

An alternative would have been to report DTotal and DFree as 0 in order to
allow ``hail`` to ignore the disk information, but this incorrectly populates
the ``gnt-node list`` DTotal and DFree fields with 0s as well.
New configuration switches
~~~~~~~~~~~~~~~~~~~~~~~~~~
The following are configurable at the cluster and node group level
(``gnt-cluster modify``, ``gnt-group modify`` and other commands that support
the ``-D`` switch to edit disk parameters):

``gluster:host``
  The IP address or hostname of the Gluster server to connect to. In the
  default deployment of Gluster, that is any machine that is hosting a brick.

  Default: ``"127.0.0.1"``

``gluster:port``
  The port the Gluster server is listening on.

  Default: ``24007``

``gluster:volume``
  The volume Ganeti should use.

  Default: ``"gv0"``

The following is configurable at the cluster level only (``gnt-cluster
init``) and stored in ssconf for all nodes to read (just like shared file):

``--gluster-dir``
  Where the Gluster volume should be mounted.

  Default: ``/var/run/ganeti/gluster``

The default values work if all of the Ganeti nodes also host Gluster bricks.
This is possible, but `not` recommended as it can cause the host to hardlock
due to deadlocks in kernel memory (much in the same way as RBD).
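
Tying these defaults back to the hypothetical ``GlusterVolume`` sketch from
the implementation strategy section, usage could look roughly as follows
(illustrative only; the real code would read these values from the cluster's
disk parameters and ssconf):

.. code-block:: python

   # Illustrative only: reuses the hypothetical GlusterVolume sketch shown
   # earlier; real code would obtain these values from disk parameters/ssconf.
   volume = GlusterVolume(server="127.0.0.1", port=24007, volume="gv0",
                          mount_point="/var/run/ganeti/gluster")
   volume.Mount()    # idempotent: skips the mount if one is already in place
   # ... create disks, install the OS, start instances ...
   volume.Unmount()  # a failure here (volume still busy) is simply ignored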
Future work
===========
In no particular order:

* Support the UDP transport.

* Support mounting through NFS.

* Filter ``gnt-node list`` so DTotal and DFree are not shown for RBD and
  shared file disk templates, or otherwise report the disk storage values as
  "-" or some other special value to clearly distinguish them from the result
  of a communication failure between nodes.

* Allow configuring the in-volume path Ganeti uses.
.. vim: set textwidth=72 :
.. Local Variables: