Ganeti shared storage support

This document describes the changes in Ganeti 2.3+ compared to the
Ganeti 2.3 storage model. It also documents the ExtStorage Interface.

.. contents:: :depth: 4
.. highlight:: shell-example


The aim is to introduce support for externally mirrored, shared storage.
This includes two distinct disk templates:

- A shared filesystem containing instance disks as regular files
  typically residing on a networked or cluster filesystem (e.g. NFS,
  AFS, Ceph, OCFS2, etc.).
- Instance images being shared block devices, typically LUNs residing on
  a SAN appliance.


DRBD is currently the only shared storage backend supported by Ganeti.
DRBD offers the advantages of high availability while running on
commodity hardware at the cost of high network I/O for block-level
synchronization between hosts. DRBD's master-slave model has greatly
influenced Ganeti's design, primarily by introducing the concept of
primary and secondary nodes and thus defining an instance's “mobility
domain”.

Although DRBD has many advantages, many sites choose to use networked
storage appliances for Virtual Machine hosting, such as SAN and/or NAS,
which provide shared storage without the administrative overhead of DRBD
nor the limitation of a 1:1 master-slave setup. Furthermore, new
distributed filesystems such as Ceph are becoming viable alternatives to
expensive storage appliances. Support for both modes of operation, i.e.
shared block storage and shared file storage backends, would make Ganeti
a robust choice for high-availability virtualization clusters.

Throughout this document, the term “externally mirrored storage” will
refer to both modes of shared storage, suggesting that Ganeti does not
need to take care of the mirroring process from one host to another.

Use cases

We consider the following use cases:

- A virtualization cluster with FibreChannel shared storage, mapping at
  least one LUN per instance, accessible by the whole cluster.
- A virtualization cluster with instance images stored as files on an
  NFS server.
- A virtualization cluster storing instance images on a Ceph volume.

Design Overview

The design addresses the following procedures:

- Refactoring of all code referring to constants.DTS_NET_MIRROR.
- Obsolescence of the primary-secondary concept for externally mirrored
  storage.
- Introduction of a shared file storage disk template for use with networked
  filesystems.
- Introduction of a shared block device disk template with device
  adoption.
- Introduction of the External Storage Interface.

Additionally, mid- to long-term goals include:

- Support for external “storage pools”.

Refactoring of all code referring to constants.DTS_NET_MIRROR

Currently, all storage-related decision-making depends on a number of
frozensets in lib/, typically constants.DTS_NET_MIRROR.
However, constants.DTS_NET_MIRROR is used to signify two different
properties:

- A storage device that is shared
- A storage device whose mirroring is supervised by Ganeti

We propose the introduction of two new frozensets to ease
decision-making:

- constants.DTS_EXT_MIRROR, holding externally mirrored disk templates
- constants.DTS_MIRRORED, being a union of constants.DTS_EXT_MIRROR and

Additionally, DTS_NET_MIRROR will be renamed to DTS_INT_MIRROR to reflect
the status of the storage as internally mirrored by Ganeti.

Thus, checks could be grouped into the following categories:

- Mobility checks, like whether an instance failover or migration is
  possible, should check against constants.DTS_MIRRORED.
- Syncing actions should be performed only for templates in
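
The grouping of checks described above can be sketched in Python as
follows. This is only an illustration: the template names are
placeholders rather than Ganeti's actual values, and the sketch assumes
that DTS_MIRRORED is the union of the externally and internally mirrored
sets and that syncing applies to the internally mirrored set.

```python
# Placeholder template names, chosen for illustration only.
DT_DRBD = "drbd"
DT_SHARED_FILE = "sharedfile"
DT_SHARED_BLOCK = "blockdev"
DT_PLAIN = "plain"

# Templates whose mirroring is supervised by Ganeti
# (the set formerly called DTS_NET_MIRROR).
DTS_INT_MIRROR = frozenset([DT_DRBD])

# Templates whose mirroring happens outside Ganeti.
DTS_EXT_MIRROR = frozenset([DT_SHARED_FILE, DT_SHARED_BLOCK])

# Assumption: every mirrored template is either internal or external.
DTS_MIRRORED = DTS_INT_MIRROR | DTS_EXT_MIRROR


def can_move_instance(disk_template):
    """Mobility check: failover or migration needs mirrored storage."""
    return disk_template in DTS_MIRRORED


def needs_ganeti_sync(disk_template):
    """Sync check (assumed): only internally mirrored disks are synced."""
    return disk_template in DTS_INT_MIRROR
```

With these sets, a shared file instance passes the mobility check but
never triggers a sync action, while an unmirrored template fails both.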

Obsolescence of the primary-secondary node model

The primary-secondary node concept has primarily evolved through the use
of DRBD. In a globally shared storage framework without need for
external sync (e.g. SAN, NAS, etc.), such a notion does not apply for the
following reasons:

1. Access to the storage does not necessarily imply different roles for
   the nodes (e.g. primary vs secondary).
2. The same storage is available to potentially more than 2 nodes. Thus,
   an instance backed by a SAN LUN for example may actually migrate to
   any of the other nodes and not just a pre-designated failover node.

The proposed solution is using the iallocator framework for run-time
decision making during migration and failover, for nodes with disk
templates in constants.DTS_EXT_MIRROR. Modifications to gnt-instance and
gnt-node will be required to accept target node and/or iallocator
specification for these operations. Modifications of the iallocator
protocol will be required to address at least the following needs:

- Allocation tools must be able to distinguish between internal and
  external storage
- Migration/failover decisions must take into account shared storage
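
To illustrate why a pre-designated secondary node is no longer needed, the
following sketch computes the set of nodes an instance could move to. The
function name, its arguments and the template names are hypothetical; in
the proposed design the final choice among the candidates would be made
by an iallocator or by an explicitly given target node.

```python
# Placeholder template sets, for illustration only.
DTS_INT_MIRROR = frozenset(["drbd"])
DTS_EXT_MIRROR = frozenset(["sharedfile", "blockdev"])


def candidate_target_nodes(disk_template, primary, secondaries, group_nodes):
    """Return the nodes an instance may fail over or migrate to.

    Internally mirrored disks can only move to their secondary node(s);
    externally mirrored disks can move to any other node that sees the
    same storage (assumed here to be every node in the node group).
    """
    if disk_template in DTS_INT_MIRROR:
        return list(secondaries)
    if disk_template in DTS_EXT_MIRROR:
        return [node for node in group_nodes if node != primary]
    # Unmirrored templates cannot move at all.
    return []
```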

Introduction of a shared file disk template

Basic shared file storage support can be implemented by creating a new
disk template based on the existing FileStorage class, with only minor
modifications in lib/ The shared file disk template relies on a
shared filesystem (e.g. NFS, AFS, Ceph, OCFS2 over SAN or DRBD) being
mounted on all nodes under the same path, where instance images will be

A new cluster initialization option is added to specify the mountpoint
of the shared filesystem.

The remainder of this document deals with shared block storage.

Introduction of a shared block device template

Basic shared block device support will be implemented with an additional
disk template. This disk template will not feature any kind of storage
control (provisioning, removal, resizing, etc.), but will instead rely
on the adoption of already-existing block devices (e.g. SAN LUNs, NBD
devices, remote iSCSI targets, etc.).

The shared block device template will make the following assumptions:

- The adopted block device has a consistent name across all nodes,
  enforced e.g. via udev rules.
- The device will be available with the same path under all nodes in the
  node group.

Introduction of the External Storage Interface


To extend the shared block storage template and give Ganeti the ability
to control and manipulate external storage (provisioning, removal,
growing, etc.) we need a more generic approach. The generic method for
supporting external shared storage in Ganeti will be to have an
ExtStorage provider for each external shared storage hardware type. The
ExtStorage provider will be a set of files (executable scripts and text
files), contained inside a directory which will be named after the
provider. This directory must be present on all nodes of a nodegroup
(Ganeti doesn't replicate it) for the provider to be considered valid
and usable by Ganeti for that nodegroup. The external shared storage
hardware must also be accessible by all nodes of the nodegroup.

An “ExtStorage provider” will have to provide the following methods:

- Create a disk
- Remove a disk
- Grow a disk
- Attach a disk to a given node
- Detach a disk from a given node
- Set info on a disk (add metadata)
- Verify its supported parameters
- Snapshot a disk (optional)
- Open a disk (optional)
- Close a disk (optional)

The proposed ExtStorage interface borrows heavily from the OS
interface and follows a one-script-per-function approach. An ExtStorage
provider is expected to provide the following scripts:

- ``create``
- ``remove``
- ``grow``
- ``attach``
- ``detach``
- ``setinfo``
- ``verify``
- ``snapshot`` (optional)
- ``open`` (optional)
- ``close`` (optional)

All scripts will be called with no arguments and get their input via
environment variables. A common set of variables will be exported for
all commands, and some commands might have extra variables.

- The name of the volume. This is unique for Ganeti, which uses it to
  refer to a specific volume inside the external storage.
- The volume's size in mebibytes. Available only to the `create` and
  `grow` scripts.
- The new size of the volume after grow (in mebibytes). Available only
  to the `grow` script.
- ExtStorage parameter, where `name` is the parameter in upper-case
  (same as OS interface's ``OSP_*`` parameters).
- A string containing metadata to be set for the volume. This is
  exported only to the ``setinfo`` script.
- The human readable name of the disk (if any).
- The name of the volume's snapshot. Available only to the `snapshot`
  script.
- The size of the volume's snapshot. Available only to the `snapshot`
  script.
- Whether the volume will be accessed exclusively or not. Available
  only to the `open` script.

All scripts except `attach` should return 0 on success and non-zero on
error, accompanied by an appropriate error message on stderr. The
`attach` script should return a string on stdout on success, which is
the block device's full path, after it has been successfully attached to
the host node. On error it should return non-zero.
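
The calling convention above could be handled on the Ganeti side roughly
as sketched below. The helper names and the exception class are
hypothetical and are not taken from the Ganeti code base; the sketch only
shows the described contract: no arguments, input through the
environment, a zero exit code on success, and the device path on stdout
for `attach`.

```python
import os
import subprocess


class ExtStorageError(Exception):
    """Raised when an ExtStorage script reports a failure."""


def interpret_result(script, returncode, stdout, stderr):
    """Apply the return conventions to the outcome of one script run."""
    if returncode != 0:
        raise ExtStorageError(
            "%s failed with exit code %d: %s"
            % (script, returncode, stderr.strip()))
    if script == "attach":
        lines = stdout.splitlines()
        if not lines or not lines[0].strip():
            raise ExtStorageError("attach did not print a device path")
        # The first line of stdout is the block device's full path.
        return lines[0].strip()
    return None


def run_script(provider_dir, script, variables):
    """Run a provider script with no arguments and extra environment."""
    env = dict(os.environ)
    env.update({name: str(value) for name, value in variables.items()})
    proc = subprocess.run(
        [os.path.join(provider_dir, script)],
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return interpret_result(script, proc.returncode, proc.stdout, proc.stderr)
```

Keeping the interpretation of the result separate from the process
invocation makes the convention easy to exercise without a real
provider.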

The ``snapshot``, ``open`` and ``close`` scripts were introduced after
the first implementation of the ExtStorage Interface. To keep backwards
compatibility with the first implementation, we make these scripts
optional.

The ``snapshot`` script, if present, will be used for instance backup
export. The ``open`` script makes the device ready for I/O. The ``close``
script disables the I/O on the device.


To support the ExtStorage interface, we will introduce a new disk
template called `ext`. This template will implement the existing Ganeti
disk interface in `lib/` (create, remove, attach, assemble,
shutdown, grow, setinfo, open, close),
and will simultaneously pass control to the
external scripts to actually handle the above actions. The `ext` disk
template will act as a translation layer between the current Ganeti disk
interface and the ExtStorage providers.

We will also introduce a new IDISK_PARAM called `IDISK_PROVIDER =
provider`, which will be used at the command line to select the desired
ExtStorage provider. This parameter will be valid only for template
`ext` e.g.::

  $ gnt-instance add -t ext --disk=0:size=2G,provider=sample_provider1

The ExtStorage interface will allow different disks to be created by
different providers, e.g.::

  $ gnt-instance add -t ext --disk=0:size=2G,provider=sample_provider1 \
                            --disk=1:size=1G,provider=sample_provider2 \

Finally, the ExtStorage interface will support passing of parameters to
the ExtStorage provider. This will also be done per disk, from the
command line::

 $ gnt-instance add -t ext --disk=0:size=1G,provider=sample_provider1,\

The above parameters will be exported to the ExtStorage provider's
scripts as the environment variables:

- `EXTP_PARAM1 = str(value1)`
- `EXTP_PARAM2 = str(value2)`
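
The mapping from per-disk parameters to environment variables can be
sketched as a small helper. The function name is hypothetical; the
``EXTP_`` prefix, the upper-casing of the parameter name and the string
conversion follow the description above.

```python
def extp_environment(params):
    """Turn per-disk ExtStorage parameters into EXTP_* variables."""
    return {
        "EXTP_" + name.upper(): str(value)
        for name, value in params.items()
    }
```

For example, ``extp_environment({"param1": "value1"})`` yields a
dictionary with the single key ``EXTP_PARAM1``.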

We will also introduce a new Ganeti client called `gnt-storage` which
will be used to diagnose ExtStorage providers and show information about
them, similarly to the way `gnt-os diagnose` and `gnt-os info` handle OS
definitions.

ExtStorage Interface support for userspace access


The ExtStorage Interface gets extended to cater for ExtStorage providers
that support userspace access. This will allow the instances to access
their external storage devices directly without going through a block
device, avoiding expensive context switches with kernel space and the
potential for deadlocks in low memory scenarios. The implementation
should be backwards compatible and allow existing ExtStorage
providers to work as is.


Since the implementation should be backwards compatible, we are not going
to add a new script to the set of scripts an ExtStorage provider should
ship with. Instead, the 'attach' script, which is currently responsible
for mapping the block device and returning a valid device path, should
also be responsible for providing the URIs that will be used by each
hypervisor. Even though Ganeti currently allows userspace access only
for the KVM hypervisor, we want the implementation to enable
ExtStorage providers to support more than one hypervisor in the future.
More specifically, the 'attach' script will be allowed to return more
than one line. The first line will contain as always the block device
path. Each one of the extra lines will contain a URI to be used for the
userspace access by a specific hypervisor. Each URI should be prefixed
with the hypervisor it corresponds to (e.g. kvm:<uri>). The prefix will
be case insensitive. If the 'attach' script doesn't return any extra
lines, we assume that the ExtStorage provider doesn't support userspace
access (this way we maintain backward compatibility with the existing
'attach' scripts).

The 'GetUserspaceAccessUri' method of the 'ExtStorageDevice' class will
parse the output of the 'attach' script and if the provider supports
userspace access for the requested hypervisor, it will use the
corresponding URI instead of the block device itself.
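
A minimal sketch of this parsing logic is shown below. The function names
are hypothetical and do not reproduce the actual 'ExtStorageDevice'
code; the example URI used in the usage note is likewise made up.

```python
def parse_attach_output(output):
    """Split 'attach' output into the device path and per-hypervisor URIs."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    if not lines:
        raise ValueError("attach script returned no output")
    device_path = lines[0]
    uris = {}
    for line in lines[1:]:
        # Each extra line has the form "<hypervisor>:<uri>"; only the
        # first colon separates the prefix, the URI may contain more.
        hypervisor, sep, uri = line.partition(":")
        if not sep or not uri:
            raise ValueError("malformed URI line: %r" % line)
        # The hypervisor prefix is case insensitive.
        uris[hypervisor.lower()] = uri
    return device_path, uris


def access_target(output, hypervisor):
    """Return the URI for the hypervisor, or the block device path."""
    device_path, uris = parse_attach_output(output)
    return uris.get(hypervisor.lower(), device_path)
```

Given the two-line output ``/dev/rbd0`` followed by
``KVM:rbd:pool/vol``, a request for ``kvm`` yields ``rbd:pool/vol``,
while a request for any other hypervisor falls back to ``/dev/rbd0``.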

Long-term shared storage goals

Storage pool handling

A new cluster configuration attribute will be introduced, named
“storage_pools”, modeled as a dictionary mapping storage pools to
external storage providers (see below), e.g.::

 {
  "nas1": "foostore",
  "nas2": "foostore",
  "cloud1": "barcloud",
 }

Ganeti will not interpret the contents of this dictionary, although it
will provide methods for manipulating them under some basic constraints
(pool identifier uniqueness, driver existence). The manipulation of
storage pools will be performed by implementing new options to the
`gnt-cluster` command::

 $ gnt-cluster modify --add-pool nas1 foostore
 $ gnt-cluster modify --remove-pool nas1 # There must be no instances using
                                         # the pool to remove it

Furthermore, the storage pools will be used to indicate the availability
of storage pools to different node groups, thus specifying the
instances' “mobility domain”.

The pool in which to place the new instance's disk will be specified on
the command line during `instance add`. This will become possible by
replacing the IDISK_PROVIDER parameter with a new one, called `IDISK_POOL
= pool`. The cmdlib logic will then look at the cluster-level mapping
dictionary to determine the ExtStorage provider for the given pool.
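
The pool handling described in this section can be sketched as a thin
wrapper around the mapping dictionary. The class and method names are
hypothetical; the sketch only enforces the constraints mentioned above
(pool identifier uniqueness, driver existence, and no removal of a pool
that is still in use) and performs the pool-to-provider lookup.

```python
class StoragePools:
    """Cluster-level mapping of storage pools to ExtStorage providers."""

    def __init__(self, known_providers):
        self._providers = set(known_providers)
        self._pools = {}

    def add_pool(self, pool, provider):
        if pool in self._pools:
            raise ValueError("pool %r already exists" % pool)
        if provider not in self._providers:
            raise ValueError("unknown provider %r" % provider)
        self._pools[pool] = provider

    def remove_pool(self, pool, pools_in_use=()):
        if pool not in self._pools:
            raise KeyError(pool)
        if pool in pools_in_use:
            raise ValueError("pool %r is still used by instances" % pool)
        del self._pools[pool]

    def provider_for(self, pool):
        """Look up the ExtStorage provider backing the given pool."""
        return self._pools[pool]
```

Ganeti itself would not interpret the pool names; it would only keep
the mapping consistent and resolve a pool to its provider when a disk is
created.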

The ``gnt-storage`` client can be extended to support pool management
(creation/modification/deletion of pools, connection/disconnection of
pools to nodegroups, etc.). It can also be extended to diagnose and
provide information for internal disk templates too, such as lvm and

.. vim: set textwidth=72 :