diff --git a/Makefile.am b/Makefile.am
index cc4c523f82cdb14031839de94b01b9fe29be708d..b4cff3158de54025e759c78a51d8edb91e1839e7 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -275,6 +275,7 @@ docrst = \
 	doc/design-query2.rst \
 	doc/design-x509-ca.rst \
 	doc/design-http-server.rst \
+	doc/design-impexp2.rst \
 	doc/cluster-merge.rst \
 	doc/design-shared-storage.rst \
 	doc/devnotes.rst \
diff --git a/doc/design-draft.rst b/doc/design-draft.rst
index 40dc2db9dd5e5b25ec69d116d0a28983a15e703d..07148383a9c49fb3e4258d499c4d5d9383d6ade9 100644
--- a/doc/design-draft.rst
+++ b/doc/design-draft.rst
@@ -7,6 +7,7 @@ Design document drafts
+   design-impexp2.rst
 .. vim: set textwidth=72 :
 .. Local Variables:
diff --git a/doc/design-impexp2.rst b/doc/design-impexp2.rst
new file mode 100644
index 0000000000000000000000000000000000000000..5b996fe1161732695892c264e450598f7e4e8738
--- /dev/null
+++ b/doc/design-impexp2.rst
@@ -0,0 +1,559 @@
+Design for import/export version 2
+.. contents:: :depth: 4
+Current state and shortcomings
+Ganeti 2.2 introduced :doc:`inter-cluster instance moves <design-2.2>`
+and replaced the import/export mechanism with the same technology. It
+has since become clear that the chosen implementation was too
+complicated and can be difficult to debug.
+The old implementation is henceforth called "version 1". It used
+``socat`` in combination with a rather complex tree of ``bash`` and
+Python utilities to move instances between clusters and import/export
+them inside the cluster. Due to protocol limitations, the master daemon
+starts a daemon on the involved nodes and then keeps polling a status
+file for updates. A non-trivial number of timeouts ensures that jobs
+don't freeze.
+In version 1, the destination node would start a daemon listening on a
+random TCP port. Upon receiving the destination information, the source
+node would temporarily stop the instance, create snapshots, and start
+exporting the data by connecting to the destination. The random TCP port
+is chosen by the operating system by binding the socket to port 0.
+While this is a somewhat elegant solution, it causes problems in setups
+with restricted connectivity (e.g. iptables).
+Another issue encountered was with dual-stack IPv6 setups. ``socat`` can
+only listen on one protocol, IPv4 or IPv6, at a time. The connecting
+node can not simply resolve the DNS name, but it must be told the exact
+IP address.
+Instance OS definitions can provide custom import/export scripts. They
+were working well in the early days when a filesystem was usually
+created directly on the block device. Around Ganeti 2.0 there was a
+transition to using partitions on the block devices. Import/export
+scripts could no longer use simple ``dump`` and ``restore`` commands,
+but usually ended up doing raw data dumps.
+Proposed changes
+Unlike in version 1, in version 2 the destination node will connect to
+the source. The active side is swapped. This design assumes the
+following design documents have been implemented:
+- :doc:`design-x509-ca`
+- :doc:`design-http-server`
+The following design is mostly targeted at inter-cluster instance
+moves. Intra-cluster import and export use the same technology, but do
+so in a less complicated way (e.g. reusing the node daemon certificate
+in version 1).
+Support for instance OS import/export scripts, which have been in Ganeti
+since the beginning, will be dropped with this design. Should the need
+arise, they can be re-added later.
+Software requirements
+- HTTP client: cURL/pycURL (already used for inter-node RPC and RAPI
+  client)
+- Authentication: X509 certificates (server and client)
+Instead of a home-grown, mostly raw protocol the widely used HTTP
+protocol will be used. Ganeti already uses HTTP for its :doc:`Remote API
+<rapi>` and inter-node communication. Encryption and authentication will
+be implemented using SSL and X509 certificates.
+SSL certificates
+The source machine will identify connecting clients by their SSL
+certificate. Unknown certificates will be refused.
+Version 1 created a new self-signed certificate per instance
+import/export, allowing the certificate to be used as a Certificate
+Authority (CA). This worked by means of starting a new ``socat``
+instance per instance import/export.
+Under the version 2 model, a continuously running HTTP server will be
+used. This disallows the use of self-signed certificates for
+authentication as the CA needs to be the same for all issued
+certificates.
+See the :doc:`separate design document for more details on how the
+certificate authority will be implemented <design-x509-ca>`.
+Local imports/exports will, like version 1, use the node daemon's
+certificate/key. Doing so allows the verification of local connections.
+The client's certificate can be exported to the CGI/FastCGI handler
+using lighttpd's ``ssl.verifyclient.exportcert`` setting. If a
+cluster-local import/export is being done, the handler verifies if the
+used certificate matches with the local node daemon key.
+The source can be the same physical machine as the destination, another
+node in the same cluster, or a node in another cluster. A
+physical-to-virtual migration mechanism could be implemented as an
+alternative source.
+In the case of a traditional import, the source is usually a file on the
+source machine. For exports and remote imports, the source is an
+instance's raw disk data. In all cases the transported data is opaque to
+Ganeti.
+All nodes of a cluster will run an instance of Lighttpd. The
+configuration is automatically generated when starting Ganeti. The HTTP
+server is configured to listen on IPv4 and IPv6 simultaneously.
+Imports/exports will use a dedicated TCP port, similar to the Remote
+API.
+See the separate :ref:`HTTP server design document
+<http-srv-shortcomings>` for why Ganeti's existing, built-in HTTP server
+is not a good choice.
+The source cluster is provided with an X509 Certificate Signing Request
+(CSR) for a key private to the destination cluster.
+After shutting down the instance, creating snapshots and restarting the
+instance the master will sign the destination's X509 certificate using
+the :doc:`X509 CA <design-x509-ca>` once per instance disk. Instead of
+using another identifier, the certificate's serial number (:ref:`never
+reused <x509-ca-serial>`) and fingerprint are used to identify incoming
+requests. Once ready, the master will call an RPC method on the source
+node and provide it with the input information (e.g. file paths or block
+devices) and the certificate identities.
+The RPC method will write the identities to a place accessible by the
+HTTP request handler, generate unique transfer IDs and return them to
+the master. The transfer ID could be a filename containing the
+certificate's serial number, fingerprint and some disk information. The
+file containing the per-transfer information is signed using the node
+daemon key and the signature written to a separate file.
+Once everything is in place, the master sends the certificates, the data
+and notification URLs (which include the transfer IDs) and the public
+part of the source's CA to the job submitter. Like in version 1,
+everything will be signed using the cluster domain secret.
+Upon receiving a request, the handler verifies the identity and
+continues to stream the instance data. The serial number and fingerprint
+contained in the transfer ID should be matched with the certificate
+used. If a cluster-local import/export was requested, the remote's
+certificate is verified with the local node daemon key. The signature of
+the information file from which the handler takes the path of the block
+device (and more) is verified using the local node daemon certificate.
+There are two options for handling requests, :ref:`CGI
+<lighttpd-cgi-opt>` and :ref:`FastCGI <lighttpd-fastcgi-opt>`.
+To wait for all requests to finish, the master calls another RPC method.
+The destination should notify the source once it's done with downloading
+the data. Since this notification may never arrive (e.g. network
+issues), an additional timeout needs to be used.
+There is no good way to avoid polling as the HTTP requests will be
+handled asynchronously in another process. Once, and if, implemented
+:ref:`RPC feedback <rpc-feedback>` could be used to combine the two RPC
+methods.
+Upon completion of the transfer requests, the instance is removed if
+requested.
+.. _lighttpd-cgi-opt:
+Option 1: CGI
+While easier to implement, this option requires the HTTP server to
+either run as "root" or use a so-called SUID binary to elevate the
+started process to run as "root".
+The export data can be sent directly to the HTTP server without any
+further processing.
+.. _lighttpd-fastcgi-opt:
+Option 2: FastCGI
+Unlike plain CGI, FastCGI scripts are run separately from the webserver.
+The webserver talks to them via a Unix socket. Webserver and scripts can
+run as separate users. Unlike for CGI, there are almost no bootstrap
+costs attached to each request.
+The FastCGI protocol requires data to be sent in length-prefixed
+packets, something which wouldn't be very efficient to do in Python for
+large amounts of data (instance imports/exports can be hundreds of
+gigabytes). For this reason the proposal is to use a wrapper program
+written in C (e.g. `fcgiwrap
+<http://nginx.localdomain.pl/wiki/FcgiWrap>`_) and to write the handler
+like an old-style CGI program with standard input/output. If data should
+be copied from a file, ``cat``, ``dd`` or ``socat`` can be used (see
+note about :ref:`sendfile(2)/splice(2) with Python <python-sendfile>`).
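To illustrate the length-prefixed framing mentioned above, the following is a minimal sketch of how a payload would be split into FastCGI ``FCGI_STDOUT`` records. The header layout and constants follow the FastCGI specification; the function name ``fcgi_records`` and the request ID are made up for illustration and are not part of this design.

```python
import struct

FCGI_VERSION_1 = 1
FCGI_STDOUT = 6
# contentLength is a 16-bit field, so one record carries at most 65535 bytes
FCGI_MAX_CONTENT = 0xFFFF


def fcgi_records(request_id, data, rec_type=FCGI_STDOUT):
  """Yield FastCGI records (8-byte header plus payload) for data."""
  for start in range(0, len(data), FCGI_MAX_CONTENT):
    chunk = data[start:start + FCGI_MAX_CONTENT]
    # version, type, requestId, contentLength, paddingLength, reserved
    header = struct.pack("!BBHHBB", FCGI_VERSION_1, rec_type,
                         request_id, len(chunk), 0, 0)
    yield header + chunk


# 100000 bytes do not fit into a single record
records = list(fcgi_records(1, b"\0" * 100000))
```

Every 64 KB of payload needs its own header and a slice of the data, which is the per-packet work a C wrapper takes off the Python handler.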
+The bootstrap cost associated with starting a Python interpreter for
+a disk export is expected to be negligible.
+The `spawn-fcgi <http://cgit.stbuehler.de/gitosis/spawn-fcgi/about/>`_
+program will be used to start the CGI wrapper as "root".
+FastCGI is, in the author's opinion, the better choice as it allows user
+separation. As a first implementation step the export handler can be run
+as a standard CGI program. User separation can be implemented as a
+second step.
+The destination can be the same physical machine as the source, another
+node in the same cluster, or a node in another cluster. While not
+considered in this design document, instances could be exported from the
+cluster by implementing an external client for exports.
+For traditional exports the destination is usually a file on the
+destination machine. For imports and remote exports, the destination is
+an instance's disks. All transported data is opaque to Ganeti.
+Before an import can be started, an RSA key and corresponding
+Certificate Signing Request (CSR) must be generated using the new opcode
+``OpInstanceImportPrepare``. The returned information is signed using
+the cluster domain secret. The RSA key backing the CSR must not leave
+the destination cluster. After being passed through a third party, the
+source cluster will generate signed certificates from the CSR.
+Once the request for creating the instance arrives at the master daemon,
+it'll create the instance and call an RPC method on the instance's
+primary node to download all data. The RPC method does not return until
+the transfer is complete or failed (see :ref:`EXP_SIZE_FD <exp-size-fd>`
+and :ref:`RPC feedback <rpc-feedback>`).
+The node will use pycURL to connect to the source machine and identify
+itself with the signed certificate received. pycURL will be configured
+to write directly to a file descriptor pointing to either a regular file
+or block device. The file descriptor needs to point to the correct
+offset for resuming downloads.
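A minimal sketch of positioning the file descriptor for a resumed download, using only the standard library. The helper name ``open_for_resume`` is hypothetical. It returns the descriptor together with the matching ``Range`` header value; how both are handed to pycURL is left open here.

```python
import os


def open_for_resume(path, offset):
  """Open path for writing and position the descriptor at offset.

  Returns the file descriptor and the value for a "Range" request
  header asking for everything from offset onwards.
  """
  fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
  os.lseek(fd, offset, os.SEEK_SET)
  return fd, "bytes=%d-" % offset
```

The same call works for a block device path, in which case ``os.O_CREAT`` has no effect because the device node already exists.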
+Using cURL's multi interface, more than one transfer can be made at the
+same time. While parallel transfers are used by the version 1
+import/export, it can be decided at a later time whether to use them in
+version 2 too. More investigation is necessary to determine whether
+``CURLOPT_MAXCONNECTS`` is enough to limit the number of connections or
+whether more logic is necessary.
+If a transfer fails before it's finished (e.g. timeout or network
+issues) it should be retried using an exponential backoff delay. The
+opcode submitter can specify for how long the transfer should be
+retried.
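The retry behaviour could look roughly like the sketch below. The initial delay, growth factor and upper bound are assumptions, as are the names ``backoff_delays`` and ``retry_transfer``. ``transfer_fn`` stands for a hypothetical callable that performs one download attempt and returns ``True`` on success.

```python
import time


def backoff_delays(initial=1.0, factor=2.0, max_delay=60.0):
  """Yield an endless series of exponentially growing delays (seconds)."""
  delay = initial
  while True:
    yield min(delay, max_delay)
    delay *= factor


def retry_transfer(transfer_fn, deadline, sleep_fn=time.sleep,
                   time_fn=time.time):
  """Call transfer_fn until it succeeds or the deadline would be exceeded.

  deadline is an absolute timestamp derived from the retry duration
  given by the opcode submitter.
  """
  for delay in backoff_delays():
    if transfer_fn():
      return True
    if time_fn() + delay > deadline:
      return False
    sleep_fn(delay)
```

Passing ``sleep_fn`` and ``time_fn`` as parameters keeps the loop testable without actually waiting.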
+At the end of a transfer, successful or not, the source cluster must be
+notified. At the same time the RSA key needs to be destroyed.
+Support for HTTP proxies can be implemented by setting
+``CURLOPT_PROXY``. Proxies could be used for moving instances in/out of
+restricted network environments or across protocol borders (e.g. IPv4
+networks unable to talk to IPv6 networks).
+The big picture for instance moves
+#. ``OpInstanceImportPrepare`` (destination cluster)
+  Create RSA key and CSR (certificate signing request), return signed
+  with cluster domain secret.
+#. ``OpBackupPrepare`` (source cluster)
+  Becomes a no-op in version 2, but see :ref:`backwards-compat`.
+#. ``OpBackupExport`` (source cluster)
+  - Receives destination cluster's CSR, verifies signature using
+    cluster domain secret.
+  - Creates certificates using CSR and :doc:`cluster CA
+    <design-x509-ca>`, one for each disk
+  - Stop instance, create snapshots, start instance
+  - Prepare HTTP resources on node
+  - Send certificates, URLs and CA certificate to job submitter using
+    feedback mechanism
+  - Wait for all transfers to finish or fail (with timeout)
+  - Remove snapshots
+#. ``OpInstanceCreate`` (destination cluster)
+  - Receives certificates signed by destination cluster, verifies
+    certificates and URLs using cluster domain secret
+    Note that the parameters should be implemented in a generic way
+    allowing future extensions, e.g. to download disk images from a
+    public, remote server. The cluster domain secret allows Ganeti to
+    check data received from a third party, but since this won't work
+    with such extensions, other checks will have to be designed.
+  - Create block devices
+  - Download every disk from source, verified using remote's CA and
+    authenticated using signed certificates
+  - Destroy RSA key and certificates
+  - Start instance
+.. TODO: separate create from import?
+.. _impexp2-http-resources:
+HTTP resources on source
+The HTTP resources listed below will be made available by the source
+machine. The transfer ID is generated while preparing the export and is
+unique per disk and instance. No caching should be used and the
+``Pragma`` (HTTP/1.0) and ``Cache-Control`` (HTTP/1.1) headers set
+accordingly by the server.
+``GET /transfers/[transfer_id]/contents``
+  Dump disk contents. Important request headers:
+  ``Accept`` (:rfc:`2616`, section 14.1)
+    Specify preferred media types. Only one type is supported in the
+    initial implementation:
+    ``application/octet-stream``
+      Request raw disk content.
+    If support for more media types were to be implemented in the
+    future, the "q" parameter used for "indicating a relative quality
+    factor" needs to be used. In the meantime parameters need to be
+    expected, but can be ignored.
+    If support for OS scripts were to be re-added in the future, the
+    MIME type ``application/x-ganeti-instance-export`` is hereby
+    reserved for disk dumps using an export script.
+    If the source can not satisfy the request the response status code
+    will be 406 (Not Acceptable). Successful requests will specify the
+    used media type using the ``Content-Type`` header. Unless only
+    exactly one media type is requested, the client must handle the
+    different response types.
+  ``Accept-Encoding`` (:rfc:`2616`, section 14.3)
+    Specify desired content coding. Supported are ``identity`` for
+    uncompressed data, ``gzip`` for compressed data and ``*`` for any.
+    The response will include a ``Content-Encoding`` header with the
+    actual coding used. If the client specifies an unknown coding, the
+    response status code will be 406 (Not Acceptable).
+    If the client specifically needs compressed data (see
+    :ref:`impexp2-compression`) but only gets ``identity``, it can
+    either compress locally or abort the request.
+  ``Range`` (:rfc:`2616`, section 14.35)
+    Raw disk dumps can be resumed using this header (e.g. after a
+    network issue).
+    If this header was given in the request and the source supports
+    resuming, the status code of the response will be 206 (Partial
+    Content) and it'll include the ``Content-Range`` header as per
+    :rfc:`2616`. If it does not support resuming or the request was not
+    specifying a range, the status code will be 200 (OK).
+    Only a single byte range is supported. cURL does not support
+    ``multipart/byteranges`` responses by itself. Even if they could be
+    somehow implemented, doing so would be of doubtful benefit for
+    import/export.
+    For raw data dumps handling ranges is pretty straightforward by just
+    dumping the requested range.
+    cURL will fail with the error code ``CURLE_RANGE_ERROR`` if a
+    request included a range but the server can't handle it. The request
+    must be retried without a range.
+``POST /transfers/[transfer_id]/done``
+  Use this resource to notify the source when transfer is finished (even
+  if not successful). The status code will be 204 (No Content).
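As a sketch of the single-range handling described for the ``Range`` header, a handler could parse the header as shown below. The function names are hypothetical. Suffix ranges (``bytes=-500``) and multiple ranges are deliberately treated as unsupported. Treating an out-of-bounds range like a missing header is a simplification; a stricter handler might answer with status 416 instead.

```python
import re

_RANGE_RE = re.compile(r"^bytes=(\d+)-(\d*)$")


def parse_single_range(value, size):
  """Parse a single-range "Range" header for a resource of size bytes.

  Returns inclusive (first, last) offsets for a 206 response, or None
  if the whole resource should be sent with status 200.
  """
  if not value:
    return None
  m = _RANGE_RE.match(value.strip())
  if not m:
    return None
  first = int(m.group(1))
  last = int(m.group(2)) if m.group(2) else size - 1
  last = min(last, size - 1)
  if first > last:
    return None
  return (first, last)


def content_range(first, last, size):
  """Format the "Content-Range" header value for a 206 response."""
  return "bytes %d-%d/%d" % (first, last, size)
```

A resumed download of a 1000-byte disk from offset 100 would thus be answered with ``Content-Range: bytes 100-999/1000``.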
+Code samples
+pycURL to file
+.. highlight:: python
+The following code sample shows how to write downloaded data directly to
+a file without pumping it through Python::
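+  import pycurl
+  # Binary mode: cURL writes the raw response bytes to the file
+  with open("googlecom.html", "wb") as fh:
+    curl = pycurl.Curl()
+    curl.setopt(pycurl.URL, "http://www.google.com/")
+    curl.setopt(pycurl.WRITEDATA, fh)
+    curl.perform()
+    curl.close()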
+This works equally well if the file descriptor is a pipe to another
+process.
+.. _backwards-compat:
+Backwards compatibility
+.. _backwards-compat-v1:
+Version 1
+The old inter-cluster import/export implementation described in the
+:doc:`Ganeti 2.2 design document <design-2.2>` will be supported for at
+least one minor (2.x) release. Intra-cluster imports/exports will use
+the new version right away.
+.. _exp-size-fd:
+Together with the improved import/export infrastructure Ganeti 2.2
+allowed instance export scripts to report the expected data size. This
+was then used to provide the user with an estimated remaining time.
+Version 2 no longer supports OS import/export scripts and therefore
+``EXP_SIZE_FD`` is no longer needed.
+.. _impexp2-compression:
+Version 1 used explicit compression using ``gzip`` for transporting
+data, but the dumped files didn't use any compression. Version 2 will
+allow the destination to specify which encoding should be used. This way
+the transported data is already compressed and can be directly used by
+the client (see :ref:`impexp2-http-resources`). The cURL option
+``CURLOPT_ENCODING`` can be used to set the ``Accept-Encoding`` header.
+cURL will not decompress received data when
+``CURLOPT_HTTP_CONTENT_DECODING`` is set to zero (if another HTTP client
+library were used which doesn't support disabling transparent
+compression, a custom content-coding type could be defined, e.g.
+The HTTP/1.1 protocol (:rfc:`2616`) defines trailing headers for chunked
+transfers in section 3.6.1. This could be used to transfer a checksum at
+the end of an import/export. cURL supports trailing headers since
+version 7.14.1. Lighttpd doesn't seem to support them for FastCGI, but
+they appear to be usable in combination with an NPH CGI (No Parsed
+Headers).
+.. _lighttp-sendfile:
+Lighttpd allows FastCGI applications to send the special headers
+``X-Sendfile`` and ``X-Sendfile2`` (the latter with a range). Using
+these headers applications can send response headers and tell the
+webserver to serve regular file stored on the file system as a response
+body. The webserver will then take care of sending that file.
+Unfortunately this mechanism is restricted to regular files and can not
+be used for data from programs, neither direct nor via named pipes,
+without writing to a file first. The latter is not an option as instance
+data can be very large. Theoretically ``X-Sendfile`` could be used for
+sending the input for a file-based instance import, but that'd require
+the webserver to run as "root".
+.. _python-sendfile:
+Python does not include interfaces for the ``sendfile(2)`` or
+``splice(2)`` system calls. The latter can be useful for faster copying
+of data between file descriptors. There are some 3rd-party modules (e.g.
+http://pypi.python.org/pypi/py-sendfile/) and discussions
+(http://bugs.python.org/issue10882) for including support for
+``sendfile(2)``, but the latter is certainly not going to happen for the
+Python versions supported by Ganeti. Calling the function using the
+``ctypes`` module might be possible.
+Performance considerations
+The design described above was confirmed to be one of the better choices
+in terms of download performance with bigger block sizes. All numbers
+were gathered on the same physical machine with a single CPU and 1 GB of
+RAM while downloading 2 GB of zeros read from ``/dev/zero``. ``wget``
+(version 1.10.2) was used as the client, ``lighttpd`` (version 1.4.28)
+as the server. The numbers in the first line are in megabytes per
+second. The second line in each row is the CPU time spent in userland
+and in the system, respectively (measured for the CGI/FastCGI program
+using ``time
+  ----------------------------------------------------------------------
+  Block size                      4 KB    64 KB   128 KB    1 MB    4 MB
+  ======================================================================
+  Plain CGI script reading          83      174      180     122     120
+  from ``/dev/zero``
+                               0.6/3.9  0.1/2.4  0.1/2.2 0.0/1.9 0.0/2.1
+  ----------------------------------------------------------------------
+  FastCGI with ``fcgiwrap``,        86      167      170     177     174
+  ``dd`` reading from
+  ``/dev/zero``                  1.1/5  0.5/2.9  0.5/2.7 0.7/3.1 0.7/2.8
+  ----------------------------------------------------------------------
+  FastCGI with ``fcgiwrap``,        68      146      150     170     170
+  Python script copying from
+  ``/dev/zero`` to stdout
+                               1.3/5.1  0.8/3.7  0.7/3.3  0.9/2.9  0.8/3
+  ----------------------------------------------------------------------
+  FastCGI, Python script using      31       48       47       5       1
+  ``flup`` library (version
+  1.0.2) reading from
+  ``/dev/zero``
+                              23.5/9.8 14.3/8.5   16.1/8       -       -
+  ----------------------------------------------------------------------
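For reference, a block-wise copy loop of the kind measured in the "Python script copying" row could look like the sketch below. It is not the script used for the measurements; the function name and the default block size are assumptions.

```python
def copy_blocks(src, dst, block_size=128 * 1024):
  """Copy src to dst in chunks of block_size bytes.

  Returns the number of bytes copied.
  """
  copied = 0
  while True:
    buf = src.read(block_size)
    if not buf:
      break
    dst.write(buf)
    copied += len(buf)
  return copied
```

Calling it with binary file objects, for example a disk image opened with ``"rb"`` and ``sys.stdout.buffer`` on Python 3, streams the data without holding more than one block in memory.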
+It should be mentioned that the ``flup`` library is not implemented in
+the most efficient way, but even with some changes it doesn't get much
+faster. It is fine for small amounts of data, but not for huge
+amounts.
+Other considered solutions
+Another possible solution considered was to use ``socat`` like version 1
+did. Due to the changing model, a large part of the code would've
+required a rewrite anyway, while still not fixing all shortcomings. For
+example, ``socat`` could still listen on only one protocol, IPv4 or
+IPv6. Running two separate instances might have fixed that, but it'd get
+more complicated. Using an existing HTTP server will provide us with a
+number of other benefits as well, such as easier user separation between
+server and backend.
+.. vim: set textwidth=72 :
+.. Local Variables:
+.. mode: rst
+.. fill-column: 72
+.. End: