========================
Performance tests for QA
========================

.. contents:: :depth: 4

This design document describes performance tests to be added to QA in
order to measure performance changes over time.

Current state and shortcomings
==============================

Currently, only functional QA tests are performed. Those tests verify
the correct behaviour of Ganeti in various configurations, but are not
designed to continuously monitor the performance of Ganeti.

The current QA tests don't execute multiple tasks/jobs in parallel.
Therefore, Ganeti's locking code does not really receive any testing,
neither functionally nor performance-wise.

On the plus side, Ganeti's QA code does already measure the runtime of
individual tests, which is leveraged in this design.

Proposed changes
================

The tests to be added in the context of this design document focus on
two areas:

  * Job queue performance. How does Ganeti handle a lot of submitted
    jobs?
  * Parallel job execution performance. How well does Ganeti
    parallelize jobs?

Jobs are submitted to the job queue in sequential order, but the
execution of the jobs runs in parallel. All job submissions must
complete within a reasonable timeout.
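The sketch below illustrates this submission pattern. It assumes the
standard ``--submit`` option (which prints the job ID and returns
without waiting for the job to finish); the timeout value is purely
illustrative::

  import subprocess
  import time

  SUBMISSION_TIMEOUT = 10.0  # illustrative per-submission timeout

  def SubmitJob(cmd):
    """Submit a job, check submission time and return the job ID."""
    start = time.time()
    output = subprocess.check_output(cmd + ["--submit"], text=True)
    assert time.time() - start < SUBMISSION_TIMEOUT
    return output.split()[-1]  # the job ID printed by the CLI

  # Sequential submission; the jobs themselves execute in parallel.
  job_ids = [SubmitJob(["gnt-debug", "delay", "0.1"])
             for _ in range(10)]

  # Wait for all submitted jobs to finish.
  for jid in job_ids:
    subprocess.check_call(["gnt-job", "watch", jid])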

In order to make it easier to recognize performance related tests, all
tests added in the context of this design get a description with a
"PERFORMANCE: " prefix.

Job queue performance
---------------------

Tests targeting the job queue should eliminate external factors (like
network/disk performance or hypervisor delays) as much as possible, so
they are designed to run in a vcluster QA environment.

The following tests are added to the QA:

  * Submit the maximum number of instance creation jobs in parallel.
    As soon as a creation job succeeds, submit a removal job for this
    instance.
  * Submit as many instance creation jobs as there are nodes in the
    cluster in parallel (for non-redundant instances). Submit removal
    jobs as above.
  * For the maximum number of instances in the cluster, submit modify
    jobs (modifying hypervisor and backend parameters) in parallel.
  * For the maximum number of instances in the cluster, submit stop,
    start, reboot and reinstall jobs in parallel.
  * For the maximum number of instances in the cluster, submit
    multiple list and info jobs in parallel.
  * For the maximum number of instances in the cluster, submit move
    jobs in parallel. While the move operations are running, query
    instance information using info jobs. Those jobs are required to
    return within a reasonably low timeout.
  * For the maximum number of instances in the cluster, submit add-,
    remove- and list-tags jobs.
  * Submit 200 `gnt-debug delay` jobs with a delay of 0.1 seconds. To
    speed up submission, perform multiple job submissions in parallel.
    Verify that submitting jobs doesn't slow down significantly during
    the process, and that querying cluster information over CLI and
    RAPI succeeds in a timely fashion while the delay jobs are
    running/queued (see the sketch after this list).
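A possible shape for the delay-job test is sketched below; the thread
count and latency bounds are illustrative assumptions, not prescribed
values::

  import subprocess
  import time
  from concurrent.futures import ThreadPoolExecutor

  JOB_COUNT = 200
  MAX_LATENCY = 5.0  # assumed acceptable upper bound in seconds

  def SubmitDelayJob(_):
    """Submit one delay job and return its submission latency."""
    start = time.time()
    subprocess.check_output(
      ["gnt-debug", "delay", "--submit", "0.1"], text=True)
    return time.time() - start

  # Parallel submission to speed up filling the queue.
  with ThreadPoolExecutor(max_workers=10) as pool:
    latencies = list(pool.map(SubmitDelayJob, range(JOB_COUNT)))

  # Submission must not slow down significantly over the run.
  assert max(latencies) < MAX_LATENCY

  # Cluster queries must still return promptly while the delay jobs
  # are running/queued; a RAPI check would follow the same pattern.
  start = time.time()
  subprocess.check_call(["gnt-cluster", "info"])
  assert time.time() - start < MAX_LATENCY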

Parallel job execution performance
----------------------------------

Tests targeting the performance of parallel execution of "real" jobs
in close-to-production clusters should actually perform all operations,
such as creating disks and starting instances. This way, real-world
locking or waiting issues can be reproduced. Performing all those
operations does require quite some time though, so only a smaller
number of instances and parallel jobs can be tested realistically.

The following tests are added to the QA:

  * Submit twice as many instance creation requests as there are nodes
    in the cluster, using DRBD as the disk template.
    The job parameters are chosen according to best practice for
    parallel instance creation, without running the risk of creations
    failing due to too many parallel attempts.
    As soon as a creation job succeeds, submit a removal job for this
    instance (see the sketch after this list).
  * Submit twice as many instance creation requests as there are nodes
    in the cluster, using Plain as the disk template. As soon as a
    creation job succeeds, submit a removal job for this instance.
    This test can make better use of parallelism because only one
    node must be locked for an instance creation.
  * Create an instance using DRBD. Fail it over, migrate it, change
    its secondary node, reboot it and reinstall it, while creating an
    additional instance in parallel with each of those operations.
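The following sketch shows the creation/removal chaining used by the
DRBD test above. Instance names, OS and disk size are placeholder
values, and watching the jobs sequentially is a simplification of
"as soon as a creation job succeeds"::

  import subprocess

  def Submit(cmd):
    """Submit a command as a job and return the printed job ID."""
    output = subprocess.check_output(cmd + ["--submit"], text=True)
    return output.split()[-1]

  nodes = subprocess.check_output(
    ["gnt-node", "list", "--no-headers", "-o", "name"],
    text=True).split()

  create_jobs = {}
  for i in range(2 * len(nodes)):
    name = "perf-inst-%d" % i  # placeholder naming scheme
    jid = Submit(["gnt-instance", "add", "-t", "drbd",
                  "-o", "debootstrap", "-s", "1G", "--no-start",
                  name])
    create_jobs[jid] = name

  # Chain a removal job after each successful creation.
  for jid, name in create_jobs.items():
    subprocess.check_call(["gnt-job", "watch", jid])
    Submit(["gnt-instance", "remove", "-f", name])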

Future work
===========

Based on the results of the tests listed above, additional tests can
be added to cover more real-world use cases. Also, based on user
requests, specially crafted performance tests modeling those
workloads can be added too.

Additionally, the correlation between job submission time and job
queue size could be investigated. To that end, a snapshot of the job
queue could be taken before each job submission, in order to measure
submission time as a function of the jobs already in the queue.
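A possible measurement sketch, assuming ``gnt-job list`` as the queue
snapshot mechanism and ``gnt-debug delay`` as a cheap job to submit::

  import subprocess
  import time

  def QueueLength():
    """Return the number of jobs currently known to the queue."""
    output = subprocess.check_output(
      ["gnt-job", "list", "--no-headers"], text=True)
    return len(output.splitlines())

  # Record (queue size, submission time) pairs for later analysis.
  samples = []
  for _ in range(50):
    qlen = QueueLength()
    start = time.time()
    subprocess.check_output(
      ["gnt-debug", "delay", "--submit", "0.1"], text=True)
    samples.append((qlen, time.time() - start))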

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: