Commit 692ee302 authored by Thomas Thrainer's avatar Thomas Thrainer

Add design doc for performance tests

This design doc describes which tests are added in order to test the
performance of Ganeti, specifically when handling multiple jobs in
parallel.

Note that this design doc is submitted to stable-2.10 so performance
changes over different Ganeti versions can be captured.
Signed-off-by: default avatarThomas Thrainer <thomasth@google.com>
Reviewed-by: default avatarPetr Pudlak <pudlak@google.com>
parent 98370c75
......@@ -534,6 +534,7 @@ docinput = \
doc/design-optables.rst \
doc/design-ovf-support.rst \
doc/design-partitioned.rst \
doc/design-performance-tests.rst \
doc/design-query-splitting.rst \
doc/design-query2.rst \
doc/design-reason-trail.rst \
......
......@@ -19,6 +19,7 @@ Design document drafts
design-ceph-ganeti-support.rst
design-daemons.rst
design-hsqueeze.rst
design-performance-tests.rst
.. vim: set textwidth=72 :
.. Local Variables:
......
========================
Performance tests for QA
========================
.. contents:: :depth: 4
This design document describes performance tests to be added to QA in
order to measure performance changes over time.
Current state and shortcomings
==============================
Currently, only functional QA tests are performed. Those tests verify
the correct behaviour of Ganeti in various configurations, but are not
designed to continuously monitor the performance of Ganeti.
The current QA tests don't execute multiple tasks/jobs in parallel.
Therefore, the locking part of Ganeti does not really receive any
testing, neither functional nor performance wise.
On the plus side, Ganeti's QA code does already measure the runtime of
individual tests, which is leveraged in this design.
Proposed changes
================
The tests to be added in the context of this design document focus on
two areas:
* Job queue performance. How does Ganeti handle a lot of submitted
jobs?
* Parallel job execution performance. How well does Ganeti
parallelize jobs?
Jobs are submitted to the job queue in sequential order, but the
execution of the jobs runs in parallel. All job submissions must
complete within a reasonable timeout.
In order to make it easier to recognize performance related tests, all
tests added in the context of this design get a description with a
"PERFORMANCE: " prefix.
Job queue performance
---------------------
Tests targeting the job queue should eliminate external factors (like
network/disk performance or hypervisor delays) as much as possible, so
they are designed to run in a vcluster QA environment.
The following tests are added to the QA:
* Submit the maximum amount of instance create jobs in parallel. As
soon as a creation job succeeds, submit a removal job for this
instance.
* Submit as many instance create jobs as there are nodes in the
cluster in parallel (for non-redundant instances). Removal jobs
as above.
* For the maximum amount of instances in the cluster, submit modify
jobs (modify hypervisor and backend parameters) in parallel.
* For the maximum amount of instances in the cluster, submit stop,
start, reboot and reinstall jobs in parallel.
* For the maximum amount of instances in the cluster, submit multiple
list and info jobs in parallel.
* For the maximum amount of instances in the cluster, submit move
jobs in parallel. While the move operations are running, get
instance information using info jobs. Those jobs are required to
return within a reasonable low timeout.
* For the maximum amount of instances in the cluster, submit add-,
remove- and list-tags jobs.
* Submit 200 `gnt-debug delay` jobs with a delay of 1 seconds. To
speed up submission, perform multiple job submissions in parallel.
Verify that submitting jobs doesn't significantly slow down during
the process. Verify that querying cluster information over CLI and
RAPI succeeds in a timely fashion with the delay jobs
running/queued.
Parallel job execution performance
----------------------------------
Tests targeting the performance of parallel execution of "real" jobs
in close-to-production clusters should actually perform all operations,
such as creating disks and starting instances. This way, real world
locking or waiting issues can be reproduced. Performing all those
operations does requires quite some time though, so only a smaller
number of instances and parallel jobs can be tested realistically.
The following tests are added to the QA:
* Submitting twice as many instance creation request as there are
nodes in the cluster, using DRBD as disk template. As soon as a
creation job succeeds, submit a removal job for this instance.
* Create an instance using DRBD. Fail it over, migrate it, recreate
its disk, change its secondary node, reboot it and reinstall it
while creating an additional instance in parallel to each of those
operations.
Future work
===========
Based on test results of the tests listed above, additional tests can
be added to cover more real-world use-cases. Also, based on user
requests, specially crafted performance tests modeling those workloads
can be added too.
Additionally, the correlations between job submission time and job
queue size could be detected. Therefore, a snapshot of the job queue
before job submission could be taken to measure job submission time
based on the jobs in the queue.
.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End:
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment