  1. Feb 18, 2011
  2. Jan 28, 2011
  3. Jan 12, 2011
    • Run pylint over QA code too · 3582eef6
      Iustin Pop authored
      
      Right now, the QA code is not covered by pylint, and this shows at
      least one low-impact bug.
      
      This patch makes the necessary changes to make the QA code
      pylint-clean, and changes the makefile to run pylint over it.
      
      Notable changes:
      
      - qa_utils.GenericQueryTest: randfields was not used at all, and my
        belief is that it was intended to be used in order not to modify the
        input list; so I replaced randfields with fields, so we only shuffle
        our local copy
      - qa_node.TestOutOfBand was using its own copy of AcquireNode(), so I
        replaced it with the existing version
      - qa_os: was using 'dir' in a couple of places, replaced with dirname
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>
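
      The GenericQueryTest change above boils down to shuffling a private
      copy of the field list instead of the caller's list. A minimal sketch
      of that pattern (the function body and surrounding details are
      invented for illustration, not the actual qa_utils code):

      import random

      def GenericQueryTest(cmd, fields):
        """Illustrative sketch: run a query command with fields in random order."""
        # Shuffle a local copy so the caller's list is left untouched,
        # which is the point of the randfields/fields fix described above.
        randfields = list(fields)
        random.shuffle(randfields)
        # ... running ``cmd`` with the shuffled field list is omitted here ...
        return randfields
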
    • QA: use a persistent SSH connection to the master · f7e6f3c8
      Iustin Pop authored
      
      The recent additions to QA (many more tests) make QA slow if the
      machine on which the QA runs is not very close to the tested nodes,
      or in general when the SSH handshake is costly.
      
      We discussed using a persistent connection before, and here is the
      patch that implements it. On a very small QA run (very, very small),
      it cuts the total time down a lot (almost half), so it should be
      useful even for a full QA.
      
      I've also thought about changing from external ssh to paramiko, but I
      estimated that it would be more work to correctly interleave the IO
      from the remote process than just running a background SSH.
      
      Also note that yes, the global dict is ugly, but I don't know of
      another simple way to implement this.
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>
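
      The commit text doesn't show how the persistent connection is kept
      alive; one common way to get this effect is OpenSSH connection
      multiplexing (ControlMaster/ControlPath). The sketch below is only an
      assumption of how such a helper could look; the helper names, the
      socket path and the global dict layout are invented, not the actual
      QA code:

      import subprocess

      _MULTIPLEXERS = {}  # node name -> (control socket path, master process)

      def StartSSHMaster(node):
        """Start a background master connection to ``node`` (illustrative only)."""
        sockpath = "/tmp/qa-ssh-%s.sock" % node  # hypothetical socket location
        proc = subprocess.Popen(["ssh", "-oControlMaster=yes",
                                 "-oControlPath=%s" % sockpath,
                                 "-oBatchMode=yes", "-l", "root", "-N", node])
        _MULTIPLEXERS[node] = (sockpath, proc)

      def RunOnNode(node, cmd):
        """Run ``cmd`` on ``node``, reusing the master connection and thus
        avoiding a new SSH handshake for every command."""
        sockpath, _ = _MULTIPLEXERS[node]
        return subprocess.call(["ssh", "-oControlPath=%s" % sockpath,
                                "-l", "root", node, cmd])
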
    • QA: Fix duplicated OOB tests · 69df9d2b
      Iustin Pop authored
      
      Patch f55312bd added the OOB tests to TestClusterVerify, which is not
      actually a test for cluster verify but a runner for cluster verify
      that is called multiple times, for each instance type, etc. This led
      to running the OOB commands multiple times, which is painful,
      especially as this is a slow test.
      
      The patch moves this to a separate test that is run only once.
      
      Furthermore, the way the data files are copied around is very
      inefficient: touch + mv + chmod + mv + rm for each node (5 calls
      times the number of nodes), whereas it could simply be: touch on
      master, chmod on master, cluster copyfile, chmod on master, cluster
      copyfile, cluster command rm, i.e. a small fixed number of ssh calls
      to the master. The code is changed accordingly, for increased speed.
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>
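
      As a rough illustration of the fixed-cost approach described above
      (all work driven through the master, with the per-node fan-out left
      to Ganeti's own gnt-cluster copyfile and gnt-cluster command), using
      an invented helper rather than the actual QA code:

      import subprocess

      def _OnMaster(master, cmd):
        """Run a single shell command on the master node over ssh (illustrative)."""
        subprocess.check_call(["ssh", "-l", "root", master, cmd])

      def DistributeDataFile(master, path):
        # Create and restrict the file on the master, then let Ganeti copy
        # it to all nodes in one cluster-wide operation.
        _OnMaster(master, "touch %s" % path)
        _OnMaster(master, "chmod 0600 %s" % path)
        _OnMaster(master, "gnt-cluster copyfile %s" % path)

      def RemoveDataFile(master, path):
        # One cluster-wide command instead of one ssh call per node.
        _OnMaster(master, "gnt-cluster command rm -f %s" % path)
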
  4. Jan 10, 2011
  5. Jan 06, 2011
  6. Dec 20, 2010
  7. Dec 17, 2010
  8. Dec 16, 2010
  9. Dec 14, 2010
  10. Dec 13, 2010
  11. Dec 10, 2010
  12. Dec 09, 2010
  13. Dec 08, 2010
  14. Dec 01, 2010
  15. Nov 30, 2010
    • Further cleanups on QA · 7d88f255
      Iustin Pop authored
      
      This is more of an RFC. The patch attempts to address two issues:
      
      - running conditional tests is ugly right now
      - we don't know what tests we skipped
      
      By using the new RunTestIf, we solve both. But a significant number
      of test decisions are more complex than just “is the test enabled”,
      so those still have to be run via RunTest, which means we don't get
      logging when they're not run. Hence the logging is not complete…
      Suggestions on how to solve this are welcome.
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: René Nussbaumer <rn@google.com>
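
      A minimal sketch of what a RunTestIf-style wrapper can look like; the
      config dict, the TestEnabled lookup and the log format here are
      assumptions for illustration, not the actual qa code:

      import datetime

      _CONFIG = {"tests": {"default": True, "instance-reboot": False}}  # hypothetical

      def TestEnabled(name):
        """Illustrative lookup of whether a named test is enabled in the QA config."""
        return _CONFIG["tests"].get(name, _CONFIG["tests"]["default"])

      def RunTest(fn, *args):
        print("%s start %s" % (datetime.datetime.now(), fn.__name__))
        fn(*args)

      def RunTestIf(testname, fn, *args):
        """Run ``fn`` only when ``testname`` is enabled; log the skip otherwise."""
        if TestEnabled(testname):
          RunTest(fn, *args)
        else:
          print("%s skipping %s (%s not enabled)"
                % (datetime.datetime.now(), fn.__name__, testname))
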
  16. Nov 17, 2010
  17. Nov 03, 2010
  18. Oct 28, 2010
  19. Oct 25, 2010
  20. Oct 20, 2010
  21. Oct 14, 2010
    • Brown-bag fix for leftover comment · 76917d97
      Iustin Pop authored
      
      I did forget this in the original patch. Sorry!
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
    • Rework QA interaction with the watcher · 8201b996
      Iustin Pop authored
      
      The interaction with the cron-launched watcher is a well-known failure mode of QA:
      
      ---- 2010-10-14 06:54:55.464839 time=0:00:56.764827 Test tools/move-instance
      
      For the following tests it's recommended to turn off the ganeti-watcher cronjob.
      
      ---- 2010-10-14 06:54:55.465255 start Test automatic restart of instance by ganeti-watcher
      …
      Error: Domain 'instance1' does not exist.
      Command: ssh -oEscapeChar=none -oBatchMode=yes -l root -t -oStrictHostKeyChecking=yes
        -oClearAllForwardings=yes -oForwardAgent=yes node2 'ganeti-watcher -d'
      2010-10-13 23:55:04,479:  pid=1659 ganeti-watcher:626
       ERROR Can't acquire lock on state file /var/lib/ganeti/watcher.data: File already locked
      ---- 2010-10-14 06:55:04.513948 time=0:00:09.048693 Test automatic restart of instance by ganeti-watcher
      
      In order to fix this, we disable the watcher during these tests and
      re-enable it afterwards. To protect against the watcher already being
      disabled, we enable it unconditionally at the start of the QA (we do
      want it enabled, in order to see the interaction between the watcher
      and the many creation/disk-replace jobs, etc.).
      
      Note: even after this patch, if a cron-launched watcher was started
      earlier and is still running during the test, we'll have locking
      issues. I think this is OK for now; we'll have to see how often that
      happens.
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>
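
      The log above doesn't show the mechanism used to disable and
      re-enable the watcher. One possible sketch, assuming the gnt-cluster
      watcher pause/continue subcommands are available and using an
      invented ssh helper (not the actual patch):

      import subprocess
      from contextlib import contextmanager

      def _MasterCommand(master, cmd):
        """Run a command on the cluster master via ssh (illustrative helper)."""
        subprocess.check_call(["ssh", "-l", "root", master, cmd])

      @contextmanager
      def PausedWatcher(master, duration="4h"):
        """Keep ganeti-watcher out of the way for the tests run inside the block."""
        _MasterCommand(master, "gnt-cluster watcher pause %s" % duration)
        try:
          yield
        finally:
          # Always re-enable: the rest of the QA wants the watcher running.
          _MasterCommand(master, "gnt-cluster watcher continue")
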
  22. Oct 08, 2010
    • Change QA log output · f89d59b9
      Iustin Pop authored
      
      Currently, the logging in QA doesn't show the duration of the various
      steps, and if that is needed one has to post-process the logs. This
      patch changes the output so that the log information is line-based
      (as opposed to block-based), such that it's easy to grep for all log
      lines:
      
      ./qa/ganeti-qa.py --yes-do-it qa.json  2>&1|grep ^----
      ---- 2010-10-08 14:40:21.730382 start Test SSH connection --------------
      ---- 2010-10-08 14:40:23.156633 time=0:00:01.426251 Test SSH connection
      ---- 2010-10-08 14:40:23.156735 start ICMP ping each node --------------
      ---- 2010-10-08 14:40:24.230479 time=0:00:01.073744 ICMP ping each node
      ---- 2010-10-08 14:40:24.230583 start Test availibility of Ganeti commands
      ---- 2010-10-08 14:40:32.314586 time=0:00:08.084003 Test availibility of Ganeti commands
      ---- 2010-10-08 14:40:32.314734 start gnt-node info --------------------
      ---- 2010-10-08 14:40:32.860884 time=0:00:00.546150 gnt-node info ------
      
      or just for the duration of the steps:
      ./qa/ganeti-qa.py --yes-do-it ../qa-mpgntac5.fra.json  2>&1|grep ^----.*time=
      ---- 2010-10-08 14:42:12.630067 time=0:00:01.239256 Test SSH connection
      ---- 2010-10-08 14:42:14.204393 time=0:00:01.574221 ICMP ping each node
      ---- 2010-10-08 14:42:22.170828 time=0:00:07.966331 Test availibility of Ganeti commands
      ---- 2010-10-08 14:42:22.701030 time=0:00:00.530037 gnt-node info ------
      
      This will help with identifying slow steps or even graphing the QA
      duration.
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>
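
      A small sketch of how the line-based timing output shown above can be
      produced; the helper names and the padding width are illustrative
      only:

      import datetime

      def _Banner(text):
        """Print one greppable '---- ' line, padded like the sample output above."""
        line = "---- %s %s " % (datetime.datetime.now(), text)
        print(line.ljust(72, "-"))

      def RunWithLog(name, fn, *args):
        """Run one QA step, logging its start time and its duration."""
        start = datetime.datetime.now()
        _Banner("start %s" % name)
        try:
          return fn(*args)
        finally:
          _Banner("time=%s %s" % (datetime.datetime.now() - start, name))
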
  23. Oct 07, 2010
    • Try again to fix the inter-cluster move QA test · 638a7266
      Iustin Pop authored
      
      This time, we re-establish the old primary/secondary nodes correctly.
      Unfortunately this now requires at least a 3-node cluster for DRBD
      instances, hence it's somewhat suboptimal, but… The other option
      would be to move the instance simply from p:s to s:p and then back to
      p:s, without involving a third node (for the DRBD case), but I think
      that moving it to a completely separate node is slightly better for
      testing.
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>
  24. Oct 06, 2010
    • QA: Fix instance move tests · 677e16eb
      Iustin Pop authored
      
      The instance move tests were moving the instance from node pair (A,_)
      to (B,A) and leaving it there. This patch makes sure that the first
      step moves the instance to (B,A) but the second one moves it back to
      (A,B), so that the instance ends up on the same primary node.
      
      The original secondary node is lost though, if I read the code
      correctly.
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>
  25. Sep 30, 2010
  26. Aug 19, 2010
  27. Aug 18, 2010
  28. Aug 10, 2010
  29. Jul 29, 2010
  30. Jul 26, 2010
  31. Jul 01, 2010
    • RAPI client: Switch to pycURL · 2a7c3583
      Michael Hanselmann authored
      
      Currently the RAPI client uses the urllib2 and httplib modules from
      Python's standard library. They're used with pyOpenSSL in a very fragile
      way, and there are known issues when receiving large responses from a RAPI
      server.
      
      By switching to PycURL we leverage the power and stability of the
      widely-used curl library (libcurl). This brings us much more flexibility
      than before, and timeouts were easily implemented (something that would
      have involved a lot of work with the built-in modules).
      
      There's one small drawback: Programs using libcurl have to call
      curl_global_init(3) (available as pycurl.global_init) while exactly one
      thread is running (e.g. before other threads) and are supposed to call
      curl_global_cleanup(3) (available as pycurl.global_cleanup) upon exiting.
      See the manpages for details. A decorator is provided to simplify this.
      
      Unittests for the new code are provided, increasing the test coverage of
      the RAPI client from 74% to 89%.
      
      Signed-off-by: Michael Hanselmann <hansmi@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
      Reviewed-by: Iustin Pop <iustin@google.com>
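
      The decorator mentioned above could look roughly like the sketch
      below; the decorator name and its placement are assumptions, but
      pycurl.global_init and pycurl.global_cleanup are the real bindings
      for curl_global_init(3) and curl_global_cleanup(3):

      import functools
      import pycurl

      def UsesCurlGlobalState(fn):
        """Initialise libcurl before any threads start and clean up on exit."""
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
          pycurl.global_init(pycurl.GLOBAL_ALL)
          try:
            return fn(*args, **kwargs)
          finally:
            pycurl.global_cleanup()
        return wrapper

      # Typical usage: decorate the program's entry point so both calls
      # happen while only the main thread is running.
      # @UsesCurlGlobalState
      # def Main():
      #   ...
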
    • qa: shutdown instance before trying disk convert · f9f0ce7f
      Guido Trotter authored
      
      Because we have to. :)
      
      Signed-off-by: Guido Trotter <ultrotter@google.com>
      Reviewed-by: Iustin Pop <iustin@google.com>