1. 19 Mar, 2012 1 commit
  2. 31 Jan, 2012 1 commit
  3. 26 Jan, 2012 1 commit
  4. 25 Jan, 2012 1 commit
    • Fix cluster verification issues on multi-group clusters · 2c2f257d
      Michael Hanselmann authored
      
      
      This patch attempts to fix a number of issues with “gnt-cluster verify”
      in the presence of multiple node groups and DRBD8 instances split over
      nodes in more than one group.
      
      - Look up instances in a group only by their primary node (otherwise
        split instances would be considered when verifying any of their
        nodes' groups)
      - When gathering additional nodes for LV checks, compare the
        instance's nodes' groups with the currently verified group instead
        of comparing against the primary node's group
      - Exclude nodes in other groups when calculating N+1 errors and checking
        logical volumes
      
      Not directly related, but an error message is also clarified.
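
      A minimal sketch of the primary-node rule above, assuming a plain
      mapping of instances to (primary node, secondary nodes) and of nodes
      to group UUIDs; none of these names are Ganeti's actual internals:

        # Illustrative data model, not Ganeti's: instances maps a name to
        # (primary_node, [secondary_nodes]); node_group maps a node name
        # to its group UUID.
        def instances_for_group(instances, node_group, group_uuid):
            """Pick instances by primary node only, so a DRBD instance
            split over two groups is attributed to exactly one group."""
            return [name for name, (primary, _) in instances.items()
                    if node_group[primary] == group_uuid]

        def nodes_for_checks(nodes, node_group, group_uuid):
            """Restrict N+1 and LV checks to nodes of the verified group."""
            return [node for node in nodes
                    if node_group[node] == group_uuid]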
      Signed-off-by: Michael Hanselmann <hansmi@google.com>
      Reviewed-by: Iustin Pop <iustin@google.com>
  5. 20 Jan, 2012 1 commit
  6. 06 Jan, 2012 1 commit
  7. 21 Dec, 2011 2 commits
  8. 30 Nov, 2011 1 commit
  9. 24 Nov, 2011 5 commits
    • ConfigWriter: Fix epydoc error · 1d4930b9
      Michael Hanselmann authored
      
      
      The parameter is called “mods”, not “modes”.
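
      For illustration only, the kind of mismatch epydoc rejects (a sketch,
      not the actual ConfigWriter code):

        def _ApplyMods(mods):
            """Apply configuration modifications.

            @param mods: modifications to apply; documenting this parameter
                as "modes" would not match the signature and makes epydoc
                report an error

            """
            for mod in mods:
                pass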
      Signed-off-by: Michael Hanselmann <hansmi@google.com>
      Reviewed-by: Andrea Spadaccini <spadaccio@google.com>
      (cherry picked from commit 1730d4a1)
    • ConfigWriter: Fix epydoc error · 1730d4a1
      Michael Hanselmann authored
      
      
      The parameter is called “mods”, not “modes”.
      Signed-off-by: Michael Hanselmann <hansmi@google.com>
      Reviewed-by: Andrea Spadaccini <spadaccio@google.com>
    • LUGroupAssignNodes: Fix node membership corruption · 54c31fd3
      Michael Hanselmann authored
      
      
      Note: This bug only manifests itself in Ganeti 2.5, but since the
      problematic code also exists in 2.4, I decided to fix it there.
      
      If a node was assigned to a new group using “gnt-group assign-nodes” the
      node object's group would be changed, but not the duplicate member list
      in the group object. The latter is an optimization to require fewer
      locks for other operations. The per-group member list is only kept in
      memory and not written to disk.
      
      Ganeti 2.5 starts to make use of the data kept in the per-group member
      list and consequently fails when it is out of date. The following
      commands can be used to reproduce the issue in 2.5 (in 2.4 the issue was
      confirmed using additional logging):
      
        $ gnt-group add foo
        $ gnt-group assign-nodes foo $(gnt-node list --no-header -o name)
        $ gnt-cluster verify  # Fails with KeyError
      
      This patch moves the code modifying node and group objects into
      “config.ConfigWriter” to do the complete operation under the config
      lock, and also to avoid making use of side-effects of modifying objects
      without calling “ConfigWriter.Update”. A unittest is included.
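
      The shape of the change, as a rough sketch; the class, method and
      attribute names here are illustrative and not necessarily the ones
      used in config.ConfigWriter:

        import threading

        class ConfigSketch(object):
            """Toy model: node objects carry a group UUID, group objects
            carry an in-memory member list."""

            def __init__(self, nodes, groups):
                self._lock = threading.Lock()
                self._nodes = nodes    # name -> node object (with .group)
                self._groups = groups  # uuid -> group object (with .members)

            def AssignGroupNodes(self, node_names, target_uuid):
                """Move nodes to another group under the config lock."""
                with self._lock:
                    for name in node_names:
                        node = self._nodes[name]
                        if node.group == target_uuid:
                            continue
                        # Update the per-group member lists together with
                        # the node object so they cannot go out of sync.
                        self._groups[node.group].members.remove(name)
                        self._groups[target_uuid].members.append(name)
                        node.group = target_uuid
                    # The real code would now write the node objects to
                    # disk; the member lists are kept in memory only.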
      Signed-off-by: Michael Hanselmann <hansmi@google.com>
      Reviewed-by: Iustin Pop <iustin@google.com>
      (cherry picked from commit 218f4c3d)
    • LUGroupAssignNodes: Fix node membership corruption · 218f4c3d
      Michael Hanselmann authored
      
      
      Note: This bug only manifests itself in Ganeti 2.5, but since the
      problematic code also exists in 2.4, I decided to fix it there.
      
      If a node was assigned to a new group using “gnt-group assign-nodes” the
      node object's group would be changed, but not the duplicate member list
      in the group object. The latter is an optimization to require fewer
      locks for other operations. The per-group member list is only kept in
      memory and not written to disk.
      
      Ganeti 2.5 starts to make use of the data kept in the per-group member
      list and consequently fails when it is out of date. The following
      commands can be used to reproduce the issue in 2.5 (in 2.4 the issue was
      confirmed using additional logging):
      
        $ gnt-group add foo
        $ gnt-group assign-nodes foo $(gnt-node list --no-header -o name)
        $ gnt-cluster verify  # Fails with KeyError
      
      This patch moves the code modifying node and group objects into
      “config.ConfigWriter” to do the complete operation under the config
      lock, and also to avoid making use of side-effects of modifying objects
      without calling “ConfigWriter.Update”. A unittest is included.
      Signed-off-by: Michael Hanselmann <hansmi@google.com>
      Reviewed-by: Iustin Pop <iustin@google.com>
    • Fix pylint warning on unreachable code · 9c4f4dd6
      Michael Hanselmann authored
      Commit c50452c3 added an exception when all instances should be
      evacuated off a node, but did so in a way which made pylint complain
      about unreachable code.
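
      The warning in question is pylint's "unreachable code" check (W0101);
      an illustrative snippet of the pattern, not the actual Ganeti code:

        def _GetEvacuationMode():
            raise NotImplementedError("evacuating all instances is not"
                                      " supported yet")
            return None  # pylint flags this line as unreachable (W0101)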
      Signed-off-by: Michael Hanselmann <hansmi@google.com>
      Reviewed-by: Iustin Pop <iustin@google.com>
  10. 23 Nov, 2011 3 commits
  11. 15 Nov, 2011 1 commit
  12. 14 Nov, 2011 1 commit
  13. 08 Nov, 2011 1 commit
  14. 04 Nov, 2011 2 commits
  15. 27 Oct, 2011 1 commit
  16. 20 Oct, 2011 1 commit
  17. 19 Oct, 2011 2 commits
  18. 18 Oct, 2011 1 commit
  19. 17 Oct, 2011 1 commit
  20. 12 Oct, 2011 1 commit
    • rpc: Disable HTTP client pool and reduce memory consumption · 05927995
      Michael Hanselmann authored
      
      
      We noticed that “ganeti-masterd” can use large amounts of memory,
      especially on large clusters. Measurements showed a single PycURL client
      using about 500 kB of heap memory (the actual usage depends on versions,
      build options and settings).
      
      The RPC client uses a per-thread HTTP client pool with one client per
      node. At this time there are 41 non-main threads (25 for the job queue
      and 16 for client requests). This means the HTTP client pools use a lot
      of memory (ca. 200 MB for 10 nodes, ca. 1 GB for 50 nodes).
      
      This patch disables the per-thread HTTP client pool. No cleanup of
      unused code is done. That will be done in the master branch only.
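
      The figures above follow directly from multiplying the quoted
      numbers; a small back-of-the-envelope check:

        # One PycURL client per node, per non-main thread, at the measured
        # ~500 kB of heap each (figures from the commit message).
        threads = 25 + 16        # job queue + client request threads
        per_client_kb = 500

        for nodes in (10, 50):
            total_mb = threads * nodes * per_client_kb / 1024.0
            print("%2d nodes: ~%.0f MB" % (nodes, total_mb))
        # 10 nodes: ~200 MB, 50 nodes: ~1000 MB (about 1 GB)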
      Signed-off-by: Michael Hanselmann <hansmi@google.com>
      Reviewed-by: Iustin Pop <iustin@google.com>
  21. 04 Oct, 2011 1 commit
  22. 03 Oct, 2011 2 commits
  23. 30 Sep, 2011 4 commits
    • LUClusterVerifyGroup: Spread SSH checks over more nodes · 64c7b383
      Michael Hanselmann authored
      
      
      When verifying a group the code would always check SSH to all nodes in
      the same group, as well as the first node for every other group. On big
      clusters this can cause issues since many nodes will try to connect to
      the first node of another group at the same time. This patch changes the
      algorithm to choose a different node every time.
      
      A unittest for the selection algorithm is included.
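
      A sketch of such a rotating choice, with illustrative names (the
      actual selection logic and its unittest live in the cluster verify
      code):

        def pick_ssh_check_nodes(group_members, own_group, round_no):
            """group_members: dict of group UUID -> sorted node names.
            Returns one node per foreign group, rotating with round_no so
            repeated runs do not always hit each group's first node."""
            chosen = []
            for group, nodes in sorted(group_members.items()):
                if group == own_group or not nodes:
                    continue
                chosen.append(nodes[round_no % len(nodes)])
            return chosen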
      Signed-off-by: Michael Hanselmann <hansmi@google.com>
      Reviewed-by: Iustin Pop <iustin@google.com>
    • Optimise cli.JobExecutor with many pending jobs · 11705e3d
      Iustin Pop authored
      
      
      When we submit many pending jobs (> 100) to the masterd, the
      JobExecutor 'spams' the master daemon with requests for the status of
      all the jobs, even though in the end it will only choose a single job
      for polling.
      
      This is very sub-optimal, because when the master is busy processing
      small/fast jobs, this query forces reading all the jobs from disk.
      Restricting the 'window' of jobs that we query from the entire set to
      a smaller subset makes a huge difference (masterd only, 0s delay
      jobs, all jobs on tmpfs thus no I/O involved):
      
      - submitting/waiting for 500 jobs:
        - before: ~21 s
        - after:   ~5 s
      - submitting/waiting for 1K jobs:
        - before: ~76 s
        - after:   ~8 s
      
      This is with a batch of 25 jobs; with a batch of 50 jobs, the 1K-job
      case goes from ~8 s to ~12 s. I think that choosing the 'best' job
      for nice output only matters with a small number of jobs, and that
      beyond that people will not actually watch the jobs. So changing from
      'perfect job' to 'best job in the first 25' should be OK.
      
      Note that most jobs won't execute as fast as 0 delay, but this is
      still a good improvement.
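
      The windowing idea, as a hedged sketch; the helper names are assumed
      and are not cli.JobExecutor's real interface:

        BATCH = 25  # only this many pending jobs are queried at a time

        def choose_job_to_poll(pending_job_ids, query_status):
            """query_status is assumed to return one status string per job
            ID for the given list of IDs."""
            window = pending_job_ids[:BATCH]
            statuses = query_status(window)
            # Pick the "best" job only within the window: prefer one that
            # is already running, otherwise fall back to the first one.
            for job_id, status in zip(window, statuses):
                if status == "running":
                    return job_id
            return window[0]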
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>
    • 9dc45ab1
    • utils.log: Write error messages to stderr · 34aa8b7c
      Michael Hanselmann authored
      
      
      When “gnt-cluster copyfile” failed it would only print “Copy of file …
      to node … failed”, while the detailed message was written using
      logging.error. Also writing error messages to stderr can be helpful in
      figuring out what went wrong (the messages go to the log file as well,
      but not everyone might know about it).
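
      A minimal sketch of the behaviour described, using the standard
      logging module rather than the actual utils.log setup:

        import logging
        import sys

        # Send ERROR and above to stderr in addition to any file handler,
        # so failures are visible on the terminal, not only in the log.
        handler = logging.StreamHandler(sys.stderr)
        handler.setLevel(logging.ERROR)
        handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
        logging.getLogger().addHandler(handler)

        logging.error("Copy of file %s to node %s failed", "/some/file",
                      "node1.example.com")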
      Signed-off-by: Michael Hanselmann <hansmi@google.com>
      Reviewed-by: Iustin Pop <iustin@google.com>
  24. 28 Sep, 2011 2 commits
  25. 22 Sep, 2011 2 commits