  1. 11 Apr, 2012 6 commits
  2. 30 Mar, 2012 1 commit
  3. 29 Mar, 2012 1 commit
    • Fix a bug concerning TCP port release · 3b3b1bca
      Dimitris Aragiorgis authored
      Commit f396ad8c returns the TCP port used by a DRBD disk to the
      TCP/UDP port pool using AddTcpUdpPort().
      
      However, AddTcpUdpPort() writes the config on every invocation,
      using _WriteConfig(). This causes two problems:
      
       * it causes VerifyConfig() to log critical errors after the DRBD
         disk removal and until the actual instance removal.
       * if the code following AddTcpUdpPort() fails, the port has already
         been returned to the pool, so the port ends up duplicated
         (inconsistent config).
      
      AddTcpUdpPort() is invoked in three cases:
      
       * during InstanceRemove() through _RemoveDisks().
       * during InstanceSetParams() in case of disk removal.
       * during InstanceSetParams() through _ConvertDrbdToPlain().
      
      This commit fixes the problem by removing the _WriteConfig() call from
      AddTcpUdpPort(), delegating it to Update() via the
      TemporaryReservationManager and ensuring that AddTcpUdpPort() precedes
      Update().
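      The ordering described above can be illustrated with a minimal sketch.
      It is not Ganeti's code: ConfigSketch, add_tcp_udp_port(), update() and
      remove_drbd_disk() are hypothetical stand-ins, and the reservation
      manager is left out. The point is only that releasing the port changes
      in-memory state, and a single later update persists everything at once.

```python
class ConfigSketch:
    """Toy stand-in for a cluster config (hypothetical, not ConfigWriter)."""

    def __init__(self):
        self.port_pool = set()   # free TCP/UDP ports
        self.disks = {}          # disk name -> port currently in use
        self.writes = 0          # how many times the config was "written"

    def _write_config(self):
        # A real implementation would verify and serialize the config here.
        self.writes += 1

    def add_tcp_udp_port(self, port):
        # Only touch in-memory state; persisting is left to update().
        self.port_pool.add(port)

    def update(self):
        # The single point where the modified config is written out.
        self._write_config()


def remove_drbd_disk(cfg, disk_name):
    """Remove a disk and release its port with exactly one config write."""
    port = cfg.disks.pop(disk_name)
    cfg.add_tcp_udp_port(port)   # must precede update()
    cfg.update()                 # disk removal and port release persist together
```

      With this shape, a failure between the port release and the update
      leaves nothing half-written on disk, which is the property the commit
      message is after.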
      Signed-off-by: Dimitris Aragiorgis <dimara@grnet.gr>
      [iustin@google.com: small comments adjustments]
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Iustin Pop <iustin@google.com>
  4. 28 Mar, 2012 1 commit
  5. 23 Mar, 2012 2 commits
  6. 22 Mar, 2012 3 commits
  7. 21 Mar, 2012 1 commit
  8. 20 Mar, 2012 1 commit
    • Stop acquiring BGL for LUXI queries · 0fa753ba
      Michael Hanselmann authored
      
      
      Short description: This fixes an issue whereby masterd would become
      unresponsive on the LUXI socket, leading to client timeouts. While made
      worse in 2.5, the underlying issue was already present in 2.4.
      
      Longer description: Until now all LUXI queries would acquire the BGL
      (big Ganeti lock) in shared mode. With the exception of OpNodeAdd and
      OpNodeRemove, this was also the case for all opcodes before version 2.5.
      In 2.5 we split OpClusterVerify into multiple opcodes, one of which
      (OpClusterVerifyConfig) now acquires the BGL in exclusive mode. Whether
      or not doing so is good is a separate discussion: OpNodeAdd and
      OpNodeRemove, as of this writing, still require an exclusive BGL.
      OpClusterVerifyConfig is run more often than OpNodeAdd or OpNodeRemove
      in normal clusters, which is why we only recognized this issue in 2.5.
      
      Once OpClusterVerifyConfig tried to acquire the BGL in exclusive mode
      while it was held by other opcodes (e.g. OpInstanceReplaceDisks), the
      locking code would stop granting shared acquires for the BGL, even
      while the exclusive acquire was removed from the queue for a short
      amount of time after a timeout. This behaviour is necessary to prevent
      lock starvation.
      
      In this situation further LUXI queries requiring the BGL in shared mode,
      e.g. OpClusterQuery, would block and the client would eventually time out.
      Over time the blocked queries fill the client request workerpool's queue,
      and at that point even requests not requiring the BGL stop working. Once
      the long-running operation(s) holding the BGL in shared mode finished,
      OpClusterVerifyConfig gets it in exclusive mode and everything returns
      to normal. LUXI recovers very soon too.
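
      The blocking behaviour described above can be reproduced with a small
      single-threaded model. This is a sketch under stated assumptions, not
      Ganeti's locking code: SharedLockModel and its methods are hypothetical,
      and the model simplifies the timeout handling by keeping the exclusive
      request marked as pending until it is granted.

```python
class SharedLockModel:
    """Single-threaded model of a shared/exclusive lock (illustration only)."""

    def __init__(self):
        self.shared_holders = set()
        self.exclusive_holder = None
        self.exclusive_pending = False

    def try_acquire_shared(self, owner):
        # New shared acquires are refused while an exclusive request is
        # pending; otherwise the exclusive requester could starve.
        if self.exclusive_holder is not None or self.exclusive_pending:
            return False
        self.shared_holders.add(owner)
        return True

    def try_acquire_exclusive(self, owner):
        if self.exclusive_holder is None and not self.shared_holders:
            self.exclusive_holder = owner
            self.exclusive_pending = False
            return True
        self.exclusive_pending = True
        return False

    def release(self, owner):
        self.shared_holders.discard(owner)
        if self.exclusive_holder == owner:
            self.exclusive_holder = None


def demo():
    """Replay the sequence from the commit message and record each result."""
    bgl = SharedLockModel()
    results = []
    results.append(bgl.try_acquire_shared("OpInstanceReplaceDisks"))
    results.append(bgl.try_acquire_exclusive("OpClusterVerifyConfig"))
    results.append(bgl.try_acquire_shared("OpClusterQuery"))
    bgl.release("OpInstanceReplaceDisks")
    results.append(bgl.try_acquire_exclusive("OpClusterVerifyConfig"))
    bgl.release("OpClusterVerifyConfig")
    results.append(bgl.try_acquire_shared("OpClusterQuery"))
    return results
```

      In the model, the query is refused for as long as the long-running job
      holds the lock in shared mode and the exclusive request is pending.
      Queries that never ask for the lock, as after this commit, are not
      affected by that window at all.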
      
      I'd like to thank Bernardo Dal Seno for his contribution to this bugfix.
      Signed-off-by: Michael Hanselmann <hansmi@google.com>
      Reviewed-by: Bernardo Dal Seno <bdalseno@google.com>
  9. 19 Mar, 2012 1 commit
  10. 20 Feb, 2012 1 commit
    • Fix Makefile.am compatibility with automake 1.11.2 · b8fe7ca6
      Iustin Pop authored
      
      
      Automake 1.11.2 made the following change:
      
      * Long-standing bugs:
        - Automake now warns about more primary/directory invalid combinations,
          such as "doc_LIBRARIES" or "pkglib_PROGRAMS".
      
      Unfortunately, this breaks our Makefile.am (issue 216) exactly because
      we were relying on pkglib_SCRIPTS.
      
      This patch works around this by adding a new myexeclibdir variable
      ("exec" in the name so that it is installed at `install-exec` time,
      the same as the pkglibdir), and switching to that.
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
  11. 31 Jan, 2012 1 commit
  12. 26 Jan, 2012 3 commits
  13. 25 Jan, 2012 1 commit
    • Fix cluster verification issues on multi-group clusters · 2c2f257d
      Michael Hanselmann authored
      
      
      This patch attempts to fix a number of issues with “gnt-cluster verify”
      in the presence of multiple node groups and DRBD8 instances split over
      nodes in more than one group.
      
      - Look up instances in a group only by their primary node (otherwise
        split instances would be considered when verifying any of their nodes'
        groups)
      - When gathering additional nodes for LV checks, compare the groups of
        the instance's nodes with the currently verified group instead of
        comparing against the primary node's group
      - Exclude nodes in other groups when calculating N+1 errors and checking
        logical volumes
      
      Not directly related, but a small error text is also clarified.
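
      The first two points in the list above can be sketched as follows. The
      data model is an assumption made for illustration: Node, Instance,
      instances_in_group() and extra_lv_nodes() are hypothetical names and do
      not correspond to Ganeti's objects or functions.

```python
from collections import namedtuple

# Hypothetical, simplified records; real cluster objects carry far more state.
Node = namedtuple("Node", ["name", "group"])
Instance = namedtuple("Instance", ["name", "primary_node", "secondary_nodes"])


def instances_in_group(instances, nodes, group):
    """Return names of instances whose primary node is in the given group."""
    return sorted(inst.name for inst in instances
                  if nodes[inst.primary_node].group == group)


def extra_lv_nodes(instances, nodes, group):
    """Return nodes outside the group that hold disks of the group's instances.

    Each node's group is compared with the group being verified, not with
    the group of the instance's primary node.
    """
    extra = set()
    for inst in instances:
        if nodes[inst.primary_node].group != group:
            continue  # instance belongs to another group's verification
        for node_name in inst.secondary_nodes:
            if nodes[node_name].group != group:
                extra.add(node_name)
    return sorted(extra)
```

      A split instance with its primary node in one group and its secondary
      in another is then verified exactly once, with the foreign secondary
      listed only as an additional node for the LV checks.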
      Signed-off-by: Michael Hanselmann <hansmi@google.com>
      Reviewed-by: Iustin Pop <iustin@google.com>
  14. 20 Jan, 2012 1 commit
  15. 09 Jan, 2012 2 commits
  16. 06 Jan, 2012 1 commit
  17. 21 Dec, 2011 2 commits
  18. 30 Nov, 2011 2 commits
  19. 24 Nov, 2011 5 commits
    • ConfigWriter: Fix epydoc error · 1d4930b9
      Michael Hanselmann authored
      
      
      The parameter is called “mods”, not “modes”.
      Signed-off-by: Michael Hanselmann <hansmi@google.com>
      Reviewed-by: Andrea Spadaccini <spadaccio@google.com>
      (cherry picked from commit 1730d4a1)
    • ConfigWriter: Fix epydoc error · 1730d4a1
      Michael Hanselmann authored
      
      
      The parameter is called “mods”, not “modes”.
      Signed-off-by: Michael Hanselmann <hansmi@google.com>
      Reviewed-by: Andrea Spadaccini <spadaccio@google.com>
    • LUGroupAssignNodes: Fix node membership corruption · 54c31fd3
      Michael Hanselmann authored
      
      
      Note: This bug only manifests itself in Ganeti 2.5, but since the
      problematic code also exists in 2.4, I decided to fix it there.
      
      If a node was assigned to a new group using “gnt-group assign-nodes” the
      node object's group would be changed, but not the duplicate member list
      in the group object. The latter is an optimization to require fewer
      locks for other operations. The per-group member list is only kept in
      memory and not written to disk.
      
      Ganeti 2.5 starts to make use of the data kept in the per-group member
      list and consequently fails when it is out of date. The following
      commands can be used to reproduce the issue in 2.5 (in 2.4 the issue was
      confirmed using additional logging):
      
        $ gnt-group add foo
        $ gnt-group assign-nodes foo $(gnt-node list --no-header -o name)
        $ gnt-cluster verify  # Fails with KeyError
      
      This patch moves the code modifying node and group objects into
      “config.ConfigWriter” to do the complete operation under the config
      lock, and also to avoid making use of side-effects of modifying objects
      without calling “ConfigWriter.Update”. A unittest is included.
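
      The idea of changing both views in one locked operation can be sketched
      like this. It is a simplified model under stated assumptions:
      GroupConfigSketch, add_group() and assign_group_nodes() are hypothetical
      names, and a plain threading.Lock stands in for the config lock.

```python
import threading


class GroupConfigSketch:
    """Toy config keeping node->group and group->members views in sync."""

    def __init__(self, node_groups):
        # node_groups: mapping of node name -> group name
        self._lock = threading.Lock()
        self.node_group = dict(node_groups)
        self.members = {}
        for node, group in self.node_group.items():
            self.members.setdefault(group, set()).add(node)

    def add_group(self, group):
        with self._lock:
            self.members.setdefault(group, set())

    def assign_group_nodes(self, mods):
        """Apply (node, new_group) pairs as one operation under the lock."""
        with self._lock:
            # Validate first so that a bad entry leaves the config untouched.
            for node, new_group in mods:
                if node not in self.node_group:
                    raise KeyError("unknown node: %s" % node)
                if new_group not in self.members:
                    raise KeyError("unknown group: %s" % new_group)
            for node, new_group in mods:
                old_group = self.node_group[node]
                self.members[old_group].discard(node)
                self.members[new_group].add(node)
                self.node_group[node] = new_group
```

      Because the node's group and the per-group member list are only ever
      changed together, a reader of the member list cannot observe the stale
      state that led to the KeyError in the reproduction steps.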
      Signed-off-by: Michael Hanselmann <hansmi@google.com>
      Reviewed-by: Iustin Pop <iustin@google.com>
      (cherry picked from commit 218f4c3d)
    • LUGroupAssignNodes: Fix node membership corruption · 218f4c3d
      Michael Hanselmann authored
      
      
      Note: This bug only manifests itself in Ganeti 2.5, but since the
      problematic code also exists in 2.4, I decided to fix it there.
      
      If a node was assigned to a new group using “gnt-group assign-nodes” the
      node object's group would be changed, but not the duplicate member list
      in the group object. The latter is an optimization to require fewer
      locks for other operations. The per-group member list is only kept in
      memory and not written to disk.
      
      Ganeti 2.5 starts to make use of the data kept in the per-group member
      list and consequently fails when it is out of date. The following
      commands can be used to reproduce the issue in 2.5 (in 2.4 the issue was
      confirmed using additional logging):
      
        $ gnt-group add foo
        $ gnt-group assign-nodes foo $(gnt-node list --no-header -o name)
        $ gnt-cluster verify  # Fails with KeyError
      
      This patch moves the code modifying node and group objects into
      “config.ConfigWriter” to do the complete operation under the config
      lock, and also to avoid making use of side-effects of modifying objects
      without calling “ConfigWriter.Update”. A unittest is included.
      Signed-off-by: Michael Hanselmann <hansmi@google.com>
      Reviewed-by: Iustin Pop <iustin@google.com>
    • Fix pylint warning on unreachable code · 9c4f4dd6
      Michael Hanselmann authored
      Commit c50452c3 added an exception when all instances should be
      evacuated off a node, but did so in a way which made pylint complain
      about unreachable code.
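
      The affected code is not shown here, so the following is only a generic
      illustration of how this kind of warning typically arises and how it is
      usually resolved. Both functions and their return values are
      hypothetical; pylint reports statements after an unconditional raise as
      W0101 (unreachable).

```python
def evacuate_before(evacuate_all):
    """Variant that triggers pylint's W0101 (unreachable) warning."""
    if evacuate_all:
        raise NotImplementedError("evacuating all instances is not supported")
        return None  # never executed: flagged by pylint as unreachable
    return "evacuate secondaries"


def evacuate_after(evacuate_all):
    """Same behaviour with the dead statement removed."""
    if evacuate_all:
        raise NotImplementedError("evacuating all instances is not supported")
    return "evacuate secondaries"
```

      Both variants behave identically at run time; only the second passes
      the lint check.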
      Signed-off-by: Michael Hanselmann <hansmi@google.com>
      Reviewed-by: Iustin Pop <iustin@google.com>
  20. 23 Nov, 2011 3 commits
  21. 16 Nov, 2011 1 commit
    • htools: rework message display construction · bdd8c739
      Iustin Pop authored
      
      
      While diagnosing some (unrelated) memory usage in htools, I
      stumbled upon some very bad behaviour in checkData: mapAccum is
      non-strict, and so is the tuple we use, which results in the list of
      lists of messages being very bad space-wise (hundreds of MB of memory
      for a simulated cluster with thousands of nodes, all with errors).
      
      The new, explicit reuse of the old message list has a linear memory
      behaviour. The only downside is that messages are listed in the
      reverse order (which I'll fix on master).
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>