  1. Jun 22, 2011
  2. Apr 28, 2011
  3. Apr 13, 2011
  4. Apr 06, 2011
    • Increase the lock timeouts before we block-acquire · d385a174
      Iustin Pop authored
      
      This has been observed to cause problems on real clusters via the
      following mechanism:
      
      - a long job (e.g. a replace-disks) is keeping an exclusive lock on an
        instance
      - the watcher starts and submits its query instances opcode which
        wants shared locks for all instances
      - after about an hour, the watcher job falls back to blocking acquire,
        after having acquired all other locks
      - any instance opcode that wants an exclusive lock for an instance
        cannot start until the watcher has finished, even though there's no
        actual operation on that instance
      
      To alleviate this problem, we simply increase the maximum timeout
      before lock acquires fall back to either a blocking acquire or a
      priority increase. The timeout is computed such that we wait ~10 hours
      (instead of one) before this happens, which should be within the
      maximum lifetime of a reasonable opcode on a healthy cluster. The new
      timeout also means that priority increases happen every half hour.
      
      We also increase the maximum wait interval to 15 seconds; with the
      larger total timeout, the old interval would result in too many
      retries.
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>
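      A minimal sketch of the mechanism described above, not Ganeti's actual
      locking code: timed acquires with growing per-attempt timeouts (capped
      at the new 15-second maximum wait interval) are tried until a total
      budget of roughly ten hours is exhausted, and only then does the caller
      fall back to a blocking acquire. The backoff factor and the helper name
      are assumptions made for illustration only (Python 3).
      
        import threading
        
        MAX_WAIT = 15.0            # cap on a single timed attempt
        TOTAL_BUDGET = 10 * 3600   # ~10 hours before the blocking acquire
        
        def acquire_with_backoff(lock, start_timeout=1.0):
            """Timed acquires with growing timeouts; block only as a last resort."""
            waited = 0.0
            timeout = start_timeout
            while waited < TOTAL_BUDGET:
                if lock.acquire(timeout=timeout):
                    return                      # acquired opportunistically
                waited += timeout
                # Grow the per-attempt timeout but never beyond MAX_WAIT; the
                # larger cap keeps the number of retries reasonable.
                timeout = min(timeout * 1.5, MAX_WAIT)
                # (Ganeti would also raise the job's priority periodically
                # while waiting; omitted in this sketch.)
            lock.acquire()                      # last resort: block indefinitely
        
        # Example: acquire_with_backoff(threading.Lock())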
  5. Mar 16, 2011
    • locking: Fix race condition in lock monitor · e4e35357
      Michael Hanselmann authored
      
      In some rare cases a lock can be re-created very soon after its
      deletion, while the old instance has not been destroyed yet. In such a
      case the code would detect a duplicate name and raise an exception.
      
      We have seen at least one case where this happened during the creation
      of many instances. It is not entirely clear how it came about, but it
      appears to have occurred while different jobs fought for locks with
      short timeouts (during instance creation, locks are added at this stage
      and removed shortly afterwards if not all locks can be acquired).
      
      The issue is fixed by removing the check for duplicate names. To still
      guarantee a stable sort order for the lock information as shown by
      “gnt-debug locks”, a registration number is recorded for each lock in
      the monitor.
      
      A unit test covering this situation is included.
      
      Signed-off-by: Michael Hanselmann <hansmi@google.com>
      Reviewed-by: Iustin Pop <iustin@google.com>
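      A minimal sketch of the approach described above, not Ganeti's actual
      lock monitor: instead of rejecting a registration whose name is still
      present, every lock is stored under a monotonically increasing
      registration number, which also gives the output of “gnt-debug locks”
      a stable ordering. The class layout and the “name” attribute of the
      registered locks are assumptions for illustration (Python 3).
      
        import itertools
        import threading
        import weakref
        
        class LockMonitor:
            """Tracks locks by registration number instead of by unique name."""
        
            def __init__(self):
                self._counter = itertools.count()  # next registration number
                self._lock = threading.Lock()
                self._locks = {}                   # registration number -> weakref
        
            def RegisterLock(self, lock):
                # No duplicate-name check: a lock re-created shortly after
                # deletion, whose predecessor has not been destroyed yet, is
                # simply registered under a new number.
                with self._lock:
                    self._locks[next(self._counter)] = weakref.ref(lock)
        
            def QueryLocks(self):
                # The stable sort order comes from the registration number.
                with self._lock:
                    entries = sorted(self._locks.items())
                names = []
                for _num, ref in entries:
                    lock = ref()
                    if lock is not None:           # skip destroyed locks
                        names.append(lock.name)    # assumed attribute
                return names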
  6. Mar 15, 2011
  7. Mar 08, 2011
    • cfgupgrade: Fix critical bug overwriting RAPI users file · 87c80992
      Michael Hanselmann authored
      
      The cfgupgrade tool was designed to be idempotent, that is, it can be
      run several times and still produce the correct result. Ganeti 2.4
      moved the file containing the RAPI users to a separate directory
      (…/lib/ganeti/rapi/users). If a file exists at the old location
      (…/lib/ganeti/rapi_users), cfgupgrade automatically moves it to the new
      path and replaces it with a symlink.
      
      Unfortunately one of the checks for this was incorrect; when run
      multiple times, cfgupgrade replaced the users file at the new location
      with the symlink created during a previous run.
      
      In addition, the “--dry-run” parameter to cfgupgrade was not respected.
      Unit tests are updated for all these cases.
      
      Signed-off-by: Michael Hanselmann <hansmi@google.com>
      Reviewed-by: Iustin Pop <iustin@google.com>
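      A minimal sketch of an idempotent version of the migration described
      above; this is not the real cfgupgrade code, and the function name is
      an assumption (the message only shows truncated paths, so they are
      passed in by the caller). The key points are that a symlink at the old
      location marks an already-completed migration and must not be moved
      again, and that --dry-run only reports the action.
      
        import os
        import shutil
        
        def migrate_rapi_users(old_path, new_path, dry_run=False):
            """Move the RAPI users file to its new location, idempotently."""
            # A symlink at the old path means a previous run already migrated
            # the file; moving it again would overwrite the real users file
            # with the symlink.
            if os.path.islink(old_path) or not os.path.exists(old_path):
                return
            if dry_run:
                print("Would move %s to %s and leave a symlink behind"
                      % (old_path, new_path))
                return
            os.makedirs(os.path.dirname(new_path), exist_ok=True)
            shutil.move(old_path, new_path)
            os.symlink(new_path, old_path)  # keep the old path working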
  8. Feb 23, 2011
  9. Feb 18, 2011
  10. Feb 17, 2011
    • NodeQuery: mark live fields as UNAVAIL for non-vm_capable nodes · effab4ca
      Iustin Pop authored
      
      Since the data is missing by design, UNAVAIL is appropriate here, while
      NODATA is not.
      
      The patch also adds a comment: if we extend the live fields list to
      contain other data in the future, we need to reevaluate this solution.
      
      This should fix issue 143. The listing now shows (node2 == offline,
      node3 == not vm_capable):
      
        Node     DTotal     DFree    MTotal     MNode     MFree Pinst Sinst
        node1    698.6G    630.5G     32.0G      1.0G     30.0G     8     7
        node2 (offline) (offline) (offline) (offline) (offline)     9     4
        node3 (unavail) (unavail) (unavail) (unavail) (unavail)     0     0
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>
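      The distinction between the field statuses can be sketched as below;
      this is not Ganeti's query framework, only an illustration of why
      UNAVAIL (data missing by design) differs from NODATA (data that should
      exist but could not be collected) and from OFFLINE. All names in the
      sketch are assumptions (Python 3).
      
        from enum import Enum
        
        class FieldStatus(Enum):
            OK = "ok"
            OFFLINE = "(offline)"  # node is marked offline, nothing queried
            UNAVAIL = "(unavail)"  # value does not exist by design
            NODATA = "(nodata)"    # value should exist but was not collected
        
        def live_field(node, value):
            """Return a (status, value) pair for a live field such as MFree."""
            if node.get("offline"):
                return (FieldStatus.OFFLINE, None)
            if not node.get("vm_capable"):
                # Non-vm_capable nodes never report live data, so the field
                # is unavailable rather than lacking data.
                return (FieldStatus.UNAVAIL, None)
            return (FieldStatus.OK, value)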
  11. Feb 02, 2011
  12. Jan 31, 2011
  13. Jan 28, 2011
  14. Jan 27, 2011
  15. Jan 21, 2011
  16. Jan 20, 2011
  17. Jan 18, 2011
  18. Jan 14, 2011