1. 21 Apr, 2011 1 commit
  2. 20 Apr, 2011 3 commits
  3. 19 Apr, 2011 3 commits
    • Fix master IP activation in failover with no-voting · 675e2bf5
      Iustin Pop authored
      
      
      Thanks to net.for.hub@gmail.com for reporting this. The logic in
      masterd.CheckMasterd did an early return in case of no_voting, hence
      skipping the master IP activation. We just change the ifs to not
      return but simply continue through the function.
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
    • disk wiping: fix bug in chunk size computation · 6e7f0cd9
      Iustin Pop authored
      
      
      The current wipe_chunk_size computation does min(int_value,
      float_value). For small disks (below 10GiB), the formula results in
      the float value being chosen. This leads to very interesting
      behaviour:
      
      Wiping disk 0, offset 102.4, chunk 102.4
      Wiping disk 0, offset 204.8, chunk 102.4
      …
      Wiping disk 0, offset 921.6, chunk 102.4
      Wiping disk 0, offset 1024.0, chunk 1.13686837722e-13
      
      Since these are passed to dd via %d, this results in a call to dd
      specifying offset 1024 and count 0, which fails.
      
      We just need to enforce conversion to int, in order to not get bitten
      by floating point rounding errors.
      
      The patch also reorders some logging messages in order to log the
      chunk size.
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>
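The rounding problem described in this commit can be reproduced with a short sketch. The constants `MAX_WIPE_CHUNK` and `MIN_WIPE_CHUNK_PERCENT`, the function names, and the shape of the loop are assumptions chosen to match the numbers in the log above (a 1024 MiB disk wiped in 102.4 MiB chunks); they are not taken from the Ganeti source.

```python
# Assumed values, chosen so that disks below 10 GiB pick the percentage branch.
MAX_WIPE_CHUNK = 1024          # MiB (hypothetical constant)
MIN_WIPE_CHUNK_PERCENT = 10    # percent of the disk size (hypothetical constant)


def chunk_size(size_mib, force_int):
    """Return the wipe chunk size in MiB for a disk of size_mib."""
    chunk = min(MAX_WIPE_CHUNK, size_mib / 100.0 * MIN_WIPE_CHUNK_PERCENT)
    return int(chunk) if force_int else chunk


def wipe_plan(size_mib, chunk):
    """List the (offset, length) pairs a wipe loop would hand to dd."""
    plan = []
    offset = 0
    while offset < size_mib:
        length = min(chunk, size_mib - offset)
        plan.append((offset, length))
        offset += length
    return plan


if __name__ == "__main__":
    buggy = wipe_plan(1024, chunk_size(1024, force_int=False))
    fixed = wipe_plan(1024, chunk_size(1024, force_int=True))
    # The float plan ends with a tiny leftover length that "%d" turns into 0.
    print("float: last step offset=%d count=%d" % buggy[-1])
    # The int plan only ever contains whole, non-zero lengths.
    print("int:   last step offset=%d count=%d" % fixed[-1])
```

With the float chunk size, ten additions of 102.4 stop just short of 1024, so the loop runs one extra time with a vanishingly small length. Converting the chunk size to an integer keeps every offset and length integral and avoids the zero-count call.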
    • Fix bug in watcher · a0aa6b49
      Michael Hanselmann authored
      
      
      If “utils.RunParts” were to raise an exception, a log message was
      written and the code continued to run. Due to the exception the
      “results” variable would not be defined.
      
      Also change the code to log a backtrace (getting an exception is rather
      unlikely and having a backtrace is useful) and update one comment.
      Signed-off-by: Michael Hanselmann <hansmi@google.com>
      Reviewed-by: René Nussbaumer <rn@google.com>
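The failure mode in the watcher commit is a common Python pitfall and can be sketched as follows. The function names and the `run_parts` parameter (a stand-in for “utils.RunParts”) are hypothetical, and returning an empty list in the fixed variant is an assumption; the commit message only states that a backtrace is now logged.

```python
import logging


def run_hooks_buggy(run_parts, hooks_dir):
    """Pre-fix shape: the exception is logged, then execution falls through."""
    try:
        results = run_parts(hooks_dir)
    except Exception as err:
        logging.warning("RunParts %s failed: %s", hooks_dir, err)

    # If run_parts raised, "results" was never assigned and this line fails.
    return [name for (name, _status) in results]


def run_hooks_fixed(run_parts, hooks_dir):
    """Post-fix shape: log a backtrace and stop before touching "results"."""
    try:
        results = run_parts(hooks_dir)
    except Exception:
        # logging.exception records the message together with the traceback.
        logging.exception("RunParts %s failed", hooks_dir)
        return []

    return [name for (name, _status) in results]
```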
  4. 14 Apr, 2011 1 commit
  5. 13 Apr, 2011 4 commits
  6. 08 Apr, 2011 1 commit
  7. 07 Apr, 2011 1 commit
  8. 06 Apr, 2011 2 commits
    • LUInstanceQueryData: Don't acquire locks unless requested · dae661a4
      Michael Hanselmann authored
      
      
      Until now LUInstanceQueryData always acquired locks for the instance(s)
      and nodes involved. In combination with long-running operations this
      prevented the use of “gnt-instance info”, even with the “--static”
      option. With this patch, locks are only acquired when explicitly
      requested in the opcode (like all query operations).
      Signed-off-by: Michael Hanselmann <hansmi@google.com>
      Reviewed-by: Iustin Pop <iustin@google.com>
    • Increase the lock timeouts before we block-acquire · d385a174
      Iustin Pop authored
      
      
      The shorter timeouts used so far have been observed to cause problems
      on real clusters via the following mechanism:
      
      - a long job (e.g. a replace-disks) is keeping an exclusive lock on an
        instance
      - the watcher starts and submits its query instances opcode which
        wants shared locks for all instances
      - after about an hour, the watcher job falls back to blocking acquire,
        after having acquired all other locks
      - any instance opcode that wants an exclusive lock for an instance
        cannot start until the watcher has finished, even though there's no
        actual operation on that instance
      
      In order to alleviate this problem, we simply increase the maximum
      time a lock acquire may take before it falls back to either a blocking
      acquire or a priority increase. The timeout is computed such that we wait ~10 hours
      (instead of one) for this to happen, which should be within the
      maximum lifetime of a reasonable opcode on a healthy cluster. The
      timeout also means that priority increases will happen every half hour.
      
      We also increase the max wait interval to 15 seconds, otherwise we'd
      have too many retries with the increased interval.
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>
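The arithmetic behind the new timeouts can be illustrated with a small sketch. Only the 15 second cap, the half-hour interval between priority increases and the ~10 hour total come from the commit message; the growth rule, the 1 second starting value, the count of 20 priority steps and all names below are assumptions made for illustration.

```python
def attempt_timeouts(total, min_wait=1.0, max_wait=15.0):
    """Return per-attempt timeouts (seconds) whose sum reaches "total".

    The growth rule is an assumption; only the 15 second cap is taken
    from the commit message.
    """
    result = [min_wait]
    running_sum = min_wait
    while running_sum < total:
        timeout = (result[-1] * 1.05) ** 1.25   # assumed growth curve
        timeout = max(min_wait, min(timeout, max_wait))
        result.append(timeout)
        running_sum += timeout
    return result


# Half an hour of timed attempts per priority level; 20 such levels would
# add up to the ~10 hours mentioned above (20 is an assumed count).
PER_PRIORITY_SECONDS = 30 * 60
PRIORITY_STEPS = 20

if __name__ == "__main__":
    timeouts = attempt_timeouts(PER_PRIORITY_SECONDS)
    print("attempts per priority level:", len(timeouts))
    print("hours until the final fallback:",
          PRIORITY_STEPS * PER_PRIORITY_SECONDS / 3600.0)
```

A larger cap per attempt keeps the number of retries in each half-hour window manageable, which is the reason the commit gives for raising the maximum wait interval.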
  9. 04 Apr, 2011 2 commits
    • daemon.py: move startup log message before prep_fn · fe295df3
      Iustin Pop authored
      
      
      Before this, the output in the rapi daemon log was:
      2011-04-04 03:09:51,026: ganeti-rapi pid=17447 INFO Reading users file
      at /var/lib/ganeti/rapi/users
      2011-04-04 03:09:51,027: ganeti-rapi pid=17447 INFO ganeti-rapi daemon
      startup
      
      Which is confusing, as it might look like the read of the users file
      is part of the previous run. This is because we log the 'daemon
      startup' message after the prepare_fn, which can log things on its
      own.
      
      The patch simply moves the 'daemon startup' message to just before
      the prepare_fn call.
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>
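The ordering this commit establishes can be sketched in a few lines. The function `run_daemon` and its parameters are hypothetical and only mirror the prepare and execute steps named in the message; this is not the code in daemon.py.

```python
import logging


def run_daemon(daemon_name, prepare_fn, exec_fn):
    """Log the startup banner first, then prepare, then run the daemon."""
    # Logging before prepare_fn keeps anything it logs below this banner.
    logging.info("%s daemon startup", daemon_name)
    prep_results = prepare_fn()
    return exec_fn(prep_results)
```

With this order, a message such as "Reading users file" emitted by the prepare step appears after the startup line of the same run.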
    • Display the actual memory values in N+1 failures · 0942620b
      Iustin Pop authored
      
      
      This changes the display from:
      Mon Apr  4 02:29:46 2011 * Verifying N+1 Memory redundancy
      Mon Apr  4 02:29:46 2011   - ERROR: node node2: not enough memory to
      accomodate instance failovers should node node1 fail
      
      To:
      
      Mon Apr  4 02:32:50 2011 * Verifying N+1 Memory redundancy
      Mon Apr  4 02:32:50 2011   - ERROR: node node2: not enough memory to
      accomodate instance failovers should node node1 fail (33536MiB needed,
      27910MiB available)
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>
  10. 31 Mar, 2011 1 commit
  11. 28 Mar, 2011 1 commit
  12. 24 Mar, 2011 2 commits
  13. 17 Mar, 2011 2 commits
  14. 16 Mar, 2011 1 commit
    • locking: Fix race condition in lock monitor · e4e35357
      Michael Hanselmann authored
      
      
      In some rare cases it can happen that a lock is re-created very soon
      after deletion, while the old instance hasn't been destructed yet. In
      such a case the code would detect a duplicate name and raise an
      exception.
      
      We have seen at least one case where this happened during the creation
      of many instances. It is not exactly clear how it came to be, but it
      appears to have occurred while different jobs fought for locks with
      short timeouts (in the case of instance creation locks are added at this
      stage and removed shortly after if not all locks can be acquired).
      
      The issue is fixed by removing the check for duplicate names. To still
      guarantee a stable sort order for the lock information as shown by
      “gnt-debug locks”, a registration number is recorded for each lock in
      the monitor.
      
      A unittest is included to check for the situation.
      Signed-off-by: Michael Hanselmann <hansmi@google.com>
      Reviewed-by: Iustin Pop <iustin@google.com>
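The idea of replacing the duplicate-name check with a registration number can be sketched as below. The classes `Lock` and `LockMonitor`, their methods, and the use of weak references are assumptions for illustration; the commit message only says that a registration number is recorded for each lock in the monitor.

```python
import itertools
import weakref


class Lock(object):
    """Stand-in for a named lock; only the name matters for this sketch."""

    def __init__(self, name):
        self.name = name


class LockMonitor(object):
    """Tracks live locks without requiring their names to be unique."""

    def __init__(self):
        self._counter = itertools.count(0)
        # Weak keys: a lock drops out once nothing else references it.
        self._locks = weakref.WeakKeyDictionary()

    def register(self, lock):
        # No duplicate-name check: a re-created lock may briefly coexist
        # with an old instance that has not been destroyed yet.
        self._locks[lock] = next(self._counter)

    def names_in_order(self):
        # Sorting by registration number keeps the output order stable
        # even when two entries share a name.
        ordered = sorted(self._locks.items(), key=lambda item: item[1])
        return [lock.name for (lock, _regnum) in ordered]
```

Registering two locks with the same name no longer raises, and the listing order stays deterministic because it depends on the counter rather than on the names alone.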
  15. 15 Mar, 2011 1 commit
  16. 11 Mar, 2011 2 commits
  17. 10 Mar, 2011 4 commits
  18. 09 Mar, 2011 1 commit
  19. 08 Mar, 2011 1 commit
    • cfgupgrade: Fix critical bug overwriting RAPI users file · 87c80992
      Michael Hanselmann authored
      
      
      The cfgupgrade tool was designed to be idempotent, meaning it can be
      run several times and still produce the correct result. Ganeti 2.4
      moved the file containing the RAPI users to a separate directory
      (…/lib/ganeti/rapi/users). cfgupgrade would automatically move an
      existing file from …/lib/ganeti/rapi_users to the new location and
      replace it with a symlink.
      
      Unfortunately one of the checks for this was incorrect and, when
      cfgupgrade was run multiple times, it replaced the users file at the
      new location with a symlink created during a previous run.
      
      In addition the “--dry-run” parameter to cfgupgrade was not respected.
      Unittests are updated for all these cases.
      Signed-off-by: Michael Hanselmann <hansmi@google.com>
      Reviewed-by: Iustin Pop <iustin@google.com>
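An idempotent version of the move-and-symlink step, including a dry-run mode, might look like the following sketch. The function name, its parameters and the exact checks are assumptions; they illustrate the property the commit restores and are not the checks used in cfgupgrade.

```python
import os
import shutil


def migrate_users_file(old_path, new_path, dry_run=False):
    """Move old_path to new_path and leave a symlink behind, idempotently.

    Returns True if a migration is (or, with dry_run, would be) performed.
    """
    # A symlink at the old location means a previous run already migrated.
    if os.path.islink(old_path):
        return False
    # Nothing to migrate if there is no regular file at the old location.
    if not os.path.isfile(old_path):
        return False
    # Never overwrite a users file that already exists at the new location.
    if os.path.lexists(new_path):
        raise RuntimeError("%s already exists" % new_path)
    if dry_run:
        return True

    new_dir = os.path.dirname(new_path)
    if not os.path.isdir(new_dir):
        os.makedirs(new_dir)
    shutil.move(old_path, new_path)
    os.symlink(new_path, old_path)
    return True
```

Checking for the symlink before anything else is what makes a second run a no-op: the file at the new location is never touched again once the old path has become a link.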
  20. 07 Mar, 2011 4 commits
  21. 04 Mar, 2011 2 commits