  1. Apr 06, 2011
    • Increase the lock timeouts before we block-acquire · d385a174
      Iustin Pop authored
      
      This has been observed to cause problems on real clusters via the
      following mechanism:
      
      - a long job (e.g. a replace-disks) is keeping an exclusive lock on an
        instance
      - the watcher starts and submits its query instances opcode which
        wants shared locks for all instances
      - after about an hour, the watcher job falls back to blocking acquire,
        after having acquired all other locks
      - any instance opcode that wants an exclusive lock for an instance
        cannot start until the watcher has finished, even though there's no
        actual operation on that instance
      
      To alleviate this problem, we simply increase the maximum timeout
      before lock acquires fall back to either a blocking acquire or a
      priority increase. The timeout is computed such that we wait ~10 hours
      (instead of one) for this to happen, which should be within the
      maximum lifetime of a reasonable opcode on a healthy cluster. The
      timeout also means that priority increases will happen every half hour.
      
      We also increase the max wait interval to 15 seconds; otherwise the
      longer total timeout would result in too many retries.
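
      A minimal sketch of this escalation, with illustrative constants and a
      made-up helper (not the actual Ganeti locking code): individual timed
      attempts grow up to a per-attempt cap, and only once a total budget of
      roughly ten hours is exhausted does the job fall back to a blocking
      acquire.

        MAX_WAIT = 15.0             # cap for a single timed acquire (seconds)
        TOTAL_BUDGET = 10 * 3600.0  # ~10 hours before blocking acquire
        PRIORITY_STEP = 30 * 60.0   # bump job priority every ~half hour

        def attempt_timeouts(start=1.0, factor=1.5):
            """Yield per-attempt timeouts until the total budget is used up."""
            timeout = start
            total = 0.0
            while total < TOTAL_BUDGET:
                timeout = min(timeout * factor, MAX_WAIT)
                total += timeout
                yield timeout
            # when the generator is exhausted the caller switches to a
            # blocking acquire; priority can be raised for every
            # PRIORITY_STEP seconds of accumulated waiting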
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>
  2. Apr 04, 2011
    • daemon.py: move startup log message before prep_fn · fe295df3
      Iustin Pop authored
      
      Before this, the output in the rapi daemon log was:
      2011-04-04 03:09:51,026: ganeti-rapi pid=17447 INFO Reading users file
      at /var/lib/ganeti/rapi/users
      2011-04-04 03:09:51,027: ganeti-rapi pid=17447 INFO ganeti-rapi daemon
      startup
      
      This is confusing, as it might look like the read of the users file
      is part of the previous run. The reason is that we log the 'daemon
      startup' message after prepare_fn, which can log messages of its
      own.
      
      The patch simply moves the 'daemon startup' message to just before
      the prepare_fn call.
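
      A simplified sketch of the new ordering (the function and its arguments
      are stand-ins, not the actual daemon.py code): the startup message is
      logged before prepare_fn runs, so anything prepare_fn logs is clearly
      attributed to the current run.

        import logging

        def generic_main(daemon_name, prepare_fn, exec_fn):
            logging.info("%s daemon startup", daemon_name)  # now logged first
            prep_result = prepare_fn() if prepare_fn else None
            exec_fn(prep_result)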
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>
    • Display the actual memory values in N+1 failures · 0942620b
      Iustin Pop authored
      
      This changes the display from:
      Mon Apr  4 02:29:46 2011 * Verifying N+1 Memory redundancy
      Mon Apr  4 02:29:46 2011   - ERROR: node node2: not enough memory to
      accomodate instance failovers should node node1 fail
      
      To:
      
      Mon Apr  4 02:32:50 2011 * Verifying N+1 Memory redundancy
      Mon Apr  4 02:32:50 2011   - ERROR: node node2: not enough memory to
      accomodate instance failovers should node node1 fail (33536MiB needed,
      27910MiB available)
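
      A tiny illustration of the new message format, using the numbers from
      the example above (variable names are assumptions, not the
      cluster-verify code):

        needed_mib = 33536   # memory needed on node2 to absorb node1's instances
        avail_mib = 27910    # memory currently available on node2
        if needed_mib > avail_mib:
            print("not enough memory to accommodate instance failovers should"
                  " node node1 fail (%dMiB needed, %dMiB available)"
                  % (needed_mib, avail_mib))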
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>
  3. Mar 31, 2011
  4. Mar 24, 2011
    • Fix output for “gnt-job info” · d1b47b16
      Michael Hanselmann authored
      
      If the result of an opcode was a non-empty dictionary, it
      would be impossible to differentiate between input and result:
      
        Input fields:
          […]
          debug_level: 0
          fields: cluster_name,master_node,volume_group_name
          jobs: [[True, u'37922'], [True, u'37923'], [True, u'37924']]
      
      Expected output:
      
        Input fields:
          […]
          debug_level: 0
          fields: cluster_name,master_node,volume_group_name
        Result:
          jobs: [[True, u'37922'], [True, u'37923'], [True, u'37924']]
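
      A rough sketch of the formatting fix (simplified, not the actual
      gnt-job code): input and result are printed under separate headings,
      so a dictionary-valued result can no longer be mistaken for input
      fields.

        def format_opcode(op_input, op_result):
            lines = ["Input fields:"]
            lines += ["  %s: %s" % (k, v) for k, v in sorted(op_input.items())]
            lines += ["Result:"]
            lines += ["  %s: %s" % (k, v) for k, v in sorted(op_result.items())]
            return "\n".join(lines)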
      
      Signed-off-by: Michael Hanselmann <hansmi@google.com>
      Reviewed-by: Iustin Pop <iustin@google.com>
  5. Mar 17, 2011
  6. Mar 16, 2011
    • locking: Fix race condition in lock monitor · e4e35357
      Michael Hanselmann authored
      
      In some rare cases it can happen that a lock is re-created very soon
      after deletion, while the old instance hasn't been destructed yet. In
      such a case the code would detect a duplicate name and raise an
      exception.
      
      We have seen at least one case where this happened during the creation
      of many instances. It is not exactly clear how it came about, but it
      appears to have occurred while different jobs fought for locks with
      short timeouts (during instance creation, locks are added at this
      stage and removed shortly afterwards if not all locks can be acquired).
      
      The issue is fixed by removing the check for duplicate names. To still
      guarantee a stable sort order for the lock information as shown by
      “gnt-debug locks”, a registration number is recorded for each lock in
      the monitor.
      
      A unittest is included to check for the situation.
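
      A hedged sketch of the approach (heavily simplified compared to the
      real locking module): every registration gets a monotonically
      increasing number, so duplicate names no longer need to be rejected
      and the order shown by “gnt-debug locks” stays stable.

        import itertools
        from collections import namedtuple

        LockInfo = namedtuple("LockInfo", ["name"])

        class LockMonitor(object):
            def __init__(self):
                self._counter = itertools.count(0)
                self._locks = {}           # registration number -> lock info

            def RegisterLock(self, info):
                num = next(self._counter)  # no duplicate-name check any more
                self._locks[num] = info

            def QueryLocks(self):
                # sort by (name, registration number) for a stable order
                items = sorted(self._locks.items(),
                               key=lambda item: (item[1].name, item[0]))
                return [info for _, info in items]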
      
      Signed-off-by: Michael Hanselmann <hansmi@google.com>
      Reviewed-by: Iustin Pop <iustin@google.com>
  7. Mar 15, 2011
  8. Mar 04, 2011
  9. Feb 28, 2011
  10. Feb 25, 2011
  11. Feb 24, 2011
  12. Feb 22, 2011
  13. Feb 18, 2011
  14. Feb 17, 2011
  15. Feb 10, 2011
  16. Feb 09, 2011
    • Fix error msg for instances on offline nodes · 11dcce87
      Iustin Pop authored
      
      Currently, for both primary and secondary offline nodes, we give the
      same message:
      - ERROR: instance instance14: instance lives on offline node(s) node3
      - ERROR: instance instance15: instance lives on offline node(s) node3
      - ERROR: instance instance16: instance lives on offline node(s) node3
      - ERROR: instance instance17: instance lives on offline node(s) node3
      
      This is confusing, as an offline primary is in a different category
      than a secondary. The patch changes the warnings to have different
      error messages:
      - ERROR: instance instance14: instance has offline secondary node(s) node3
      - ERROR: instance instance15: instance has offline secondary node(s) node3
      - ERROR: instance instance16: instance lives on offline node node3
      - ERROR: instance instance17: instance lives on offline node node3
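
      A small illustration of the distinction (the helper and its arguments
      are assumptions, not the cluster-verify code): an offline primary means
      the instance itself is unavailable, while an offline secondary only
      degrades redundancy, so the two cases now produce different messages.

        def offline_node_errors(instance, primary, secondaries, offline_nodes):
            errors = []
            if primary in offline_nodes:
                errors.append("instance %s: instance lives on offline node %s"
                              % (instance, primary))
            bad_snodes = [n for n in secondaries if n in offline_nodes]
            if bad_snodes:
                errors.append("instance %s: instance has offline secondary"
                              " node(s) %s" % (instance, ", ".join(bad_snodes)))
            return errors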
      
      Thanks to Alexander Schreiber <als@google.com> for reporting this
      issue.
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Alexander Schreiber <als@google.com>
    • cluster verify and instance disks on offline nodes · a3de2ae7
      Iustin Pop authored
      
      Currently, cluster-verify says:
      
      - ERROR: instance instance14: couldn't retrieve status for disk/0 on node3: node offline
      - ERROR: instance instance14: instance lives on offline node(s) node3
      - ERROR: instance instance15: couldn't retrieve status for disk/0 on node3: node offline
      - ERROR: instance instance15: instance lives on offline node(s) node3
      
      This is redundant as the “lives on offline node” message should be all we need to
      understand the cluster situation.
      
      The patch fixes this and also corrects a very old idiom.
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Stephen Shirley <diamond@google.com>
    • Cluster verify and N+1 warnings for offline nodes · f7661f6b
      Iustin Pop authored
      
      Currently, cluster verify shows N+1 warnings for offline nodes that
      have any redundant instances, since the memory data that we have for
      those nodes is zero, so any instance will trigger the warning.
      
      As the comment says, we already list secondary instances on offline
      nodes, so that warning is enough, and we skip the N+1 one.
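
      A minimal sketch of the skip (illustrative only, not the real N+1
      check): offline nodes report zero free memory, so they are simply
      excluded from the computation and only the offline-secondary warning
      remains.

        def verify_n_plus_one(nodes, offline_nodes, needed_mib, free_mib):
            errors = []
            for node in nodes:
                if node in offline_nodes:
                    continue  # already reported via the secondary-node warning
                if needed_mib.get(node, 0) > free_mib.get(node, 0):
                    errors.append("node %s: not enough memory to accommodate"
                                  " instance failovers" % node)
            return errors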
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Stephen Shirley <diamond@google.com>
  17. Feb 08, 2011
  18. Feb 04, 2011
  19. Feb 03, 2011
    • Bump up intra-cluster import connect timeout · 81635b5a
      Iustin Pop authored
      
      Currently, the export timeout is 10 times 20 seconds, but the import
      is only 30 seconds. I'm raising this to 60 seconds with two goals in
      mind:
      
      - when debugging manually, this allows for easier synchronisation of
        the processes
      - 60 seconds equals three full 20-second intervals, which I think is
        better than just one and a half
      
      This change shouldn't make a big difference either way (at most, it
      will possibly delay the job in case of failures by half a minute).
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>
    • Import-export: fix logging of daemon output · c9300bb3
      Iustin Pop authored
      
      In case of failures, the recent daemon output is logged by applying
      %r to a list of unicode strings, which results in this (ugly) output:
      
      Thu Feb  3 05:13:34 2011 snapshot/0 failed to send data: Exited with status 1 (recent output: [u'  DUMP: Date of this level 0 dump: Thu Feb  3 05:13:18 2011', u'  DUMP: Dumping /dev/mapper/6369a5f7-1e67-4d0d-a4f0-956b3649c6d7.disk0_data.snap-1 (an unlisted file system) to standard output', u'  DUMP: Label: none', u'  DUMP: Writing 10 Kilobyte records', u'  DUMP: mapping (Pass I) [regular files]', u'  DUMP: mapping (Pass II) [directories]', u'  DUMP: estimated 54301 blocks.', u'  DUMP: Volume 1 started with block 1 at: Thu Feb  3 05:13:19 2011', u'  DUMP: dumping (Pass III) [directories]', u'  DUMP: dumping (Pass IV) [regular files]', u'socat: E SSL_write(): Connection reset by peer', u"dd: dd: writing `standard output': Broken pipe", u'  DUMP: Broken pipe', u'  DUMP: The ENTIRE dump is aborted.'])
      
      This patch joins the list into a single non-unicode string, thus
      resulting in the more readable (and ~10% shorter) output:
      
      Thu Feb  3 05:16:04 2011 snapshot/0 failed to send data: Exited with status 1 (recent output:   DUMP: Date of this level 0 dump: Thu Feb  3 05:15:58 2011\n  DUMP: Dumping /dev/mapper/6369a5f7-1e67-4d0d-a4f0-956b3649c6d7.disk0_data.snap-1 (an unlisted file system) to standard output\n  DUMP: Label: none\n  DUMP: Writing 10 Kilobyte records\n  DUMP: mapping (Pass I) [regular files]\n  DUMP: mapping (Pass II) [directories]\n  DUMP: estimated 54350 blocks.\n  DUMP: Volume 1 started with block 1 at: Thu Feb  3 05:15:59 2011\n  DUMP: dumping (Pass III) [directories]\nsocat: E SSL_write(): Connection reset by peer\ndd: dd: writing `standard output': Broken pipe\n  DUMP: Broken pipe\n  DUMP: The ENTIRE dump is aborted.)
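
      The underlying change, roughly (variable names are illustrative, not
      the actual Ganeti code): instead of interpolating the list with %r,
      the recent output lines are joined into one plain string first.

        recent_output = [u"  DUMP: Label: none",
                         u"socat: E SSL_write(): Connection reset by peer"]

        # before: "recent output: %r" % recent_output   (repr of the list)
        message = "recent output: %s" % str("\n".join(recent_output))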
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>
    • Fix handling of ^C in the CLI scripts · 8a53b55f
      Iustin Pop authored
      
      This adds a message and nice handling of ^C, especially useful for
      ``gnt-job watch``.
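
      A hedged sketch of the behaviour (the wrapper and wording are
      illustrative, not the actual cli.py code): KeyboardInterrupt is caught
      at the top level and turned into a short message and a clean exit code
      instead of a traceback.

        import sys

        def main_wrapper(run_fn):
            try:
                return run_fn()
            except KeyboardInterrupt:
                sys.stderr.write("Aborted. Note that the submitted job may"
                                 " continue to run in the background.\n")
                return 1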
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
    • backend: Disable compression in export info file · 775b8743
      Michael Hanselmann authored
      
      The new import/export infrastructure in Ganeti 2.2 and up handles
      compression differently. It no longer writes compressed files to the
      destination. Unfortunately changing this behaviour would be non-trivial,
      so in the meantime setting “compression = none” will hopefully avoid
      some confusion.
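
      A rough sketch of the idea (the section and option names are
      assumptions, not the exact export info file format): the info file
      written alongside the backup now records that the dumped data is not
      compressed.

        try:
            import configparser                  # Python 3
        except ImportError:
            import ConfigParser as configparser  # Python 2

        config = configparser.ConfigParser()
        config.add_section("export")
        config.set("export", "compression", "none")
        with open("export_info.ini", "w") as fh:
            config.write(fh)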
      
      Signed-off-by: Michael Hanselmann <hansmi@google.com>
      Reviewed-by: Iustin Pop <iustin@google.com>
  20. Feb 02, 2011
  21. Feb 01, 2011