  1. Dec 23, 2010
    • Change the balancing function · 4715711d
      Iustin Pop authored
      
      Currently the balancing function is a modified version of the standard
      deviation (stddev divided by list length), due to historical reasons.
      
      While this works fine for small clusters, for big clusters it makes
      the balancing effect too "weak", and in some cases it fails to
      balance some clusters correctly. It also makes the balancing behaviour
      dependent on the cluster size, which is a big no-no.
      
      Therefore we revert to the normal version of standard deviation, and
      we also rename the function to reflect what it does. The new version
      correctly balances some corner cases that the previous version didn't,
      and passes the current balancing unittests.
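
To see why dividing by the list length ties the score to the cluster size, here is a minimal Haskell sketch. The names `stdDev`, `oldMetric` and `scaleUp` are hypothetical, and the population form of the standard deviation is an assumption; the commit message does not show the actual code.

```haskell
-- Hypothetical sketch; names are illustrative, not the project's API.

-- | Population standard deviation (assumed variant).
stdDev :: [Double] -> Double
stdDev xs = sqrt (sum [(x - m) * (x - m) | x <- xs] / n)
  where
    n = fromIntegral (length xs)
    m = sum xs / n

-- | The older metric as described above: stddev divided by list length.
oldMetric :: [Double] -> Double
oldMetric xs = stdDev xs / fromIntegral (length xs)

-- | The same load pattern repeated over k times as many nodes.
scaleUp :: Int -> [Double] -> [Double]
scaleUp k = concat . replicate k
```

For the two-node pattern `[0, 1]`, `stdDev` gives the same value when the pattern is repeated across 20 nodes with `scaleUp 10`, while `oldMetric` becomes ten times smaller, so an equally unbalanced large cluster looks almost balanced.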
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Adeodato Simo <dato@google.com>
  2. Nov 23, 2010
    • Improve the standard deviation computation · 7570569e
      Iustin Pop authored
      
      This does just two passes, instead of three, over the list. This reduces
      the overall runtime noticeably (~25%) in some tests, but it's not
      reproducible using profiling, so I don't know how much the function
      itself is being sped up.
      
      Note: this is written via `seq`s, and not BangPatterns. Since it's just
      one case, adding BangPatterns just for it wasn't a big gain.
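
A two-pass computation forced with `seq` might look like the sketch below. This is an assumption about the general shape, not the committed code: the name `stdDevTwoPass` is hypothetical, the population variant is assumed, and an empty list yields NaN.

```haskell
import Data.List (foldl')

-- | Hypothetical two-pass standard deviation (population variant assumed).
stdDevTwoPass :: [Double] -> Double
stdDevTwoPass xs =
  let -- Pass 1: accumulate the sum and the element count together.
      (total, count) = foldl' step (0, 0) xs
      -- `seq` forces both accumulators, so no chain of thunks builds up.
      step (s, n) x =
        let s' = s + x
            n' = n + 1
        in s' `seq` n' `seq` (s', n')
      m = total / count
      -- Pass 2: sum of squared differences from the mean.
      sq = foldl' (\acc x -> acc + (x - m) * (x - m)) 0 xs
  in sqrt (sq / count)
```

A three-pass version would typically compute the length, the sum and the squared differences in separate traversals; merging the first two into one fold is what saves a pass here.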
      
      Thanks to Lécz Balázs for the impetus to improve this!
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Balazs Lecz <leczb@google.com>
    • hbal: change handling of signal · 543e859d
      Iustin Pop authored
      
      Currently, hbal does a one-two signal handling, where the first signal
      causes graceful termination, and the second one an immediate one (either
      SIGINT or SIGTERM can be used, interchangeably). However, this poses a
      timing problem: if two programs want to send a graceful termination
      request, they cannot do that without careful coordination.
      
      To fix this, we change the code to handle the signals separately: SIGINT
      (^C) requests graceful termination, while SIGTERM requests immediate
      termination. This should make hbal easier to control.
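
The split between the two signals can be sketched with the `unix` package as below. The function name `installHandlers`, the shared `IORef Bool` flag and the exit code 1 are assumptions made for illustration; the commit message does not describe how hbal stores the request.

```haskell
import Data.IORef (IORef, newIORef, readIORef, writeIORef)
import System.Exit (ExitCode (ExitFailure))
import System.Posix.Process (exitImmediately)
import System.Posix.Signals (Handler (Catch), installHandler, sigINT, sigTERM)

-- | Install both handlers and return the flag that a main loop could poll
-- between jobsets (hypothetical design).
installHandlers :: IO (IORef Bool)
installHandlers = do
  stopRequested <- newIORef False
  -- SIGINT (^C): only record the request; running jobs are left to finish.
  _ <- installHandler sigINT (Catch (writeIORef stopRequested True)) Nothing
  -- SIGTERM: leave right away, without waiting for running jobs.
  _ <- installHandler sigTERM (Catch (exitImmediately (ExitFailure 1))) Nothing
  return stopRequested
```

Because each signal has a fixed meaning, two programs can both send SIGINT without one of them accidentally escalating the request to an immediate exit.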
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Balazs Lecz <leczb@google.com>
  3. Nov 09, 2010
    • Fix tag exclusion weight · 306cccd5
      Iustin Pop authored
      Currently, the tag exclusion metric has a weight of one, which means
      there might be cases where we won't move instances around because it
      upsets the cluster metrics. However, we do want to make a higher effort
      for cleaning up tag collisions, so we increase the weight to an
      empirically-determined value of 2.
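
The effect of the weight can be illustrated with a weighted-sum score, assuming that the cluster score is a sum of weighted metrics and that a move is accepted only when it lowers the total. All names and the metric values below are made up for the example.

```haskell
-- | One component of the score: (weight, metric value). Hypothetical model.
type Component = (Double, Double)

-- | Total score as a weighted sum; lower is assumed to be better.
clusterScore :: [Component] -> Double
clusterScore cs = sum [w * v | (w, v) <- cs]

-- | A move is accepted only if it lowers the total score.
improves :: [Component] -> [Component] -> Bool
improves before after = clusterScore after < clusterScore before

-- | A move that clears a tag collision (1 -> 0) but worsens the remaining
-- metrics by 1.5 (illustrative numbers), for a given tag weight.
moveAccepted :: Double -> Bool
moveAccepted tagWeight =
  improves [(tagWeight, 1), (1, 0)] [(tagWeight, 0), (1, 1.5)]
```

With these numbers, `moveAccepted 1` is `False` and `moveAccepted 2` is `True`: doubling the weight lets the collision cleanup outweigh the disturbance to the other metrics.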
  4. Oct 21, 2010
  5. Oct 07, 2010
  6. Oct 06, 2010
  7. Sep 15, 2010
    • hbal: implement user-friendly termination requests · 03cb89f0
      Iustin Pop authored
      Currently, hbal will abort immediately when requested (^C, or SIGINT,
      etc.). This is not nice, since then the already started jobs need to be
      tracked manually.
      
      This patch adds a signal handler for SIGINT and SIGTERM, which will, the
      first time, simply record the shutdown request (and hbal will then exit
      once all jobs in the current jobset finish), and at the second request,
      will cause an immediate exit.
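
The first-request/second-request behaviour amounts to a small state machine. The sketch below models it as a pure function; the type `Reaction` and the function `onTerminationRequest` are hypothetical names, not taken from hbal.

```haskell
-- | What the program should do after a termination request arrives.
data Reaction = FinishCurrentJobset | ExitImmediately
  deriving (Eq, Show)

-- | Given the number of requests seen so far, return the updated count
-- and the reaction to the new request.
onTerminationRequest :: Int -> (Int, Reaction)
onTerminationRequest seen
  | seen == 0 = (1, FinishCurrentJobset)   -- first request: record it
  | otherwise = (seen + 1, ExitImmediately) -- any later request: exit now
```

A signal handler for SIGINT and SIGTERM would call this with the current count and act on the returned reaction.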
  8. Sep 03, 2010
  9. Sep 02, 2010
    • Makefile: make the rst2html converter more strict · d78ceb9e
      Iustin Pop authored
      This will make the automated builds flag any problems.
    • Add some more debugging functions · adc5c176
      Iustin Pop authored
      These are just variations of the standard debug, but are provided for
      simpler code, since laziness can cause debug statements to never be
      evaluated.
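
Helpers of this kind are usually built on `Debug.Trace.trace` plus `seq`, so that the trace is forced even when its result is otherwise unused. The definitions below are a sketch with hypothetical names; the commit message does not list the functions it adds.

```haskell
import Debug.Trace (trace)

-- | Print a value to stderr when it is evaluated, and return it.
debug :: Show a => a -> a
debug x = trace (show x) x

-- | Print the result of applying a function, but return the original value.
debugFn :: Show b => (a -> b) -> a -> a
debugFn f x = debug (f x) `seq` x

-- | Print the first argument, return the second.
debugXy :: Show a => a -> b -> b
debugXy x y = debug x `seq` y
```

Without the `seq`, an expression such as `let _ = debug x in y` would never print anything, because the unused binding is never evaluated.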
    • Fix ReplaceSecondary moves for offline nodes · 74e89a14
      Iustin Pop authored
      The addition of a new secondary on a node is doing two memory tests:
      - in strict mode, reject if we get into N+1 failure
      - reject if the new instance memory is greater than the free memory (not
        available memory) on the node
      
      The last check is designed to ensure that, irrespective of the other
      secondary instances on this node, we are able to failover/migrate the
      newly-added instance.
      
      However, we should allow this if the instance comes from an offline
      node, which doesn't offer anything (not even disk replication).
      Therefore this patch makes this check conditional on the strict mode.
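
After the change, both memory tests apply only in strict mode. A minimal sketch of that decision follows; the record `SecondaryCheck` and its fields are hypothetical, and only the two memory tests from the description are modelled (any other checks are left out).

```haskell
-- | Hypothetical inputs for adding a secondary instance to a node.
data SecondaryCheck = SecondaryCheck
  { strictMode   :: Bool -- ^ whether strict checks are requested
  , causesN1Fail :: Bool -- ^ the node would fail N+1 after the addition
  , instanceMem  :: Int  -- ^ memory of the new instance
  , nodeFreeMem  :: Int  -- ^ free (not available) memory on the node
  }

-- | Memory-related acceptance of a new secondary.
allowSecondary :: SecondaryCheck -> Bool
allowSecondary c
  | not (strictMode c) = True               -- non-strict: skip both tests
  | causesN1Fail c     = False              -- strict: reject N+1 failure
  | otherwise          = instanceMem c <= nodeFreeMem c
```

Moves away from an offline node would then run in non-strict mode and no longer be rejected by the free-memory test.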
    • Update NEWS file · 49d977db
      Iustin Pop authored
  10. Aug 30, 2010
  11. Aug 25, 2010
    • Add a new option --save-cluster · 02da9d07
      Iustin Pop authored
      This option will in the future be used to serialize the cluster state in
      hbal and hspace after the rebalance/allocation steps.
    • Add unittest for Node text serialization · 50811e2c
      Iustin Pop authored
      This checks that the Node text serialization and deserialization
      operations are idempotent when combined with each other.
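
The property being tested is a round trip: parsing the serialized form should give back the original value. The sketch below uses a simplified, hypothetical `Node` with three fields and a `|`-separated text line; the real data type and format are not shown in the commit message.

```haskell
import Data.List (intercalate)

-- | Simplified, hypothetical node record.
data Node = Node { nodeName :: String, nodeMem :: Int, nodeDisk :: Int }
  deriving (Eq, Show)

-- | Serialize to a single '|'-separated line (assumed format).
serializeNode :: Node -> String
serializeNode (Node n m d) = intercalate "|" [n, show m, show d]

-- | Split a string on a separator character.
splitOn :: Char -> String -> [String]
splitOn sep s = case break (== sep) s of
    (chunk, [])       -> [chunk]
    (chunk, _ : rest) -> chunk : splitOn sep rest

-- | Parse a line back into a node, failing on malformed input.
parseNode :: String -> Maybe Node
parseNode line = case splitOn '|' line of
    [n, m, d] -> Node n <$> readInt m <*> readInt d
    _         -> Nothing
  where
    readInt str = case reads str of
        [(v, "")] -> Just v
        _         -> Nothing

-- | The round-trip property a unit test would check.
prop_roundTrip :: Node -> Bool
prop_roundTrip node = parseNode (serializeNode node) == Just node
```

A name containing the separator character would not survive the round trip in this sketch, which is one reason a test benefits from generating realistic hostnames rather than arbitrary strings.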
    • Switch unittest to custom hostnames · a070c426
      Iustin Pop authored
      Currently, the hostnames are almost fully arbitrary chars, which breaks
      the assumption that nodes/instances will be normal DNS hostnames.
      
      This patch adds some custom generators for these hostnames, that will
      allow better testing of text loader serialization/deserialization.
  12. Aug 24, 2010
  13. Jul 29, 2010
  14. Jul 27, 2010
  15. Jul 23, 2010
  16. Jul 22, 2010
  17. Jul 21, 2010
    • Use --union for hpc sum · 7e9e8245
      Iustin Pop authored
      … which fixes the issue noted in the previous commit (almost a brown
      paper bag change).
    • Preliminary support for coverage during live-test · dc61c50b
      Iustin Pop authored
      While this doesn't work correctly yet (hpc sum seems to only take common
      modules, not the sum of modules?), it prepares for gathering coverage
      data during live-test (as an alternative to unittest coverage data).
    • Add some more imports to QC.hs · 223dbe53
      Iustin Pop authored
      This is needed so that in the coverage report we list all modules, even
      the ones we don't test at all, such that we get the complete results.
    • Change the meaning of the N+1 fail metric · c3c7a0c1
      Iustin Pop authored
      Currently, this metric tracks the nodes failing the N+1 check. While
      this helps (in some cases) to evacuate such nodes, it's not a good
      metric since it will rarely change during a step (only when the last
      instance moves away). Therefore we replace it with the count of
      instances living on such nodes, which is much better because:
      - moving an instance away while the node is still N+1 failing will still
        reflect in the score as an optimization
      - moving the last instance causing an N+1 failure will result in a heavy
        decrease of this score, thus giving the right bonus to clear this
        status
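
The difference between the two metrics can be shown on a toy model. The `Node` record and both function names below are hypothetical, and the sketch counts all instances on a failing node without distinguishing primary from secondary.

```haskell
-- | Toy node: whether it fails the N+1 check and which instances it hosts.
data Node = Node { failsN1 :: Bool, instances :: [String] }

-- | Old metric as described: number of nodes failing N+1.
failingNodes :: [Node] -> Int
failingNodes = length . filter failsN1

-- | New metric as described: number of instances on N+1-failing nodes.
instancesOnFailingNodes :: [Node] -> Int
instancesOnFailingNodes nodes =
  sum [length (instances n) | n <- nodes, failsN1 n]
```

Moving one of three instances off a node that keeps failing leaves `failingNodes` at 1 but lowers `instancesOnFailingNodes` from 3 to 2; once the node stops failing, the new metric drops to 0 in a single step.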