1. 02 Sep, 2010 2 commits
    • Fix ReplaceSecondary moves for offline nodes · 74e89a14
      Iustin Pop authored
      Adding a new secondary to a node performs two memory tests:
      - in strict mode, reject if we get into N+1 failure
      - reject if the new instance memory is greater than the free memory (not
        available memory) on the node
      
      The last check is designed to ensure that, irrespective of the other
      secondary instances on this node, we are able to failover/migrate the
      newly-added instance.
      
      However, we should allow this if the instance comes from an offline
      node, which doesn't offer anything (not even disk replication).
      Therefore this patch makes this check conditional on the strict mode.
      74e89a14
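A minimal sketch of the behaviour described above, assuming a heavily simplified node model. The `Node` fields, the `FailMode` constructors and the `addSecondary` name are hypothetical and do not mirror the real code; the reservation is modelled as a plain sum, which assumes all secondaries share one primary peer.

```haskell
-- Hypothetical, simplified node model; field names are assumptions.
data Node = Node
  { freeMem     :: Int  -- memory currently free on the node
  , reservedMem :: Int  -- memory reserved for failing over secondaries
  } deriving (Show, Eq)

data FailMode = FailMem | FailN1
  deriving (Show, Eq)

-- | Add a secondary instance needing 'instMem' memory.
-- Both memory tests are applied only when 'strict' is True.
addSecondary :: Bool -> Int -> Node -> Either FailMode Node
addSecondary strict instMem node
  | strict && instMem > freeMem node            = Left FailMem
  | strict && reservedMem node' > freeMem node' = Left FailN1
  | otherwise                                   = Right node'
  where
    -- Assumption: reservations simply add up (single primary peer).
    node' = node { reservedMem = reservedMem node + instMem }
```

With `strict` set to False (the instance comes from an offline node), the same call that would be rejected with `FailMem` is accepted.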
    • Update NEWS file · 49d977db
      Iustin Pop authored
      49d977db
  2. 30 Aug, 2010 6 commits
  3. 25 Aug, 2010 3 commits
    • Add a new option --save-cluster · 02da9d07
      Iustin Pop authored
      This option will in the future be used to serialize the cluster state in
      hbal and hspace after the rebalance/allocation steps.
      02da9d07
    • Add unittest for Node text serialization · 50811e2c
      Iustin Pop authored
      This checks that Node text serialization followed by deserialization
      returns the original node, i.e. that the two operations are idempotent
      when combined with each other.
      50811e2c
    • Switch unittest to custom hostnames · a070c426
      Iustin Pop authored
      Currently, the hostnames are almost fully arbitrary chars, which breaks
      the assumption that nodes/instances will be normal DNS hostnames.
      
      This patch adds some custom generators for these hostnames, that will
      allow better testing of text loader serialization/deserialization.
      a070c426
  4. 24 Aug, 2010 1 commit
  5. 29 Jul, 2010 1 commit
  6. 27 Jul, 2010 3 commits
  7. 23 Jul, 2010 2 commits
  8. 22 Jul, 2010 3 commits
  9. 21 Jul, 2010 7 commits
    • Use --union for hpc sum · 7e9e8245
      Iustin Pop authored
      … which fixes the issue noted in the previous commit (almost a brown
      paper bag change).
      7e9e8245
    • Preliminary support for coverage during live-test · dc61c50b
      Iustin Pop authored
      While this doesn't work correctly yet (hpc sum seems to only take common
      modules, not the sum of modules?), it prepares for gathering coverage
      data during live-test (as an alternative to unittest coverage data).
      dc61c50b
    • Add some more imports to QC.hs · 223dbe53
      Iustin Pop authored
      This is needed so that in the coverage report we list all modules, even
      the ones we don't test at all, such that we get the complete results.
      223dbe53
    • Change the meaning of the N+1 fail metric · c3c7a0c1
      Iustin Pop authored
      Currently, this metric tracks the nodes failing the N+1 check. While
      this helps (in some cases) to evacuate such nodes, it's not a good
      metric since it will rarely change during a step (only when the last
      instance moves away). Therefore we replace it with the count of
      instances living on such nodes, which is much better because:
      - moving an instance away while the node is still N+1 failing will still
        reflect in the score as an optimization
      - moving the last instance causing an N+1 failure will result in a heavy
        decrease of this score, thus giving the right bonus to clear this
        status
      c3c7a0c1
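The difference between the two metrics can be sketched as follows. This is an illustration only: the `Node` record and the `oldMetric`/`newMetric` names are hypothetical, and the N+1 status is taken as a given flag rather than computed.

```haskell
-- Hypothetical node summary: whether it fails N+1 and how many
-- instances live on it.
data Node = Node
  { failN1    :: Bool
  , instCount :: Int
  }

-- | Old meaning: number of nodes failing the N+1 check.
oldMetric :: [Node] -> Int
oldMetric = length . filter failN1

-- | New meaning: number of instances living on N+1 failing nodes.
newMetric :: [Node] -> Int
newMetric = sum . map instCount . filter failN1
```

Moving one instance off a failing node that stays failing leaves `oldMetric` unchanged but lowers `newMetric` by one, so the move shows up in the score as an improvement.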
    • Introduce per-metric weights · 8a3b30ca
      Iustin Pop authored
      Currently all metrics have the same weight (we just sum them together).
      However, for the hard constraints (N+1 failures, offline nodes, etc.)
      we should handle the metrics differently based on their meaning. For
      example, an instance living on a primary offline node is worse than an
      instance having its secondary node offline, which in turn is worse than
      an instance having its secondary node failing N+1.
      
      To express this case in our code, we introduce a table of weights for
      the metrics, with which we can influence their relative importance.
      8a3b30ca
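A weight table of this kind could look roughly like the sketch below. The `Metric` constructors and all weight values are made up for illustration; only the relative ordering (offline primary worse than offline secondary worse than N+1 failure) comes from the commit message.

```haskell
-- Hypothetical metric names; the real code may track different ones.
data Metric
  = OfflinePrimary    -- instances whose primary node is offline
  | OfflineSecondary  -- instances whose secondary node is offline
  | N1Fail            -- instances on nodes failing N+1
  | MemDeviation      -- an ordinary balancing metric
  deriving (Show, Eq)

-- | Table of weights. The values are assumptions chosen only to
-- respect the ordering described in the commit message.
weight :: Metric -> Double
weight OfflinePrimary   = 16
weight OfflineSecondary = 4
weight N1Fail           = 2
weight MemDeviation     = 1

-- | Weighted sum of the metric values (previously a plain sum).
score :: [(Metric, Double)] -> Double
score = sum . map (\(m, v) -> weight m * v)
```

With all weights equal to 1 this reduces to the previous behaviour of summing the metrics.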
    • Allow balancing moves to introduce N+1 errors · 2cae47e9
      Iustin Pop authored
      This patch switches the applyMove function to the extended versions of
      Node.addPri and addSec, and passes the override flag based on the state
      of the node that we're moving away from.
      2cae47e9
    • Introduce a relaxed add instance mode · 3e3c9393
      Iustin Pop authored
      If an instance is living on an offline node, it doesn't make sense to
      refuse moving it because the move would create N+1 failures; failing
      N+1 is still much better than not running at all. Similarly, if the
      secondary node of an instance is offline, the instance has no
      redundancy at all, which is worse than having a secondary that is
      failing N+1: such a secondary could not accept the instance as
      primary, but it still provides redundancy for it.
      
      To allow this, we rename Node.addPri to addPriEx and introduce an extra
      parameter (addPri is a partial application of addPriEx and keeps the
      same signature). Node.addSec gets the same treatment.
      3e3c9393
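The partial-application pattern named in the message can be sketched like this. Only the names `addPri` and `addPriEx` come from the commit; the `Node` record, the `FailMode` type, the argument order and the exact checks are simplifying assumptions.

```haskell
-- Hypothetical, simplified node model.
data Node = Node
  { freeMem     :: Int  -- memory currently free on the node
  , reservedMem :: Int  -- memory reserved for failing over secondaries
  } deriving (Show, Eq)

data FailMode = FailMem | FailN1
  deriving (Show, Eq)

-- | Extended version: when 'force' is True the N+1 check is skipped,
-- but running out of memory is still an error.
addPriEx :: Bool -> Node -> Int -> Either FailMode Node
addPriEx force node instMem
  | newFree < 0                             = Left FailMem
  | not force && newFree < reservedMem node = Left FailN1
  | otherwise                               = Right node { freeMem = newFree }
  where
    newFree = freeMem node - instMem

-- | Strict version, keeping the pre-existing signature.
addPri :: Node -> Int -> Either FailMode Node
addPri = addPriEx False
```

Callers that never need the relaxed mode keep using `addPri` unchanged, while a caller moving an instance away from an offline node would pass `True` to `addPriEx`.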
  10. 19 Jul, 2010 3 commits
  11. 18 Jul, 2010 3 commits
    • Allow '+' in node list fields · 6dfa04fd
      Iustin Pop authored
      When the field list is prefixed with a plus sign, this will extend the
      default field list, instead of replacing it entirely.
      6dfa04fd
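A small sketch of the '+' handling, assuming the field list is a comma-separated string. The `defaultFields` contents, the separator and the function names are assumptions made for the example.

```haskell
-- | Hypothetical default field list.
defaultFields :: [String]
defaultFields = ["name", "pcnt", "scnt"]

-- | Split a string on a separator character.
splitOn :: Char -> String -> [String]
splitOn sep s = case break (== sep) s of
  (x, [])     -> [x]
  (x, _:rest) -> x : splitOn sep rest

-- | A leading '+' extends the defaults; otherwise the given list
-- replaces them entirely.
selectFields :: String -> [String]
selectFields ('+':extra) = defaultFields ++ splitOn ',' extra
selectFields spec        = splitOn ',' spec
```

For example, `selectFields "+peermap"` keeps the defaults and appends `peermap`, while `selectFields "name,idx"` yields only those two fields.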
    • Update the node list fields · 16f08e82
      Iustin Pop authored
      This patch renames the pri/sec to pcnt/scnt, and adds the real primary
      and secondary instance lists, the peermap and the index of a node as
      selectable options.
      16f08e82
    • Cleanup a node's peer map when possible · 124b7cd7
      Iustin Pop authored
      If the last secondary instance of a peer is deleted (detected by the new
      peer memory value being equal to zero), then the pair (pdx, 0) should be
      deleted completely. This is not an optimization per se, but rather a
      cleanup (the speedup is at most a percent, and only in some corner cases).
      124b7cd7
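The cleanup rule can be illustrated with a plain `Data.Map`. This is a sketch under assumptions: the real peer map type and its update function may differ, and `setPeerMem` is a hypothetical name.

```haskell
import qualified Data.Map as M

-- | Assumed representation: peer node index (pdx) mapped to the
-- memory needed by the secondaries whose primary is that peer.
type PeerMap = M.Map Int Int

-- | Record the new memory value for a peer. A value of zero means
-- the last secondary of that peer is gone, so the entry is removed
-- instead of being stored as (pdx, 0).
setPeerMem :: Int -> Int -> PeerMap -> PeerMap
setPeerMem pdx 0   = M.delete pdx
setPeerMem pdx mem = M.insert pdx mem
```

Dropping zero entries keeps the map limited to peers that actually have secondaries on the node.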
  12. 16 Jul, 2010 1 commit
  13. 21 Jun, 2010 4 commits
  14. 08 Jun, 2010 1 commit
    • Optimise the Luxi.recvMsg function · 95f490de
      Iustin Pop authored
      Since the current buffer cannot contain (during network reads) an EOM,
      we should look for the EOM only in the newly-received string.  While
      this shouldn't make much difference, in some tests it cuts the recvMsg
      total time by around half.
      
      On entering recvMsg, though, we still have to search the old buffer for
      a message, since we could have received two Luxi messages on the last
      network query; this is however a one-off cost, compared to
      continuously looking for the EOM in the old string (at each receive
      loop).
      95f490de
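The idea can be sketched in pure code, leaving out the actual socket reads. Everything here is illustrative: the function names are hypothetical, messages are modelled as `String`, and the end-of-message marker is assumed to be the single character `'\ETX'`.

```haskell
-- | Assumed end-of-message marker.
eom :: Char
eom = '\ETX'

-- | Split a string at the first EOM, if there is one.
splitAtEom :: String -> Maybe (String, String)
splitAtEom s = case break (== eom) s of
  (_, [])       -> Nothing
  (msg, _:rest) -> Just (msg, rest)

-- | One-off check on entry: the buffer left over from the previous
-- call may already hold a complete message.
takeBuffered :: String -> Maybe (String, String)
takeBuffered = splitAtEom

-- | One step of the receive loop. The old buffer is known to contain
-- no EOM, so only the newly received chunk is scanned. Returns either
-- the grown buffer or a complete message plus the leftover.
feed :: String -> String -> Either String (String, String)
feed buf chunk = case splitAtEom chunk of
  Nothing             -> Left (buf ++ chunk)
  Just (msgEnd, rest) -> Right (buf ++ msgEnd, rest)
```

A receive loop built on this would call `takeBuffered` once, then repeatedly call `feed` with each new chunk until it returns `Right`.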