Commits · 74e89a1432ee618dc13d64f8f4b1a52fb11dab71 · itminedu / snf-ganeti

Sep 02, 2010

Fix ReplaceSecondary moves for offline nodes · 74e89a14

Iustin Pop authored 14 years ago

Adding a new secondary to a node performs two memory tests:
- in strict mode, reject if we get into N+1 failure
- reject if the new instance memory is greater than the free memory (not
  available memory) on the node

The last check is designed to ensure that, irrespective of the other
secondary instances on this node, we are able to failover/migrate the
newly-added instance.

However, we should allow this if the instance comes from an offline
node, which doesn't offer anything (not even disk replication).
Therefore this patch makes this check conditional on the strict mode.
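
As a rough illustration of the two checks described above, here is a minimal sketch. The `Node` record, the `FailMode` type and the way reserved memory is computed are simplified assumptions made for this example and are not the actual htools definitions; only the name `addSecEx` is taken from the related commits.

```haskell
-- Simplified stand-in for a node: only the two memory values used here.
data Node = Node
  { fMem :: Int  -- free memory on the node
  , rMem :: Int  -- memory reserved for failing over secondaries (N+1)
  } deriving (Show, Eq)

-- Hypothetical failure reasons for this sketch.
data FailMode = FailMem | FailN1
  deriving (Show, Eq)

-- | Try to add a secondary instance needing 'mem' memory.
-- With 'force' set (for example, the instance comes from an offline
-- node) both memory checks are skipped.
addSecEx :: Bool -> Node -> Int -> Either FailMode Node
addSecEx force node mem
  | strict && mem > fMem node         = Left FailMem
  | strict && newReserved > fMem node = Left FailN1
  | otherwise                         = Right node { rMem = newReserved }
  where
    strict = not force
    -- Assumption: a single peer, so the reservation is just the maximum.
    newReserved = max (rMem node) mem
```

In this sketch, `addSecEx False (Node 1024 0) 2048` is rejected with `FailMem`, while the same call with `True` is accepted.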

74e89a14

Update NEWS file · 49d977db
Iustin Pop authored 14 years ago

49d977db

Aug 30, 2010

Update man pages for the new -S option · db43d7b3
Iustin Pop authored 14 years ago

db43d7b3
hspace: mark new instances as running · 10852adb
Iustin Pop authored 14 years ago
```
Otherwise the saved cluster state and the in-memory one are wrong.
```
10852adb

Implement cluster state saving in hspace · 3e9501d0

Iustin Pop authored 14 years ago

This also uncovered a few issues with the allocation model (instances
not being marked up, etc.).

Compared to hbal, hspace will generate either one or two files (for both
the standard and the tiered allocation mode), depending on the input
parameters.

3e9501d0

Change iterateAlloc to return the instance list · 94d08202

Iustin Pop authored 14 years ago

The Cluster.iterateAlloc and tieredAlloc functions are changed to also
return the updated instance list, since it is needed to have a “full”
cluster view.

94d08202

Implement cluster state saving in hbal · 748654f7

Iustin Pop authored 14 years ago

Also move the LUXI execution (-X) to the end, after all the output
messages are printed. There is no point in waiting a long while for the
messages, especially as they are not up-to-date stats after the job
execution, just an estimation of what the state will be.

748654f7

Abstract the cluster serialization from hscan.hs · 4a273e97

Iustin Pop authored 14 years ago

This is currently hardcoded in an internal function in hscan.hs, and we
move it to Text.hs for later use.

4a273e97

Aug 25, 2010

Add a new option --save-cluster · 02da9d07

Iustin Pop authored 14 years ago

This option will in the future be used to serialize the cluster state in
hbal and hspace after the rebalance/allocation steps.

02da9d07

Add unittest for Node text serialization · 50811e2c

Iustin Pop authored 14 years ago

This checks that the Node text serialization and deserialization
operations are idempotent when combined with each other.
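
The property being tested can be sketched as a round trip: serialize, load again, and compare with the original. The `Node` fields, the `|` separator and the function names below are assumptions chosen for this example; the real Text.hs format may differ.

```haskell
import Data.List (intercalate)

-- Hypothetical, much reduced node record.
data Node = Node
  { name    :: String
  , tMem    :: Int
  , fMem    :: Int
  , offline :: Bool
  } deriving (Show, Eq)

-- | Serialize a node as one '|'-separated text line (assumed format).
serializeNode :: Node -> String
serializeNode n =
  intercalate "|" [name n, show (tMem n), show (fMem n), flag]
  where flag = if offline n then "Y" else "N"

-- | Split a string on a separator character.
splitOn :: Char -> String -> [String]
splitOn sep s = case break (== sep) s of
  (x, [])       -> [x]
  (x, _ : rest) -> x : splitOn sep rest

-- | Parse an Int, rejecting trailing garbage.
readInt :: String -> Maybe Int
readInt s = case reads s of
  [(v, "")] -> Just v
  _         -> Nothing

-- | Parse a line produced by 'serializeNode'.
loadNode :: String -> Maybe Node
loadNode line = case splitOn '|' line of
  [nm, tm, fm, off] -> do
    t <- readInt tm
    f <- readInt fm
    o <- case off of { "Y" -> Just True; "N" -> Just False; _ -> Nothing }
    Just (Node nm t f o)
  _ -> Nothing

-- | The round-trip property: loading a serialized node gives it back.
roundTrips :: Node -> Bool
roundTrips n = loadNode (serializeNode n) == Just n
```

Note that in this sketch a name containing the separator character would break the round trip, so the property only holds for inputs restricted to hostname-like names.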

50811e2c

Switch unittest to custom hostnames · a070c426

Iustin Pop authored 14 years ago

Currently, the hostnames are almost fully arbitrary chars, which breaks
the assumption that nodes/instances will be normal DNS hostnames.

This patch adds some custom generators for these hostnames, that will
allow better testing of text loader serialization/deserialization.

a070c426

Aug 24, 2010
- Move text serialization functions to Text.hs · 3bf75b7d
  Iustin Pop authored 14 years ago
```
Currently these are in hscan, and cannot be reused easily.
```
  3bf75b7d
Jul 29, 2010
- Fix a couple of typos in the manpages · 57ef88df
  Iustin Pop authored 14 years ago
```
Again, thanks to lintian.
```
  57ef88df
Jul 27, 2010
- hail: fix error message for failed multi-evac · 0ca66853
  Iustin Pop authored 14 years ago
```
Currently we show the instance index, but this makes no sense outside
the current running program. Instead, we show the instance name.
```
  0ca66853
- Update NEWS file for the 0.2.6 release · 84edb64b
  Iustin Pop authored 14 years ago
  
  htools-v0.2.6
  
  84edb64b
- NEWS: Add double blank lines before headers · 303bb0ed
  Iustin Pop authored 14 years ago
```
This looks better for text-only viewing…
```
  303bb0ed
Jul 23, 2010
- hscan: return exit code 2 for RAPI failures · f688711c
  Iustin Pop authored 14 years ago
```
If some clusters failed during RAPI collection, exit with exit code 2 so
that tests can detect this failure.
```
  f688711c
- More enhancements to live-test.sh · b7478ce1
  Iustin Pop authored 14 years ago
  
  b7478ce1
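
For the hscan exit-code change above (f688711c), a minimal sketch of the idea follows. The helper names and the list-of-booleans input are hypothetical; only `System.Exit` from the standard library is real.

```haskell
import System.Exit (ExitCode (..), exitWith)

-- | Hypothetical helper: given one flag per cluster (True = collected
-- successfully), choose the process exit code.
exitCodeFor :: [Bool] -> ExitCode
exitCodeFor results
  | and results = ExitSuccess
  | otherwise   = ExitFailure 2  -- at least one RAPI collection failed

-- | Terminate the program with the code chosen above.
finish :: [Bool] -> IO ()
finish = exitWith . exitCodeFor
```

A test script can then distinguish a RAPI collection failure from other errors by checking for exit status 2.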
Jul 22, 2010
- Fix another haddock issue · b8262965
  Iustin Pop authored 14 years ago
  
  b8262965
- Remove an obsolete function and add Utils tests · 691dcd2a
  Iustin Pop authored 14 years ago
  
  691dcd2a
- Extend the live-test · b880f1d1
  Iustin Pop authored 14 years ago
```
The (recently-enabled) live test coverage stats found a few low-hanging
fruits in the tests we do…
```
  b880f1d1
Jul 21, 2010

Use --union for hpc sum · 7e9e8245

Iustin Pop authored 14 years ago

… which fixes the issue noted in the previous commit (almost a brown
paper bag change).

7e9e8245

Preliminary support for coverage during live-test · dc61c50b

Iustin Pop authored 14 years ago

While this doesn't work correctly yet (hpc sum seems to only take common
modules, not the sum of modules?), it prepares for gathering coverage
data during live-test (as an alternative to unittest coverage data).

dc61c50b

Add some more imports to QC.hs · 223dbe53

Iustin Pop authored 14 years ago

This is needed so that in the coverage report we list all modules, even
the ones we don't test at all, such that we get the complete results.

223dbe53

Change the meaning of the N+1 fail metric · c3c7a0c1

Iustin Pop authored 14 years ago

Currently, this metric tracks the nodes failing the N+1 check. While
this helps (in some cases) to evacuate such nodes, it's not a good
metric since it will rarely change during a step (only when the last
instance moves away). Therefore we replace it with the count of
instances living on such nodes, which is much better because:
- moving an instance away while the node is still N+1 failing will still
  reflect in the score as an optimization
- moving the last instance causing an N+1 failure will result in a heavy
  decrease of this score, thus giving the right bonus to clear this
  status
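
The difference between the two metrics can be sketched as follows. The `Node` record here is a hypothetical simplification that stores only an N+1 failure flag and the instance names.

```haskell
-- Hypothetical simplified node: N+1 status plus the instances on it.
data Node = Node
  { failN1    :: Bool
  , instances :: [String]
  }

-- | Old metric: number of nodes failing the N+1 check.
oldMetric :: [Node] -> Int
oldMetric = length . filter failN1

-- | New metric: number of instances living on N+1 failing nodes.
newMetric :: [Node] -> Int
newMetric = sum . map (length . instances) . filter failN1
```

With one failing node holding three instances, moving a single instance away leaves `oldMetric` at 1 but lowers `newMetric` from 3 to 2, so the move registers as an improvement.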

c3c7a0c1

Introduce per-metric weights · 8a3b30ca

Iustin Pop authored 14 years ago

Currently all metrics have the same weight (we just sum them together).
However, for the hard constraints (N+1 failures, offline nodes, etc.)
we should handle the metrics differently based on their meaning. For
example, an instance living on a primary offline node is worse than an
instance having its secondary node offline, which in turn is worse than
an instance having its secondary node failing N+1.

To express this case in our code, we introduce a table of weights for
the metrics, with which we can influence their relative importance.
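
A table of weights of this kind might look like the sketch below. The metric names and the weight values (16, 4, 1) are invented for illustration and are not the values used by htools.

```haskell
-- Hypothetical subset of the hard-constraint metrics.
data Metrics = Metrics
  { priOffline :: Double  -- instances whose primary node is offline
  , secOffline :: Double  -- instances whose secondary node is offline
  , secFailN1  :: Double  -- instances whose secondary node fails N+1
  }

-- | Weight table: each metric is paired with its relative importance.
-- The numbers are made up; only their ordering reflects the text.
metricTable :: [(Double, Metrics -> Double)]
metricTable =
  [ (16, priOffline)
  , (4,  secOffline)
  , (1,  secFailN1)
  ]

-- | Overall score: weighted sum instead of a plain sum.
compScore :: Metrics -> Double
compScore m = sum [w * f m | (w, f) <- metricTable]
```

With these weights, one instance on an offline primary scores worse than one instance with an offline secondary, which in turn scores worse than one with an N+1 failing secondary.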

8a3b30ca

Allow balancing moves to introduce N+1 errors · 2cae47e9

Iustin Pop authored 14 years ago

This patch switches the applyMove function to the extended versions of
Node.addPri and addSec, and passes the override flag based on the state
of the node that we're moving away from.

2cae47e9

Introduce a relaxed add instance mode · 3e3c9393

Iustin Pop authored 14 years ago

In case an instance is living on an offline node, it doesn't make sense
to refuse moving it because that would create N+1 failures; failing N+1
is still much better than not running at all. Similarly, if the
secondary node of an instance is offline, the instance has no
redundancy at all; this is worse than having a secondary that is
failing N+1, which could not accept the instance as primary but
still provides redundancy for it.

To allow this, we rename Node.addPri to addPriEx and introduce an extra
parameter (addPri is a partial application of addPriEx and keeps the
same signature). Node.addSec gets the same treatment.
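
The partial-application pattern can be sketched like this. The `Node` record, the `FailMode` type and the exact checks are simplified assumptions; in particular, keeping the hard free-memory check even in relaxed mode is a choice made for this example.

```haskell
-- Hypothetical simplified node.
data Node = Node
  { fMem  :: Int       -- free memory
  , rMem  :: Int       -- memory reserved for N+1
  , pList :: [String]  -- names of primary instances
  } deriving (Show, Eq)

data FailMode = FailMem | FailN1
  deriving (Show, Eq)

-- | Extended version: the extra Bool relaxes the N+1 check.
addPriEx :: Bool -> Node -> (String, Int) -> Either FailMode Node
addPriEx force node (iname, mem)
  | newFree < 0                      = Left FailMem
  | not force && newFree < rMem node = Left FailN1
  | otherwise =
      Right node { fMem = newFree, pList = iname : pList node }
  where newFree = fMem node - mem

-- | Strict version: a partial application, so existing callers keep
-- the same signature.
addPri :: Node -> (String, Int) -> Either FailMode Node
addPri = addPriEx False
```

Callers that need the relaxed behaviour pass `True` explicitly, while all other call sites continue to use `addPri` unchanged.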

3e3c9393

Jul 19, 2010

Remove obsolete Container.maxNameLen · 2849670b

Iustin Pop authored 14 years ago

This was only used in one place (hbal), and is made obsolete by the change to
the dual name/alias structure.

2849670b

hbal: print short names in steps list · 14c972c7

Iustin Pop authored 14 years ago

This was a regression from the name handling changes, as we started
using the original names for the solution list (which is not designed
for parsing/feeding back into ganeti).

14c972c7

Remove an obsolete function · fb33aaaf
Iustin Pop authored 14 years ago
```
printSolution is no longer used, as we print the solution iteratively
now.
```
fb33aaaf

Jul 18, 2010

Allow '+' in node list fields · 6dfa04fd

Iustin Pop authored 14 years ago

When the field list is prefixed with a plus sign, this will extend the
default field list, instead of replacing it entirely.
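
A minimal sketch of this parsing rule is shown below. The default field list, the comma separator and the function names are assumptions for illustration only.

```haskell
-- | Hypothetical default field list.
defaultFields :: [String]
defaultFields = ["name", "pcnt", "scnt"]

-- | Split a string on a separator character.
splitOn :: Char -> String -> [String]
splitOn sep s = case break (== sep) s of
  (x, [])       -> [x]
  (x, _ : rest) -> x : splitOn sep rest

-- | Parse a comma-separated field list, dropping empty entries.
parseFields :: String -> [String]
parseFields = filter (not . null) . splitOn ','

-- | A leading '+' extends the defaults; otherwise the list replaces them.
selectFields :: String -> [String]
selectFields ('+' : rest) = defaultFields ++ parseFields rest
selectFields spec         = parseFields spec
```

Under these assumptions, `selectFields "+peermap"` keeps the defaults and appends one field, while `selectFields "name,index"` replaces the defaults entirely.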

6dfa04fd

Update the node list fields · 16f08e82

Iustin Pop authored 14 years ago

This patch renames the pri/sec to pcnt/scnt, and adds the real primary
and secondary instance lists, the peermap and the index of a node as
selectable options.

16f08e82

Cleanup a node's peer map when possible · 124b7cd7

Iustin Pop authored 14 years ago

If the last secondary instance of a peer is deleted (detected by the new
peer memory value being equal to zero), then the pair (pdx, 0) should be
deleted completely. This is not an optimization per se, but rather cleanup
(the speedup is at most a percent, and only in some corner cases).
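
The cleanup can be sketched on an association list keyed by peer index. The `PeerMap` type and `setPeer` function below are assumptions for this example, not the actual htools API.

```haskell
-- | Hypothetical peer map: (peer node index, memory used by the
-- secondaries whose primary is that peer).
type PeerMap = [(Int, Int)]

-- | Store the new memory value for a peer. A value of zero means the
-- last secondary for that peer is gone, so the entry is removed
-- instead of keeping a (pdx, 0) pair around.
setPeer :: Int -> Int -> PeerMap -> PeerMap
setPeer pdx 0   pm = filter ((/= pdx) . fst) pm
setPeer pdx mem pm = (pdx, mem) : filter ((/= pdx) . fst) pm
```

Removing the zero entries keeps the map free of stale pairs, which is the cleanup the commit describes.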

124b7cd7

Jul 16, 2010

Fix handling of offline options and short names · f9acea10

Iustin Pop authored 14 years ago


This needs to be abstracted in a separate function, but in the meantime
we fix the issue in both places.

Signed-off-by: Iustin Pop <iustin@google.com>

f9acea10

Jun 21, 2010
- Fix another haddock special-char issue · 95446d7a
  Iustin Pop authored 14 years ago
  
  95446d7a
- Remove JOB_STATUS_GONE and add unittests · db079755
  Iustin Pop authored 14 years ago
```
… for the serialization/deserialization of the job and opcode status.

Job status 'gone' was not actually used. It can be reintroduced if
needed.
```
  db079755
- Add opcode status constants/type · 41065165
  Iustin Pop authored 14 years ago
```
These mirror, again, the Ganeti constants, and are added for future use.
```
  41065165
- Rename the job status constants · 7e98f782
  Iustin Pop authored 14 years ago
```
The rename is done such that we match Ganeti's own constants.
```
  7e98f782
Jun 08, 2010

Optimise the Luxi.recvMsg function · 95f490de

Iustin Pop authored 15 years ago

Since the current buffer cannot contain (during network reads) an EOM,
we should look for the EOM only in the newly-received string.  While
this shouldn't make much difference, in some tests it cuts the recvMsg
total time by around half.

On entering recvMsg, though, we still have to search the old buffer for a
message, since we could have received two Luxi messages on the
last network query; this is however a one-off cost, compared to
continuously looking for the EOM in the old string (at each receive
loop).
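
The idea can be sketched as a pure function over a list of incoming chunks, which stands in for the network reads. The single-character end-of-message marker and all function names here are assumptions for illustration; the real implementation works on a socket.

```haskell
-- | End-of-message marker (assumed to be one character in this sketch).
eom :: Char
eom = '\ETX'

-- | Receive loop. 'buf' is known to contain no EOM, so only each newly
-- received chunk is scanned. Returns the message, the leftover text
-- after the EOM, and the chunks not yet consumed.
recvLoop :: String -> [String] -> Maybe (String, String, [String])
recvLoop _ [] = Nothing
recvLoop buf (chunk : rest) =
  case break (== eom) chunk of
    (_, [])             -> recvLoop (buf ++ chunk) rest
    (pre, _ : leftover) -> Just (buf ++ pre, leftover, rest)

-- | Entry point. The old buffer is scanned once, since a previous read
-- may have delivered more than one message.
recvMsg :: String -> [String] -> Maybe (String, String, [String])
recvMsg old chunks =
  case break (== eom) old of
    (msg, _ : leftover) -> Just (msg, leftover, chunks)
    (_, [])             -> recvLoop old chunks
```

The one-off scan in `recvMsg` covers the case of two messages arriving in one read, and `recvLoop` never rescans data it has already checked.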

95f490de