Commits · 405241dc17d6d9b0663cea14cebd1e691738b58b · itminedu / snf-ganeti

Jul 30, 2010

Test instance NIC and disk parameter names · 405241dc

Michael Hanselmann authored 14 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

405241dc

Add new parameter type “maybe string” · 59525e1f

Michael Hanselmann authored 14 years ago


Before strict checking was implemented, NIC IP addresses could be set
to “None”. Commit bd061c35 added more strict checking, including
enforcing the IP address to be a string. With this new type, it
can again be set to None.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

59525e1f

cmdlib: Change expected type for source CA on remote import · 71ca6309

Michael Hanselmann authored 14 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

71ca6309

move-instance: Pass OS parameters to new instance · a59b0dc4

Michael Hanselmann authored 14 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

a59b0dc4

Update NEWS file for the first release candidate · 529a8d17

Iustin Pop authored 14 years ago


Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

529a8d17

Fix a few job archival issues · aa9f8167

Iustin Pop authored 14 years ago


This patch fixes two issues with job archival. First, the
LoadJobFromDisk can return 'None' for no-such-job, and we shouldn't add
None to the job list; we can't anyway, as this raises an exception:

  node1# gnt-job archive foo
  Unhandled protocol error while talking to the master daemon:
  Caught exception: cannot create weak reference to 'NoneType' object

After fixing this, job archival of missing jobs will just continue
silently, so we modify gnt-job archive to log jobs which were not
archived and to return exit code 1 for any missing jobs.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

aa9f8167

Jul 29, 2010

burning: fix handling of empty job sets · 78bb78b1

Iustin Pop authored 14 years ago


If we call burning with only existing instance, then it will fail to
create any of them, and thus in the removal phase it won't have anything
to remove. Since calling luxi.SUBMIT_MULTIPLE_JOBS with an empty job set
is an error (and will raise an exception), this creates a very strange
error in burnin (which is unfortunately hidden by ExecJobSet()).

As such, we modify CommitQueue to return immediately if it has an empty
op queue.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

78bb78b1

Change semantics of --force-multi for reinstall · 297ddce9

Iustin Pop authored 14 years ago


Currently, we require both --force and --force-multiple for skipping the
confirmation on instance reinstalls. After offline conversations, this
has been deemed to be excessive, and this patch changes the meaning of
--force-multiple to be a “stronger” force, and not require both.

So, to skip the prompts:
- single instance reinstallation requires either --force or
  --force-multiple
- multiple instance reinstallation requires --force-multiple

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

297ddce9

Change handling of non-Ganeti errors in jqueue · 599ee321

Iustin Pop authored 14 years ago


Currently, if a job execution raises a Ganeti-specific error (i.e.
subclass of GenericError), then we encode it as (error class, [error
args]). This matches the RAPI documentation.

However, if we get a non-Ganeti error, then we encode it as simply
str(err), a single string. This means that the opresult field is not
according to the RAPI docs, and thus it's hard to reliably parse the
job results.

This patch changes the encoding of a failed job (via failure) to always
be an OpExecError, so that we always encode it properly. For the command
line interface, the behaviour is the same, as any non-Ganeti errors get
re-encoded as OpExecError anyway. For the RAPI clients, it only means
that we always present the same type for results. The actual error value
is the same, since the err.args is either way str(original_error);
compare the original (doesn't contain the ValueError):

  "opresult": [
    "invalid literal for int(): aa"
  ],

with:

  "opresult": [
    [
      "OpExecError",
      [
        "invalid literal for int(): aa"
      ]
    ]
  ],

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

599ee321

Implement gnt-cluster master-ping · 4404ffad

Iustin Pop authored 14 years ago


This can be used from shell-scripts to quickly check the status of the
master node, before launching a series of jobs (and handling the failure
of the jobs due to masterd other issues).

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

4404ffad

Instance migration: remove error on missing link · b8ebd37b

Iustin Pop authored 14 years ago


Since we don't support upgrades from 1.2.4 without restarting the
instance, the 'not restarted since 1.2.5' check/error is
wrong/misleading.

Since the live migration works anyway without the links (it recreates
them during the disk reconfiguration anyway), we remove the check and we
transform it into a warning (to the node daemon log only,
unfortunately).

For 2.3, we'll need to change the symlink creation from instance start
time to disk activation time (but that requires more RPC changes).

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

b8ebd37b

Add check for RAPI paths to start with /2 · 2c0be3d0

Michael Hanselmann authored 14 years ago


During a discussion in July 2010 it was decided that we'll stabilize on /2. See
message ID <20100716180012.GA9423@google.com> for reference.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

2c0be3d0

Ensure assertions are evaluated in tests · a9b144cb

Michael Hanselmann authored 14 years ago


A lot of assertions are used in Ganeti's code. Some unittests even check
whether AssertionError is raised in some cases. Explicitely ensuring
assertions are evaluated makes sure those tests don't fail and
assertions are checked.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

a9b144cb

RAPI client: The os argument for instance reinstalls is optional · fcee9675
David Knowles authored 14 years ago
```
Signed-off-by: David Knowles <dknowles@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
```
fcee9675

QA: Test instance migration via CLI and RAPI · 938bde86

Michael Hanselmann authored 14 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

938bde86

RAPI client: Support migrating instances · e0ac6ce6

Michael Hanselmann authored 14 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

e0ac6ce6

RAPI: Support migrating instances · 5823e0d2

Michael Hanselmann authored 14 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

5823e0d2

workerpool: Change signature of AddTask function to not use *args · b2e8a4d9

Michael Hanselmann authored 14 years ago


By changing it to a normal parameter, which must be a sequence, we can
start using keyword parameters.

Before this patch all arguments to “AddTask(self, *args)” were passed as
arguments to the worker's “RunTask” method. Priorities, which should be
optional and will be implemented in a future patch, must be passed as a keyword
parameter. This means “*args” can no longer be used as one can't combine *args
and keyword parameters in a clean way:

>>> def f(name=None, *args):
...   print "%r, %r" % (args, name)
...
>>> f("p1", "p2", "p3", name="thename")
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 TypeError: f() got multiple values for keyword argument 'name'

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

b2e8a4d9

workerpool: Add two additional assertions · c1cf1fe5

Michael Hanselmann authored 14 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

c1cf1fe5

workerpool: Additional check in BaseWorker.ShouldTerminate · 2f4e1516

Michael Hanselmann authored 14 years ago


Document that it should only be called from within RunTask and
add an assertion for this. This means we can no longer use a
method on the pool and hence remove WorkerPool.ShouldWorkerTerminate.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

2f4e1516

workerpool: Remove unused worker method · ccedb11b

Michael Hanselmann authored 14 years ago


HasRunningTask is never used except for an assertion, where we
don't really need the lock.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

ccedb11b

workerpool: Move waiting for new tasks for a worker to the pool · 21c5ad52

Michael Hanselmann authored 14 years ago


This way fewer private variables of the pool are accesssed by the worker.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

21c5ad52

workerpool: Use common function to add tasks · 189d2714

Michael Hanselmann authored 14 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

189d2714

Fix install document regarding DRBD usage · 92c1ea55

Iustin Pop authored 14 years ago


This is related to issue 105.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

92c1ea55

Update RAPI documentation for the OS changes · b430a54d

Iustin Pop authored 14 years ago


Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

b430a54d

Jul 28, 2010

Rename masterfailover to master-failover · c28502b1

Iustin Pop authored 14 years ago


Most (all?) of our commands use dash-separator: replace-disks,
verify-disks, add-tags, etc. “gnt-cluster masterfailover” is an old
exception to this rule.

The patch replaces it with master-failover, add a compatiblity alias,
and updates the documentation for this change.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

c28502b1

RAPI: Add os params to instance creation v1 · 130f0966

Iustin Pop authored 14 years ago


Since the RAPI QA suite doesn't seem to offer easy testing of failed
creations, I didn't add this to the QA. Pointers on how to do it are
welcome.

The patch also changes the 'os' argument to be required, since that is
how the LU expects it, and without it we just fail later instead of
directly at submission time.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

130f0966

makefile: fix TAGS building · 74fa8200

Iustin Pop authored 14 years ago


“find .” requires that “-path” arguments start with a dot, otherwise
they are not matches. Additionally, we also include the QA files in the
tags, for easier search while modifying the QA suite.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

74fa8200

Improve handling of lost jobs · 91c622a8

Iustin Pop authored 14 years ago


Currently, if the cli.JobExecutor class is being used, and one of the
jobs is being archived before it can check its result, it will raise a
stracktrace as _ChooseJob is not prepared to handle this case.

This case makes JobExecutor work better with lost jobs (it still reports
them as 'failed', but it doesn't break and returns a proper error
message), and modifies the generic FormatError to report the JobLost
exception properly, instead of as "Unhandled Ganeti Exception".

Since JobExecutor is hard to test properly, I only tested this manually,
via a fake invocation.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

91c622a8

luxi: convert permission errors into exception · 5a1c22fe

Iustin Pop authored 14 years ago


This patch adds handling of permission errors so that we don't show
tracebacks when a non-root user runs a gnt-* command. Since in the
future we'll have different permissions, we need to handle this in RAPI
too.

It also fixes a typo in RAPI error message and the docstrings of LUXI
errors.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

5a1c22fe

cmdlib: Return new name from rename operations · 6a016df9

Michael Hanselmann authored 14 years ago


The new name is then displayed by the clients.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Manuel Franceschini <livewire@google.com>

6a016df9

gnt-instance rename: Fix bug and rename params · 3fe11ba3

Manuel Franceschini authored 14 years ago


This patch fixes a bug when gnt-instance rename was invoked with
--no-name-check. It renames the internal variables to be consistent with
the ones in equivalent instance add code. Furthermore it checks whether
and instance rename is invoked with --no-name-check but without
--no-ip-check and throws an exception if so.

Signed-off-by: Manuel Franceschini <livewire@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

3fe11ba3

Jul 26, 2010

QA: add tests for the reserved lvs feature · 452913ed

Iustin Pop authored 14 years ago


Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

452913ed

Add modification of the reserved logical volumes · f38ea602

Iustin Pop authored 14 years ago


This doesn't allow addition/removal of individual volumes, only
wholesale replace of the entire list. It can be improved later, if we
ever get generic container parameters.

The man page changes replaces some tabs with spaces (hence the
whitespace changes).

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

f38ea602

Add printing of reserved_lvs in cluster info · 5a3ab484

Iustin Pop authored 15 years ago


Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

5a3ab484

Introuce a new cluster parameter - reserved_lvs · 999b183c

Iustin Pop authored 15 years ago


This parameter, which is a list of regular expression patterns, will
make cluster verify ignore any such LVs. It will not prevent creation or
removal of such volumes by the backend code.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

999b183c

Change the meaning of call_node_start_master · 91492e57

Iustin Pop authored 14 years ago


Currently, backend.StartMaster (the function behind this RPC call) will
activate the master IP and then, if the start_daemons parameter is true,
it will also activate the master role.

While this works, it has two issues:

- first, it will activate the master IP unconditionally, even if this
  node will not start the master daemon due to missing votes
- second, the activation of the IP is done twice if start_daemons is
  true, because the master daemon does its own activation too

This behaviour seems to be unmodified since Summer 2008, so probably any
rationale on why this is done in two places is forgotten.

The patch changes so that this function does *either* IP activation or
master role activation but not both. So the IP will be activated only
once (from the master daemon or from LURenameCluster), and it will only
be done if the masterd got enough votes for startup.

I can see only one downside to this change: if masterd won't actually
start (due to missing votes), RAPI will still start, and without the
master IP activated. But this is no worse than before, when both RAPI
was running and the IP was activated.

Note that the behaviour of StopMaster remains the same, as noone else
does the IP removal.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

91492e57

masterd: move the IP activation from Exec to Check · 340f4757

Iustin Pop authored 14 years ago


Currently, the master IP activation is done in the Exec function. Since
the original masterd process returns after forking, and Exec is run in
the (grand)child process, this means that after 'ganeti-masterd' has
returned there are still initialization tasks running.

Normally this is not a problem, but in cases where one does quick master
failovers, this creates a race condition which hits the QA scripts
especially hard.

To solve this, and make the startup process cleaner (the system is in
steady state after the command has returned, even though masterd startup
could still fail), we move the IP activation to Check(). This also
allows error messages about the IP activation to be seen on the console.

With this patch enabled, I can no longer reproduce the double-failover
errors, which were occuring before in 4/5 cases.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

340f4757

Move the UsesRPC decorator from cli to rpc · e0e916fe

Iustin Pop authored 14 years ago


This is needed because not just the cli scripts need this decorator, but
the master daemon too (and it already duplicated the code once).

In cli.py we just leave a stub, so that we don't have to modify all the
scripts to import rpc.py.

We then change the master daemon code to reuse this decorator, instead
of duplicating it.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

e0e916fe

watcher: smarter handling of instance records · f5116c87

Iustin Pop authored 14 years ago

This patch implements a few changes to the instance handling. First, old
instances which no longer exist on the cluster are removed from the
state file, to keep things clean.

Second, the instance restart counters are reset every 8 hours, since
some error cases might be transient (e.g. networking issues, or machine
temporarily down), and if the problem takes more than 5 restarts but is
not permanent, watcher will not restart the instance. The value of 8
hours is, I think, both conservative (as not to hammer the cluster too
often with restarts) and fast enough to clear semi-transient problems.

And last, if an instance is not restarted due to exhausted retries, this
should be warned, otherwise it's hard to understand why watcher doesn't
want to restart an ERROR_down instance.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

f5116c87