Commits · 4a8b186acb4e26ee6ea0fa4246d708411e4a772a · itminedu / snf-ganeti

Oct 01, 2008

Fix the watcher with down nodes · 37b77b18

Iustin Pop authored 16 years ago

The watcher didn't handle down nodes. Fix this by ignoring, in the
secondary node reboot checks, any node that doesn't return a boot ID.

Reviewed-by: imsnah
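
The fix above can be pictured with a minimal sketch. It is not the watcher's actual code: it assumes the watcher keeps a map of node name to boot ID between runs and that a down node is reported with a boot ID of `None`. `find_rebooted_nodes` is a hypothetical name.

```python
from typing import Dict, Optional, Set


def find_rebooted_nodes(saved_boot_ids: Dict[str, str],
                        current_boot_ids: Dict[str, Optional[str]]) -> Set[str]:
    """Return nodes whose boot ID changed since the last watcher run.

    A node that is down reports no boot ID (None here); it is skipped
    rather than being mistaken for a rebooted node.
    """
    rebooted = set()
    for node, new_id in current_boot_ids.items():
        if new_id is None:
            continue  # down node: ignore it in the reboot check
        old_id = saved_boot_ids.get(node)
        if old_id is not None and old_id != new_id:
            rebooted.add(node)
    return rebooted


saved = {"node1": "boot-a", "node2": "boot-b", "node3": "boot-c"}
current = {"node1": "boot-a", "node2": "boot-x", "node3": None}
print(find_rebooted_nodes(saved, current))  # {'node2'}
```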

Fix the watcher not restarting instance bug · b7309a0d

Iustin Pop authored 16 years ago

The watcher was using conflicting attributes of the instance:
  - it queried the admin_/oper_state, which are booleans
  - but it compared those to the status (which is a text field)

The code was changed to query the aggregated 'status' field instead,
since it also reports node problems, so this single field can be used
for all decisions. We still ask for the admin_state field, as it is
needed for the activate disks check (in secondary node restart).

The patch also touches the watcher in some other parts:
  - log exceptions more cleanly
  - convert a method to @staticmethod
  - remove unused imports

Reviewed-by: imsnah
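
A sketch of deciding from the single aggregated field. The status strings are assumptions modelled on Ganeti 1.2-era values (`running`, `ERROR_down`, `ERROR_nodedown`, `ADMIN_down`), and `watcher_action` is a hypothetical helper. The last line shows why mixing the two attribute kinds goes wrong: in Python a boolean never compares equal to a status string, so an equality test between them is always false.

```python
# Assumed status values, modelled on Ganeti 1.2-era naming.
BAD_STATES = frozenset(["ERROR_down"])           # should run, but is stopped
HELPLESS_STATES = frozenset(["ERROR_nodedown"])  # primary node unreachable


def watcher_action(status: str) -> str:
    """Pick an action using only the aggregated 'status' field."""
    if status in HELPLESS_STATES:
        return "skip"     # node problem: a restart cannot help yet
    if status in BAD_STATES:
        return "restart"
    return "ok"           # e.g. "running" or "ADMIN_down"


print(watcher_action("ERROR_down"))  # restart
# The old bug in miniature: a boolean never equals a status string.
print(True == "ERROR_down")  # False
```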

Remove last use of utils.RunCmd from the watcher · 5188ab37

Iustin Pop authored 16 years ago

The watcher had one remaining use of the ganeti command-line tools
instead of sending requests via luxi. This patch changes it to use the
cli functions.

The patch also has two other changes:
  - fix the docstring for OpVerifyDisks (noticed while converting this
    code)
  - enable stderr logging in the watcher when “-d” is passed

Reviewed-by: imsnah

Sep 09, 2008

ganeti-noded: Add constant for queue lock timeout · 8785cb30
Michael Hanselmann authored 16 years ago
```
Reviewed-by: iustinp
```

Implement master startup safety check · 36205981

Iustin Pop authored 16 years ago

This is an initial version of the master startup checks. It is a very
rudimentary change; however, in normal usage (an old master is started
while the rest of the cluster is functioning normally) it will succeed in
preventing incorrect startups.

Reviewed-by: imsnah
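
The commit does not spell out its algorithm, so the following is only a sketch of one plausible agreement rule: every other node reports which node it believes is the master (or `None` if it did not answer), and startup is allowed only with a strict majority of the answers. `master_agreement` and the "allow if nobody answered" choice are assumptions for illustration.

```python
from collections import Counter
from typing import Iterable, Optional


def master_agreement(myself: str, answers: Iterable[Optional[str]]) -> bool:
    """Return True if the answering nodes agree that `myself` is the master.

    `answers` has one entry per queried node: the master name that node
    reports, or None if it did not answer.
    """
    votes = Counter(a for a in answers if a is not None)
    total = sum(votes.values())
    if total == 0:
        # Nobody answered; this sketch allows startup, but a real check
        # might retry or refuse instead.
        return True
    for_me = votes[myself]  # Counter returns 0 for missing keys
    # Require a strict majority of the nodes that answered.
    return for_me > total - for_me


# The old master "node1" is started after a failover to "node2":
print(master_agreement("node1", ["node2", "node2", None]))     # False
print(master_agreement("node2", ["node2", "node2", "node1"]))  # True
```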

Export backend.GetMasterInfo over the rpc layer · 4e071d3b

Iustin Pop authored 16 years ago

We create a multi-node call so that querying all nodes for agreement
will be fast.

Reviewed-by: imsnah
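
Querying nodes one after another makes the total latency the sum of all round trips; issuing the calls concurrently makes it roughly that of the slowest node. A sketch with `concurrent.futures`, where `query_node` stands in for the real per-node RPC (a hypothetical placeholder, not Ganeti's rpc module) and a failed call is recorded as `None`:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional


def multi_node_call(nodes: List[str],
                    query_node: Callable[[str], object],
                    max_workers: int = 8) -> Dict[str, Optional[object]]:
    """Call `query_node` on every node concurrently.

    Returns a dict mapping each node to its result, or None if the call
    raised (for example because the node was unreachable).
    """
    def safe_call(node: str) -> Optional[object]:
        try:
            return query_node(node)
        except Exception:
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(nodes, pool.map(safe_call, nodes)))


def fake_query(node: str) -> str:
    if node == "node3":
        raise ConnectionError("unreachable")
    return "master=node1"


print(multi_node_call(["node1", "node2", "node3"], fake_query))
# {'node1': 'master=node1', 'node2': 'master=node1', 'node3': None}
```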

Use lock timeout for queue updates in ganeti-noded · 506cff12
Michael Hanselmann authored 16 years ago
```
This helps to prevent complete deadlocks.

Reviewed-by: iustinp
```
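
If the node daemon handles requests in several processes (an assumption here), the queue lock is naturally a file lock. `flock()` takes no timeout argument, so a bounded wait can be sketched by polling with `LOCK_NB`. This illustrates the idea only; `lock_exclusive` and `LockTimeoutError` are hypothetical names, not ganeti-noded's code.

```python
import errno
import fcntl
import time


class LockTimeoutError(Exception):
    """Raised when the lock is still held by someone else after the timeout."""


def lock_exclusive(fd: int, timeout: float, interval: float = 0.05) -> None:
    """Take an exclusive flock() on fd, giving up after `timeout` seconds.

    Polling with LOCK_NB instead of blocking forever means a stuck lock
    holder produces an error the caller can report, rather than a hang.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return
        except OSError as err:
            if err.errno not in (errno.EAGAIN, errno.EWOULDBLOCK, errno.EACCES):
                raise  # a real error, not "lock busy"
        if time.monotonic() >= deadline:
            raise LockTimeoutError("timed out after %.1f s" % timeout)
        time.sleep(interval)
```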

Sep 05, 2008
- noded: Get job queue lock while purging queue content · f1f3f45c
  Michael Hanselmann authored 16 years ago
```
Only one process should modify the queue at the same time.

Reviewed-by: iustinp
```
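
Wrapping every queue-modifying function in acquire/release by hand is easy to get wrong, which is the motivation for a decorator such as the one added in 7f30777b further down. A generic sketch: `requires_lock`, `queue_lock` and the stand-in `jobqueue_purge` body are hypothetical, and any object with the same `acquire()`/`release()` interface (for example a file-based lock, to serialize separate processes) could replace the `threading.Lock`.

```python
import functools
import threading
from typing import Any, Callable


def requires_lock(lock: Any) -> Callable:
    """Decorator factory: run the wrapped function while holding `lock`."""
    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            lock.acquire()
            try:
                return fn(*args, **kwargs)
            finally:
                lock.release()  # released even if fn raises
        return wrapper
    return decorator


queue_lock = threading.Lock()  # hypothetical module-level queue lock


@requires_lock(queue_lock)
def jobqueue_purge() -> bool:
    # Stand-in body: a real purge would delete the queue files here.
    return queue_lock.locked()


print(jobqueue_purge())     # True: the lock was held during the call
print(queue_lock.locked())  # False: released afterwards
```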
Aug 29, 2008

Make WaitForJobChanges deal with long jobs · 5c735209

Iustin Pop authored 16 years ago

This patch alters the WaitForJobChanges luxi-RPC call to have a
configurable timeout, so that the call behaves nicely with long jobs
that have no update.

We do this by adding a timeout parameter to the RPC call and returning
a special constant when the timeout is reached without an update. The
luxi client repeatedly calls WaitForJobChanges until it gets a real
change. The timeout is hardcoded as half the RWTO (read/write timeout)
value.

The patch also removes an unused variable (new_state) from the
WaitForJobChanges method.

Reviewed-by: imsnah,ultrotter
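
The timeout-plus-sentinel pattern can be sketched with `threading.Condition`. Presumably the server-side timeout is kept below the client's socket timeout (half of RWTO in the commit) so that a long, quiet job produces a harmless "no change" reply instead of a client-side timeout error. `JOB_NOTCHANGED`, `JobState` and `client_wait` are hypothetical stand-ins for the luxi call, not its real interface.

```python
import threading
from typing import Any

# Hypothetical sentinel: "timeout reached, nothing changed".
JOB_NOTCHANGED = object()


class JobState:
    """Server side: a job whose status clients can wait on."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._status = "queued"

    def set_status(self, status: str) -> None:
        with self._cond:
            self._status = status
            self._cond.notify_all()

    def wait_for_change(self, prev_status: str, timeout: float) -> Any:
        """Return the new status, or JOB_NOTCHANGED after `timeout` seconds."""
        with self._cond:
            changed = self._cond.wait_for(
                lambda: self._status != prev_status, timeout=timeout)
            return self._status if changed else JOB_NOTCHANGED


def client_wait(job: JobState, prev_status: str, timeout: float) -> str:
    """Client side: re-issue the call until a real change arrives."""
    while True:
        result = job.wait_for_change(prev_status, timeout)
        if result is not JOB_NOTCHANGED:
            return result
```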

Aug 27, 2008

Make sure that client programs get all messages · 6c5a7090

Michael Hanselmann authored 16 years ago

This is a large patch, but I can't figure out how to split it without
breaking stuff. The old way of getting messages by always getting the
last one didn't bring all messages to the client if they were added
too fast, thereby making commands like “gnt-cluster verify” less than
useful. These changes introduce a serial number per log entry to keep
track of which messages a client has already received. They also
remove the per-opcode log lock to make reading log entries thread
safe.

Reviewed-by: ultrotter
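
The serial-number idea can be sketched as an append-only log: a client that remembers the last serial it saw asks for everything newer, so entries appended between two polls are never skipped. `JobLog` is hypothetical, and the single lock here is a simplification of this sketch, not a claim about how the patch handles locking.

```python
import threading
from typing import List, Tuple

LogEntry = Tuple[int, str]  # (serial, message)


class JobLog:
    """Append-only job log; every entry gets an increasing serial number."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._entries: List[LogEntry] = []
        self._next_serial = 1

    def append(self, message: str) -> int:
        with self._lock:
            serial = self._next_serial
            self._next_serial += 1
            self._entries.append((serial, message))
            return serial

    def entries_since(self, last_seen: int) -> List[LogEntry]:
        """Return all entries newer than `last_seen`, oldest first."""
        with self._lock:
            return [entry for entry in self._entries if entry[0] > last_seen]


log = JobLog()
for msg in ("checking nodes", "checking instances", "done"):
    log.append(msg)            # added faster than the client polls
batch = log.entries_since(0)   # the client still receives all three
last_seen = batch[-1][0]
log.append("late message")
print(log.entries_since(last_seen))  # [(4, 'late message')]
```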

Aug 18, 2008

Use Linux-specific way to name master socket · 9894ece7

Michael Hanselmann authored 16 years ago

By using this Linux-specific way we don't have to care about removing the
socket file when quitting or starting (after an unclean shutdown). For a
more detailed description, see the comment in the patch.

Reviewed-by: schreiberal
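
The Linux-specific mechanism is the abstract socket namespace: an `AF_UNIX` address whose first byte is NUL creates no file, and the name disappears once every socket using it is closed, so there is nothing to remove after an unclean shutdown. A minimal Python sketch; the socket name is a hypothetical example, not the one Ganeti uses.

```python
import os
import socket


def bind_abstract(name: str) -> socket.socket:
    """Listen on a Unix socket in Linux's abstract namespace.

    The leading NUL byte selects the abstract namespace, so no file is
    created and nothing has to be unlinked on start-up or exit.
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.bind("\0" + name)
    sock.listen(5)
    return sock


name = "example-master-%d" % os.getpid()  # hypothetical socket name
server = bind_abstract(name)
client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
client.connect("\0" + name)
conn, _ = server.accept()
client.sendall(b"ping")
print(conn.recv(4))  # b'ping'
for s in (conn, client, server):
    s.close()
```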

Aug 11, 2008

Add RPC call to wait for job changes · dfe57c22

Michael Hanselmann authored 16 years ago

This way clients can react faster to status or message changes and
don't have to poll anymore.

Reviewed-by: ultrotter

Aug 08, 2008
- Add query function for exports · 32f93223
  Michael Hanselmann authored 16 years ago
```
Reviewed-by: iustinp
```
- noded: Add RPC function to rename job queue files · af5ebcb1
  Michael Hanselmann authored 16 years ago
```
This will be used to archive jobs.

Reviewed-by: iustinp
```
- noded: Add decorator for job queue lock · 7f30777b
  Michael Hanselmann authored 16 years ago
```
The lock will also be needed by another function.

Reviewed-by: iustinp
```
- Implement queue locking in node daemon · 25d6d12a
  Michael Hanselmann authored 16 years ago
```
Reviewed-by: iustinp
```
- More logging for errors during noded RPC calls · aa9075c5
  Michael Hanselmann authored 16 years ago
```
Reviewed-by: iustinp
```
- Add job queue RPC functions · ca52cdeb
  Michael Hanselmann authored 16 years ago
```
jobqueue_update: Uploads a job queue file's content to a node. The
most common operation is to upload something that we already have
in a string. Unlike with the upload_file function, the file is not
read again when distributing changes; instead, the content has to be
passed as a string.

jobqueue_purge: Removes all queue related files from a node.

Reviewed-by: iustinp
```
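
A sketch of what the receiving side of jobqueue_update might do, assuming it writes the received string atomically (a temporary file in the same directory, then `os.replace` over the target) so readers never see a half-written queue file. The function mirrors the RPC name above, but its body is hypothetical.

```python
import os
import tempfile


def jobqueue_update(file_name: str, content: str) -> None:
    """Atomically replace `file_name` with `content`.

    The content arrives as a string in the RPC, so nothing has to be
    re-read from disk; the node only persists it.
    """
    directory = os.path.dirname(os.path.abspath(file_name))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(content)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, file_name)  # atomic within one POSIX filesystem
    except BaseException:
        os.unlink(tmp_path)
        raise
```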
Aug 07, 2008
- Use API instead of command line utilities in watcher · e125c67c
  Michael Hanselmann authored 16 years ago
```
Reviewed-by: iustinp
```
Aug 06, 2008

Notify job queue about added/removed nodes · c36176cc

Michael Hanselmann authored 16 years ago

The job queue maintains its own node list and must be notified
when nodes are added/removed.

Reviewed-by: iustinp

Implement {Add,Readd,Remove}Node in GanetiContext · d8470559

Michael Hanselmann authored 16 years ago

This gives us a central place that coordinates what needs to be
done when adding or removing nodes. Another patch will add calls into
the job queue.

Two log messages move to config.py.

When removing a node, node_leave_cluster is now called after it has
been removed from the configuration and job manager. That way we're
sure not to access the node again after files have been removed.

Reviewed-by: iustinp
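
The coordination described above, together with the job queue notification in c36176cc, can be sketched as a context object that owns both the configuration and the job queue. `Config` and `JobQueue` here are toy stand-ins and the method names are hypothetical; the point is the removal order, where the node leaves the configuration and the queue's node list before it is told to leave the cluster.

```python
from typing import Callable, Set


class Config:
    """Toy stand-in for the cluster configuration."""

    def __init__(self) -> None:
        self.nodes: Set[str] = set()


class JobQueue:
    """Toy job queue that keeps its own node list for replication."""

    def __init__(self) -> None:
        self.nodes: Set[str] = set()


class GanetiContext:
    """Single place that coordinates adding and removing nodes."""

    def __init__(self, leave_cluster: Callable[[str], None]) -> None:
        self.cfg = Config()
        self.jobqueue = JobQueue()
        self._leave_cluster = leave_cluster  # e.g. an RPC to the node

    def add_node(self, node: str) -> None:
        self.cfg.nodes.add(node)
        self.jobqueue.nodes.add(node)  # the queue must learn about new nodes

    def remove_node(self, node: str) -> None:
        # Forget the node everywhere first so nothing contacts it again...
        self.cfg.nodes.discard(node)
        self.jobqueue.nodes.discard(node)
        # ...then ask it to clean up its files and leave the cluster.
        self._leave_cluster(node)
```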

jqueue: Don't pass the list of nodes to SubmitJob anymore · 4c848b18

Michael Hanselmann authored 16 years ago

The job queue now maintains its own list and is updated when
nodes are added to or removed from the cluster.

Reviewed-by: iustinp

masterd: Move job queue into context object · 9113300d

Michael Hanselmann authored 16 years ago

The job queue must be called from cmdlib when adding nodes to or
removing nodes from the cluster. Moving it into the context object
makes this possible.

Reviewed-by: iustinp

Implement query for nodes · 02f7fe54
Michael Hanselmann authored 16 years ago
```
Reviewed-by: iustinp
```

Implement query for instances · ee6c7b94

Michael Hanselmann authored 16 years ago

Queries don't create jobs and are more efficient. Log messages
are not yet stored anywhere.

Reviewed-by: iustinp

Jul 31, 2008

First write operation (add tag) for Ganeti RAPI · 441e7cfd

Oleksiy Mishchenko authored 16 years ago

Add instance tag handling and improved error logging.
...oh, yes, adapt instance listing for RAPI2!

Reviewed-by: iustinp

Jul 30, 2008

Unify SetupDaemon/SetupLogging · 59f187eb

Iustin Pop authored 16 years ago

The 'old-style' info, error and debug logs do not make much sense. This
patch unifies the SetupLogging and SetupDaemon functions. As a result,
all commands log to a 'commands.log' file.

The patch also changes the log setup to keep going if there's an error
in setting up the file logging but we're logging to stderr.

Also, burnin now logs to its own file (burnin.log).

Reviewed-by: ultrotter
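
A sketch of a unified setup with Python's standard `logging` module: one function configures a log file plus optional stderr output, and a failure to open the file is tolerated only when stderr logging is enabled. The signature is an assumption and is not Ganeti's `SetupLogging`.

```python
import logging
import sys


def setup_logging(logfile: str, debug: bool = False,
                  stderr_logging: bool = False) -> logging.Logger:
    """Configure the root logger: one log file plus optional stderr."""
    root = logging.getLogger()
    root.setLevel(logging.DEBUG if debug else logging.INFO)
    formatter = logging.Formatter("%(asctime)s: %(levelname)s %(message)s")

    if stderr_logging:
        stderr_handler = logging.StreamHandler(sys.stderr)
        stderr_handler.setFormatter(formatter)
        root.addHandler(stderr_handler)

    try:
        file_handler = logging.FileHandler(logfile)
    except OSError:
        if not stderr_logging:
            raise  # nowhere to log at all: fail loudly
        # stderr still works, so report the problem and keep going.
        root.exception("Cannot open log file %s, logging to stderr only",
                       logfile)
    else:
        file_handler.setFormatter(formatter)
        root.addHandler(file_handler)
    return root
```

A caller such as a command-line tool would pass its own file name (for example one ending in `commands.log`, or `burnin.log` for burnin).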

Rework master startup/shutdown/failover · b1b6ea87

Iustin Pop authored 16 years ago

This (big) patch reworks the master startup/shutdown and fixes the
master failover.

What does the patch do?

For master start/stop:
  - remove the old ganeti-master script and its associated man page
  - move the IP start/stop directly into backend.(Start|Stop)Master
  - add start/stop of the master and rapi daemons to these functions,
    selectively based on the start/stop arguments
  - make the master call StartMaster(start_daemons=False) via RPC on
    the local node so that the master IP is started
  - and finally change the example init.d script to directly start and
    stop all three daemons, since they do the right thing (depending on
    master/non-master role)

For master failover:
  - move the code from LUMasterFailover into bootstrap.MasterFailover,
    since we need to start/stop the master during this operation and
    thus it can't be executed from the master
  - remove LUMasterFailover and its associated opcode

Notes: Ubuntu's /etc/lsb-base-logging.sh is dumb, so the 'not master'
messages are not seen during startup on non-master nodes.

Reviewed-by: ultrotter

Implement checking for the master role in rapi · 5675cd1f

Iustin Pop authored 16 years ago

This patch moves the CheckMaster function from ganeti-masterd to ssconf
(the most logical place; it cannot go in utils since that would create
circular imports between ssconf and utils) and changes ganeti-rapi to
also call this function.

This is needed so that starting ganeti-rapi on a non-master node does
the right thing.

Reviewed-by: ultrotter

Add a new parameter to backend.(Start|Stop)Master · 1c65840b

Iustin Pop authored 16 years ago

This patch adds a new parameter, unused for now, to the start and stop
master operations in backend. The idea behind it is that we need to be
able to control whether the IP (de)activation is coupled with daemon
startup/shutdown.

The callers are also modified to pass this parameter (even if unused for
now).

Reviewed-by: ultrotter

Jul 29, 2008
- Use constants for the pid file stems · 99e88451
  Iustin Pop authored 16 years ago
```
Reviewed-by: imsnah
```
- Make the rapi daemon create a pidfile · f71245a0
  Iustin Pop authored 16 years ago
```
This is needed for controlling it cleanly with start-stop-daemon.

Reviewed-by: ultrotter
```
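
A sketch of pid file handling at daemon startup and exit, so that a tool such as `start-stop-daemon --pidfile` can find the process. The helper names are hypothetical; the check-then-write approach refuses to overwrite a pid file owned by a live process but does not guard against two instances starting at the same instant.

```python
import os


def read_pid(path: str) -> int:
    """Return the PID stored in `path`, or 0 if missing or unreadable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return 0


def is_process_alive(pid: int) -> bool:
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0: existence/permission check only
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but owned by another user
    return True


def write_pidfile(path: str) -> None:
    """Write our PID to `path`, refusing if a live process already owns it."""
    old = read_pid(path)
    if is_process_alive(old):
        raise RuntimeError("%s contains live process %d" % (path, old))
    with open(path, "w") as f:
        f.write("%d\n" % os.getpid())


def remove_pidfile(path: str) -> None:
    """Remove the pid file on shutdown; a missing file is not an error."""
    try:
        os.unlink(path)
    except FileNotFoundError:
        pass
```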
Jul 28, 2008

Implement signal handling in ganeti-rapi · cfe3c70f
Michael Hanselmann authored 16 years ago
```
Reviewed-by: iustinp
```

Move ganeti-rapi core code to daemon · 3cd62121

Michael Hanselmann authored 16 years ago

All other daemons have their main code in themselves and not in a module.
This patch does the same to ganeti-rapi by moving the code from
lib/rapi/RESTHTTPServer.py to daemons/ganeti-rapi.

Reviewed-by: iustinp

Jul 24, 2008
- Fix RPC parameters for {Cancel,Archive}Job · 3a2c7775
  Michael Hanselmann authored 16 years ago
```
They aren't tuples on the client side.

Reviewed-by: iustinp
```
Jul 23, 2008

ganeti-masterd: write and remove pidfile · 8feda3ad
Guido Trotter authored 16 years ago
```
Reviewed-by: iustinp
```
ganeti-noded: write and remove pid file · 73d927a2
Guido Trotter authored 16 years ago
```
Reviewed-by: iustinp
```

Distribute the queue serial file after each update · c3f0a12f

Iustin Pop authored 16 years ago

This patch adds distribution of the queue serial file after each write
to it (but before a new job is created and written with that ID, and
before a response is returned, so we should be safe from crashes in
between).

Currently it only logs if a node cannot be contacted; it should abort if
more than 50% of the nodes return errors.

Reviewed-by: imsnah
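
The commit only logs failures for now and names the abort rule as future work. A sketch of that proposed rule, where `upload` is a hypothetical per-node transfer function returning `True` on success; exactly half the nodes failing is still tolerated, since the threshold is "more than 50%".

```python
import logging
from typing import Callable, Iterable


class DistributionError(Exception):
    """Raised when too many nodes failed to receive the file."""


def distribute_file(nodes: Iterable[str], upload: Callable[[str], bool]) -> int:
    """Upload to every node; abort if more than 50% of the uploads fail.

    Returns the number of failed nodes when the update is accepted.
    """
    nodes = list(nodes)
    failed = [node for node in nodes if not upload(node)]
    for node in failed:
        logging.error("Could not update serial file on node %s", node)
    if nodes and 2 * len(failed) > len(nodes):
        raise DistributionError(
            "%d of %d nodes failed" % (len(failed), len(nodes)))
    return len(failed)
```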

Jul 21, 2008
- Handle signals in node daemon · 84b58db2
  Michael Hanselmann authored 16 years ago
```
This also fixes a TODO added by ultrotter by killing the parent
process when QuitGanetiException is raised.

Reviewed-by: ultrotter
```
- Use new signal handler class in master daemon · 610bc9ee
  Michael Hanselmann authored 16 years ago
```
Reviewed-by: ultrotter
```
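
A sketch of a reusable signal-handler object in the spirit of these two commits: it installs a handler that only records that a signal arrived, and `reset()` restores the previous handlers. The class name and interface are hypothetical and not taken from Ganeti's own implementation.

```python
import signal
from typing import Dict, Iterable


class SignalHandler:
    """Record that one of the given signals arrived, until reset()."""

    def __init__(self, signums: Iterable[int]) -> None:
        self.called = False
        self._previous: Dict[int, object] = {}
        for signum in signums:
            self._previous[signum] = signal.signal(signum, self._handle)

    def _handle(self, signum: int, frame: object) -> None:
        # Keep the handler minimal: just set a flag for the main loop.
        self.called = True

    def reset(self) -> None:
        """Restore the handlers that were active before this object."""
        for signum, handler in self._previous.items():
            # signal.signal() returns None for handlers not set from Python;
            # fall back to the default action in that case.
            signal.signal(signum, signal.SIG_DFL if handler is None else handler)
        self._previous.clear()
```

A daemon would create one handler for SIGTERM and SIGINT, loop while `handler.called` is false, and call `reset()` in a `finally` block before exiting.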