- Jul 29, 2009
Guido Trotter authored
When the parameter is set to True and start_daemons is also True, ganeti-masterd will be started with the new --no-voting --yes-do-it options. The new parameter is set to True only on masterfailover, when no_voting is used. This changes the behavior from 2.0, where we didn't start the master daemon at all when this option was used. The manpage is also updated to remove the 2.0-only change.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
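A rough sketch of the mechanism described above (the helper name and command construction are illustrative only, not the actual ganeti-noded code): the extra flags are only added when both no_voting and start_daemons are true.

    import subprocess

    def StartMasterDaemon(no_voting, start_daemons):
        # Hypothetical helper mirroring the parameters described above.
        if not start_daemons:
            return
        cmd = ["ganeti-masterd"]
        if no_voting:
            # Skip the voting protocol, but require the explicit safety flag.
            cmd.extend(["--no-voting", "--yes-do-it"])
        subprocess.call(cmd)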
- Jul 08, 2009
Guido Trotter authored
This will be used by ganeti-noded to start ganeti-masterd in a --no-voting masterfailover.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
- May 04, 2009
Iustin Pop authored
Currently, lib/luxi.py uses lib/serializer.py for encoding/decoding messages, but the master daemon uses the simplejson module directly. This is wrong, as any non-trivial change to serializer.py would break the master daemon. The patch changes masterd to use exactly the same functions as luxi.py for encoding/decoding messages.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
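The gist is that both ends of the luxi protocol should go through one encode/decode pair; a minimal sketch of that pattern (wrapper names assumed, not necessarily the real serializer API):

    # Shared wrappers used by both luxi.py and masterd (names assumed).
    try:
        import simplejson as json
    except ImportError:
        import json

    def DumpJson(data):
        return json.dumps(data)

    def LoadJson(txt):
        return json.loads(txt)

    # With both the client and the master daemon calling DumpJson/LoadJson,
    # a change to the wire encoding only has to be made in one place.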
- Apr 06, 2009
Iustin Pop authored
This patch raises an error in the master daemon in case the user requests a locking query; accordingly, all clients were modified to send only lockless queries. This is a short-term fix; the proper fix is to modify the clients to submit a job when the user requests a locking query. The other approach would be to ignore the flag passed by the client; this would be worse, as clients wouldn't even get an error. The possible impact of this is twofold:
- some commands could have been missed in the conversion, and thus fail; this can be remedied easily
- the consistency of commands is lost; e.g. node failover will not lock the node *while we get the node info*, so we could miss some data; this is again related to the atomic operations which are missing in the current query-and-act model of the gnt-* scripts
Reviewed-by: imsnah, ultrotter
Iustin Pop authored
This patch will log data about queries, which are today completely invisible (at the default log level) in the master log file. Reviewed-by: imsnah
- Feb 27, 2009
Guido Trotter authored
Some hypervisors (KVM) need RUN_GANETI_DIR to exist even at cluster init time. This patch creates it in InitCluster just before hv parameter checking. Since the code to create this list of directories is already repeated twice and this would be the third instance, we abstract it into a utils.EnsureDirs function and call that from ganeti-noded, ganeti-masterd and bootstrap. Reviewed-by: iustinp
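A minimal sketch of what such a helper could look like (the real utils.EnsureDirs signature may differ):

    import errno
    import os

    def EnsureDirs(dirs):
        """Make sure each (path, mode) directory in the list exists."""
        for path, mode in dirs:
            try:
                os.mkdir(path, mode)
            except OSError as err:
                if err.errno != errno.EEXIST:
                    raise
            os.chmod(path, mode)  # enforce the mode even if it already existed

Callers (ganeti-noded, ganeti-masterd, bootstrap) would then pass their own list of (path, mode) tuples instead of repeating the loop.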
- Feb 12, 2009
Iustin Pop authored
This patch introduces a 'force' mode for the master daemon startup where the voting process is not done, but the user has to manually confirm the startup (before forking, of course). Reviewed-by: imsnah
- Feb 04, 2009
Iustin Pop authored
This is the last query that RAPI executes via opcodes and is purely static (config values only). As such, we can safely convert it to a query instead of a job. Reviewed-by: imsnah
Iustin Pop authored
This patch adds the framework for, and enables, lockless OpQueryInstances. This means that instances may be shown in ERROR_up or ERROR_down state even though this is not an error (but just an in-progress job). The framework is implemented as follows:
- the OpQueryInstances, OpQueryNodes and OpQueryExports opcodes take an additional “use_locking” flag which denotes whether to lock or not; this patch only implements this for LUQueryInstances
- the luxi query functions take an additional use_locking argument, which is passed to the master daemon and then on to the above opcodes
- cli.py exports a new SYNC_OPT command line option which implements setting this flag to true
- except for gnt-instance list, which uses this option, and for name-only queries (e.g. QueryNodes(fields=["names"])), all other callers set this flag to True
- RAPI also sets the flag to True
The patch was tested with a continuous (0.2s sleep in-between) gnt-instance list during a burnin, and no problems were observed. Reviewed-by: ultrotter
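Client-side, the change boils down to threading the new flag through the luxi query helpers; a simplified sketch (the signature is assumed, only the use_locking idea is taken from the description above):

    def ListInstances(client, fields, sync=False):
        # "sync" maps to the new SYNC_OPT option of gnt-instance list: the
        # listing is lockless by default, and the option opts back into locking.
        return client.QueryInstances([], fields, use_locking=sync)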
- Jan 21, 2009
Iustin Pop authored
Two are real errors (invalid names) and one is a style error (a name overriding one from an outer scope). Reviewed-by: ultrotter
- Jan 20, 2009
Iustin Pop authored
(this is related to the master daemon log) Currently it's not possible to follow (in the non-debug runs) the logical execution thread of jobs. This is due to the fact that we don't log the thread name (so we lose the association of log messages to jobs) and we don't log the start/stop of job and opcode execution. This patch adds a new parameter to utils.SetupLogging that enables thread name logging, and promotes some log entries from debug to info. With this applied, it's easier to understand which log messages relate to which jobs/opcodes. The patch also moves the "INFO client closed connection" entry to debug level, since it's not a very informative log entry. Reviewed-by: ultrotter
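The mechanism itself is just a matter of including the thread name in the log format when the new parameter is set; a sketch with the standard logging module (the real utils.SetupLogging signature differs and the parameter name is assumed):

    import logging

    def SetupLogging(logfile, debug=False, multithreaded=False):
        # %(threadName)s ties each message to the worker thread (and thus
        # to the job/opcode) that emitted it.
        fmt = "%(asctime)s: "
        if multithreaded:
            fmt += "%(threadName)s "
        fmt += "%(levelname)s %(message)s"
        handler = logging.FileHandler(logfile)
        handler.setFormatter(logging.Formatter(fmt))
        root = logging.getLogger("")
        root.addHandler(handler)
        root.setLevel(logging.DEBUG if debug else logging.INFO)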
- Jan 09, 2009
Iustin Pop authored
The current fork-and-close-fds sequence has deficiencies which are hard to work around:
- logging can start before we fork (e.g. if we need to emit messages related to master checking), and thus use FDs which we can't track nicely
- the queue locks the queue file, and again this fd needs to be kept open, which is hard to do from the main loop (and this error is currently hidden by the fact that we don't log it)
Given the above, it's much simpler, in case we fork later, to close file descriptors right at the beginning of the program, and in Daemonize only close/reopen the stdin/stdout/stderr fds. In addition, we also close() the handlers we remove in SetupLogging so that the cleanup is more thorough. Reviewed-by: imsnah
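A rough sketch of the split described above: close inherited descriptors right at program start, and let Daemonize touch only the standard streams later (helper names assumed):

    import os

    def CloseFDs(noclose_fds=()):
        # Close everything above stderr except fds we explicitly keep
        # (e.g. the queue lock file would be passed in here).
        for fd in range(3, 1024):
            if fd in noclose_fds:
                continue
            try:
                os.close(fd)
            except OSError:
                pass

    def Daemonize(logfile):
        # Only stdin/stdout/stderr are reopened here; everything else was
        # already closed at the very beginning of the program.
        if os.fork() > 0:
            os._exit(0)
        os.setsid()
        out_fd = os.open(logfile, os.O_WRONLY | os.O_CREAT | os.O_APPEND)
        null_fd = os.open(os.devnull, os.O_RDONLY)
        os.dup2(null_fd, 0)
        os.dup2(out_fd, 1)
        os.dup2(out_fd, 2)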
- Jan 06, 2009
Iustin Pop authored
Two bad indentation cases and a missing variable. Reviewed-by: imsnah
- Dec 18, 2008
Michael Hanselmann authored
With a large job queue, auto-archiving jobs can take a very long time, causing timeouts on the luxi RPC layer. With this change, auto-archive returns after half of the RPC timeout has passed. The user will see how many jobs are left unchecked. Reviewed-by: ultrotter
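The idea reduces to a simple time budget; a sketch (the archiving callback and the timeout value are placeholders):

    import time

    def AutoArchiveJobs(job_ids, archive_fn, age, rpc_timeout):
        """Archive jobs until half of the luxi RPC timeout is used up."""
        deadline = time.time() + rpc_timeout / 2.0
        archived = 0
        for idx, job_id in enumerate(job_ids):
            if time.time() > deadline:
                # Stop early; the caller reports the jobs left unchecked.
                return archived, len(job_ids) - idx
            if archive_fn(job_id, age):
                archived += 1
        return archived, 0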
- Dec 11, 2008
Iustin Pop authored
This patch should fix all outstanding epydoc parsing errors; as such, we switch epydoc into verbose mode so that any new errors will be visible. Reviewed-by: imsnah
- Dec 02, 2008
Iustin Pop authored
The ssconf files were not updated by the master failover. We need to push them, and since we already have RPC initialized, we can use the standard ConfigWriter to do so - this will take care of both the config file and the ssconf files. Reviewed-by: imsnah
- Nov 26, 2008
Guido Trotter authored
Since we're not sure ganeti-noded has started yet, we need to create RUN_GANETI_DIR before SOCKET_DIR as well, with the proper permissions. Reviewed-by: imsnah
- Nov 25, 2008
Guido Trotter authored
Before, it was in the abstract Linux namespace, where unfortunately we couldn't easily check the credentials of connecting clients from Python. Now we also have to remove the file on exit and when starting. Reviewed-by: imsnah
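With a filesystem socket, a stale file has to be removed before binding and the path unlinked again on shutdown; a minimal sketch (the socket path is assumed):

    import os
    import socket

    MASTER_SOCKET = "/var/run/ganeti/socket/ganeti-master"  # path assumed

    def OpenMasterSocket(path=MASTER_SOCKET):
        # Remove a stale socket left by a previous, unclean run.
        if os.path.exists(path):
            os.unlink(path)
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.bind(path)
        sock.listen(5)
        return sock

    def CloseMasterSocket(sock, path=MASTER_SOCKET):
        sock.close()
        # Unlike an abstract-namespace socket, the file must be cleaned up.
        if os.path.exists(path):
            os.unlink(path)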
Guido Trotter authored
If SOCKET_DIR doesn't exist we create it in the master daemon, before trying to put a socket inside it. Reviewed-by: imsnah
- Nov 21, 2008
Michael Hanselmann authored
Removing the PID file should be the last thing done. This patch makes sure it's also removed when master.server_cleanup() throws an exception. Also initialize logging only after writing the PID file. Reviewed-by: iustinp
Michael Hanselmann authored
ganeti-masterd: add initialization and shutdown of the RPC pool; it needs to be shut down before forking.
ganeti.cli: add a decorator function to initialize and shut down the RPC pool.
ganeti.rpc: add functions to initialize and shut down the RPC pool; throw an exception when used without proper initialization.
gnt-cluster, gnt-node: use the decorator function to initialize and shut down the RPC pool.
Reviewed-by: iustinp
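The cli.py decorator described above would look roughly like this (the rpc.Init/rpc.Shutdown names are assumptions based on the description; only the module name is taken from the entry):

    import functools

    from ganeti import rpc  # module named in the entry above

    def RunWithRPC(fn):
        """Initialise the RPC pool around fn and always shut it down."""
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            rpc.Init()      # function name assumed
            try:
                return fn(*args, **kwargs)
            finally:
                rpc.Shutdown()  # must happen before any later fork
        return wrapper

gnt-cluster and gnt-node would then simply decorate their entry points with such a wrapper.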
- Oct 20, 2008
Iustin Pop authored
The two main multi-node job queue RPC calls (jobqueue_update, jobqueue_rename) are converted to address-based calls, in order to speed up queue changes. For this, we need to change the _nodes attribute on the jobqueue to be a dict {name: ip}, instead of a set. Reviewed-by: imsnah
Iustin Pop authored
Since we now use only one function from the logger module (SetupLogging), we move it to utils.py (which is already imported by all users of this function), and we remove the module. Reviewed-by: imsnah
- Oct 16, 2008
Iustin Pop authored
In order to account for future improvements to master failover, we move the actual data-gathering capabilities from ganeti-masterd into bootstrap.py, and leave only the verification in masterd. The verification procedure is changed to retry multiple times (for up to one minute) in case most nodes do not respond, and the algorithm is changed to require at least half (but not half+1) of the votes, since our own vote should also count (we vote for ourselves). Example for a consistent (config-wise) cluster:
- 5-node cluster, 2 nodes down: still start
- 4-node cluster, 2 nodes down: retry for one minute, then abort
Reviewed-by: ultrotter
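The retry-and-count logic can be sketched as follows; the threshold below is one reading of the rule that matches both examples (names and intervals assumed):

    import time

    def CheckMasterVotes(gather_votes_fn, num_nodes, retries=12, wait=5.0):
        """Retry vote gathering for up to a minute before giving up.

        gather_votes_fn returns how many nodes (ourselves included) name
        this node as the master.
        """
        for _ in range(retries):
            votes = gather_votes_fn()
            if votes * 2 > num_nodes:
                return True          # more than half of all defined nodes
            time.sleep(wait)         # most nodes may simply not be up yet
        return False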
Iustin Pop authored
This adds the set/reset of the drain flag in the jqueue and luxi modules, a way to query it in OpQueryConfigValues, and the command line interface for it:
$ gnt-cluster queue info
The drain flag is unset
$ gnt-cluster queue drain
$ gnt-cluster queue info
The drain flag is set
$ gnt-cluster queue undrain
$ gnt-cluster queue info
The drain flag is unset
The setting is done via luxi and not an opcode because opcodes can't be executed while the queue is drained; the query, however, is not done via luxi, since in the future this might become a cluster property as opposed to a node one. Reviewed-by: imsnah
- Oct 15, 2008
Iustin Pop authored
This patch adds a generic method to identify the ganeti error given its class name, and implements this across the luxi protocol. Reviewed-by: imsnah
- Oct 10, 2008
Iustin Pop authored
This big patch changes the call model used in inter-node RPC from standalone function calls in the rpc module to an RpcRunner class that holds all the methods. This can be used in the future to enable smarter processing in the RPC layer itself (a quick example is not setting the DiskID from cmdlib code, but only once in each rpc call, etc.). There are a few RPC calls that are made outside of the LU code; these are left as staticmethods, so they can be used without a class instance (which requires a ConfigWriter instance). Reviewed-by: imsnah
- Oct 07, 2008
Iustin Pop authored
Background: when we have multiple jobs in the queue (more than just a few), many of the jobs (up to the number of threads) will be in state 'running', although many of them could be actually blocked, waiting for some locks. This is not good, as one cannot easily see what is happening. The patch extends the opcode/job possible statuses with another one, waiting, which shows that the LU is in the acquire locks phase. The mechanism for doing so is simple, we initialize (in the job queue) the opcode with OP_STATUS_WAITLOCK, and when the processor is ready to give control to the LU's Exec, it will call a notifier back into the _JobQueueWorker that sets the opcode status to OP_STATUS_RUNNING (with the proper queue locking). Because this mechanism does not save the job, all opcodes on disk will be in status WAITLOCK and not RUNNING anymore, so we also change the load sequence to consider WAITLOCK as RUNNING. With the patch applied, creating in parallel (via burnin) five instances on a five node cluster shows that only two are executing, while three are waiting for locks. Reviewed-by: imsnah
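The notification mechanism can be sketched like this: the queue starts every opcode in the new status, and the processor invokes a callback right before handing control to the LU's Exec (names simplified):

    # Status constants as described above.
    OP_STATUS_WAITLOCK = "waiting"
    OP_STATUS_RUNNING = "running"

    def RunOpCode(op, exec_opcode_fn):
        """Job-queue side of the mechanism (simplified sketch).

        exec_opcode_fn stands in for the LU processor; it is expected to
        call the callback it receives just before running the LU's Exec().
        """
        op.status = OP_STATUS_WAITLOCK       # acquiring locks

        def _NotifyRunning():
            # In the real code this happens under the queue lock.
            op.status = OP_STATUS_RUNNING

        return exec_opcode_fn(op, _NotifyRunning)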
- Oct 06, 2008
Iustin Pop authored
This patch adds a new luxi call that implements auto-archiving of jobs older than a certain age (or -1 for all completed jobs), and the gnt-job command that makes use of this (with 'all' for -1). Reviewed-by: imsnah
- Oct 01, 2008
Michael Hanselmann authored
Use simpleconfig instead of ssconf. Reviewed-by: iustinp
Michael Hanselmann authored
This can be used to retrieve certain cluster config values from within clients. OpDumpClusterConfig was not used anywhere, hence I'm just reusing it. The way ConfigWriter.DumpConfig returned the configuration was not thread-safe, anyway (no deepcopy). Reviewed-by: iustinp
- Sep 09, 2008
Iustin Pop authored
This is an initial version of the master startup checks. It's a very rudimentary change; however, in normal usage (an old master is started while the rest of the cluster is functioning normally) it will succeed in preventing wrong startups. Reviewed-by: imsnah
- Aug 29, 2008
Iustin Pop authored
This patch alters the WaitForJobChanges luxi-RPC call to have a configurable timeout, so that the call behaves nicely with long jobs that have no updates. We do this by adding a timeout parameter to the RPC call, and returning a special constant when the timeout is reached without an update. The luxi client will repeatedly call WaitForJobChanges until it gets a real change. The timeout is hardcoded as half the RWTO value. The patch also removes an unused variable (new_state) from the WaitForJobChanges method. Reviewed-by: imsnah, ultrotter
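On the client side the call just loops until something other than the timeout marker comes back; a sketch (the constant is a stand-in):

    WFJC_TIMEOUT = "nochange"   # stand-in for the special constant

    def WaitForJobChange(call_fn, job_id, fields, prev_state, timeout):
        """Keep calling the luxi method until a real change is reported."""
        while True:
            result = call_fn(job_id, fields, prev_state, timeout)
            if result != WFJC_TIMEOUT:
                # Anything else is real job information (or an error).
                return result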
- Aug 27, 2008
Michael Hanselmann authored
This is a large patch, but I can't figure out how to split it without breaking stuff. The old way of getting messages by always getting the last one didn't bring all messages to the client if they were added too fast, thereby making commands like “gnt-cluster verify” less than useful. These changes introduce a serial number per log entry to keep track of which messages a client has already received. They also remove the per-opcode log lock to make reading log entries thread-safe. Reviewed-by: ultrotter
- Aug 18, 2008
Michael Hanselmann authored
By using this Linux-specific way we don't have to care about removing the socket file when quitting or starting (after an unclean shutdown). For a more detailed description, see the comment in the patch. Reviewed-by: schreiberal
- Aug 11, 2008
Michael Hanselmann authored
This way clients can react faster to status or message changes and don't have to poll anymore. Reviewed-by: ultrotter
- Aug 08, 2008
Michael Hanselmann authored
Reviewed-by: iustinp
- Aug 06, 2008
Michael Hanselmann authored
The job queue maintains its own node list and must be notified when nodes are added/removed. Reviewed-by: iustinp
Michael Hanselmann authored
By doing this we have a central place which coordinates what needs to be done when adding or removing nodes. Another patch will add calls into the job queue. Two log messages move to config.py. When removing a node, node_leave_cluster is now called after it has been removed from the configuration and job manager. That way we're sure not to access the node again after files have been removed. Reviewed-by: iustinp
Michael Hanselmann authored
The job queue now maintains its own list and is updated when nodes are added or removed from the cluster. Reviewed-by: iustinp