Commits · 60975797da6d4802e0cdd0ddc3dedb6b24aed041 · itminedu / snf-ganeti

Aug 04, 2009

rpc: add rpc call for getting disk size · 968a7623

Iustin Pop authored 15 years ago


Note that this exports the disk size as bdev returns it, in bytes. The
value will be converted to MiB in cmdlib.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
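
A minimal sketch of the bytes-to-MiB conversion mentioned above. The helper name `bytes_to_mib` and the round-down policy are assumptions made for illustration, not the actual cmdlib code:

```python
def bytes_to_mib(size_bytes):
    """Convert a size in bytes to whole MiB, rounding down (assumed policy)."""
    return size_bytes // (1024 * 1024)


# A 10 GiB disk reported in bytes by the block device layer.
print(bytes_to_mib(10 * 1024 ** 3))  # prints 10240
```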


Jul 29, 2009

Extend call_node_start_master rpc with no_voting · 2503680f

Guido Trotter authored 15 years ago


When the parameter is set to True and start_daemons is also True,
ganeti-masterd will be started with the new --no-voting --yes-do-it
options.

This new option is set to True only on masterfailover, when no_voting is
used. This changes the behavior from 2.0, where we didn't start the
master daemon at all when this option was used.

The manpage is also updated to remove the 2.0-only change.

Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


Jul 08, 2009

ganeti-masterd: allow non-interactive --no-voting · 5e96d216

Guido Trotter authored 15 years ago


This will be used by ganeti-noded to start ganeti-masterd in a
--no-voting masterfailover.

Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>


May 25, 2009

watcher: automatically restart noded/rapi · c4f0219c

Iustin Pop authored 15 years ago


This patch makes the watcher automatically restart the node and rapi
daemons, if they are not running (as per the PID file).

This is not an exhaustive test; a better one would be TCP connect to the
port, and an even better one a simple protocol ping (e.g. get / for rapi
and a rpc_call_alive for noded), but since we don't know how they've
been started we can't implement it today. rapi would need to write the
SSL/port to a file, and noded something similar, so that we know how to
connect.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
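
A PID-file liveness check of the kind described above could look roughly like this on a POSIX system. The function names are hypothetical and this is not the watcher's actual code; signal 0 is the standard way to probe a PID without affecting the process:

```python
import errno
import os


def is_process_alive(pid):
    """Return True if a process with this PID exists (signal 0 only probes)."""
    try:
        os.kill(pid, 0)
    except OSError as err:
        # EPERM means the process exists but belongs to another user.
        return err.errno == errno.EPERM
    return True


def daemon_running(pid_file):
    """Read a PID file and report whether the recorded process is alive."""
    try:
        with open(pid_file) as handle:
            pid = int(handle.read().strip())
    except (IOError, OSError, ValueError):
        # Missing, unreadable or malformed PID file: treat as not running.
        return False
    return pid > 0 and is_process_alive(pid)
```

As the message notes, this only shows that some process with that PID exists; it does not prove the daemon is answering requests.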


watcher: handle full and drained queue cases · 24edc6d4

Iustin Pop authored 15 years ago


Currently the watcher is broken when the queue is full, thus not
fulfilling its job as a queue cleaner. It also doesn't handle the
queue drained status nicely.

This patch makes a few changes:
  - first archive jobs, and only afterwards submit jobs; this fixes the case
    where the queue is already full and there are jobs suited for
    archiving (but not the case where the jobs are all too young to be
    archived)
  - handle the job queue full and drained cases nicely; instead of
    tracebacks, log such cases
  - reverse the initial value and special cases for update_file; we now
    whitelist instead of blacklist cases, since we have many more
    blacklist cases than whitelist ones, and we set the flag to True only
    after the run is successful

The last change, especially, is a significant one: now errors during the
watcher run will not update the status file, and thus they won't be lost
again in the logs.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
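
The whitelist approach for update_file can be sketched as follows. The `run_watcher` function and the idea of passing the steps as callables are assumptions for the example; only the flag handling mirrors the description above:

```python
def run_watcher(steps):
    """Run the watcher steps in order; return True only if all succeed.

    The flag starts as False and is set to True only at the very end,
    so any exception leaves it False and the status file untouched.
    """
    update_file = False
    try:
        for step in steps:
            step()
        update_file = True
    except Exception as err:
        # Log the failure instead of letting a traceback escape.
        print("watcher run failed: %s" % err)
    return update_file
```

Passing the archive step before the submit step reproduces the ordering from the first bullet, e.g. `run_watcher([archive_jobs, submit_jobs])` with two hypothetical callables.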


May 20, 2009

watcher: write the instance status to a file · 78f44650

Iustin Pop authored 15 years ago


This patch modifies the watcher to keep on-disk a file with the instance
status; this can be used from outside of ganeti to react to instances
being down (when the watcher cannot restart them).

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>


May 19, 2009

watcher: try to restart the master if down · 7dfb83c2

Iustin Pop authored 15 years ago


Bugs in either our code or in associated libraries can bring the master daemon
down, and this (due to the 2.0 architecture) stops all work on the cluster.

Since the watcher already does periodic checks on the cluster, we modify
it to try to start the master automatically in case of failures to
connect. This will be tried only once per cycle.

Also, in this case, we modify the code so that the watcher status file
is not updated; its timestamp will thus reflect the time of the last
successful connection to the master.

Side note: the except errors.ConfigurationError part could be cleaned
up, since in 2.0 we don't usually get that directly, and if we do it's
an error and we shouldn't touch the file anyway; but that is not an rc5
change.

Signed-off-by: Iustin Pop <iustin@google.com>
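
One way to read "tried only once per cycle" is the sketch below. The callables `connect` and `start_master` are hypothetical stand-ins, and retrying the connection right after the start attempt is an assumption rather than something the message states:

```python
def connect_with_restart(connect, start_master):
    """Try to connect; on failure start the master once and retry once."""
    try:
        return connect()
    except ConnectionError:
        # Single restart attempt for this watcher cycle.
        start_master()
        return connect()  # a second failure propagates to the caller
```

If the second attempt also fails, the exception reaches the caller, which can then skip updating the status file so that its timestamp keeps marking the last successful connection.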


May 05, 2009

ganeti-noded: add bind address option · cf192249

Guido Trotter authored 15 years ago

This allows ganeti-noded to bind only on one interface rather than all
the ones on the machine. The default behaviour doesn't change.

Signed-off-by: Guido Trotter <ultrotter@google.com>


May 04, 2009

Fix luxi serialization in ganeti-masterd · dd36d829

Iustin Pop authored 15 years ago


Currently, lib/luxi.py uses lib/serializer.py for encoding/decoding
messages, but the master daemon uses the simplejson module directly.
This is wrong, as any non-trivial change to serializer.py will break the
master daemon.

The patch changes masterd to use exactly the same functions as luxi.py
for encoding/decoding of messages.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>


Apr 06, 2009

Disable synchronous (locking) queries · 77921a95

Iustin Pop authored 15 years ago

This patch raises an error in the master daemon in case the user
requests a locking query; accordingly, all clients were modified to send
only lockless queries. This is a short-term fix; for a proper fix, the
clients should be modified to submit a job when the user requests a
locking query.

The other approach would be to ignore the flag passed by the client;
this would be worse, as clients wouldn't even get an error.

This has several possible impacts:
  - some commands might not have been converted, and thus fail; this
    can be remedied easily
  - the consistency of commands is lost; e.g. node failover will not
    lock the node *while we get the node info*, so we could miss some
    data; this is again in the thread of atomic operations which are
    missing in the current model of query-and-act from gnt-* scripts

Reviewed-by: imsnah, ultrotter


Fix the output of watcher on non-master nodes · 2c404217

Iustin Pop authored 15 years ago

Currently the watcher spews error messages on non-master nodes. This
cleans it up.

Reviewed-by: imsnah


Change the watcher to use jobs instead of queries · 6dfcc47b

Iustin Pop authored 15 years ago

As per the mailing list discussion, this patch changes the watcher to
use a single job (two opcodes) for getting the cluster state (node list
and instance list); it will then compute the needed actions based on
this data.

The patch also archives this job and the verify-disks job.

Reviewed-by: imsnah


Add some more debugging info to masterd · e566ddbd

Iustin Pop authored 15 years ago

This patch will log data about queries, which are today completely
invisible (at the default log level) in the master log file.

Reviewed-by: imsnah


Mar 09, 2009

watcher: fix startup sequence locking the master · cc962d58

Iustin Pop authored 16 years ago

Currently, the watcher startup sequence does:
  - open a luxi client
  - get the instance list
  - get the node boot ids
  - open and lock the status file, and:
    - archive jobs
    - restart the down instances
    - check disks

This, of course, can lead to problems when a node is (genuinely or not)
locked for more than (watcher interval * maximum query clients) time. At
that time, the master is completely unresponsive until the node is
unlocked and all the watchers exit with error due to the state file
being locked by the first instance.

This patch reworks the startup sequence to first open/lock the status
file, and only then open a luxi client. This should prevent the above
case.

Reviewed-by: ultrotter
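
The reworked ordering can be sketched as below: take the status file lock first and fail fast if it is held, then talk to the master. All names are hypothetical, and the use of `fcntl.flock` is an assumption about the locking mechanism, not a statement about the watcher's code:

```python
import fcntl


def open_and_lock(path):
    """Open the status file and take an exclusive, non-blocking lock.

    Raises OSError if another open file handle already holds the lock.
    """
    handle = open(path, "a+")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        handle.close()
        raise
    return handle


def watcher_run(status_path, open_client, work):
    """Lock the status file first; contact the master only afterwards."""
    status_file = open_and_lock(status_path)  # fails fast if already locked
    try:
        client = open_client()  # hypothetical factory for a luxi client
        return work(client)
    finally:
        status_file.close()  # closing the file releases the lock
```

With this order, a second watcher exits on the lock before it ever opens a client, so stalled watchers no longer pile up as query clients on the master.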


Feb 27, 2009

Create runtime dir in bootstrap · 9dae41ad

Guido Trotter authored 16 years ago

Some hypervisors (KVM) need RUN_GANETI_DIR to exist even at cluster init
time. This patch creates it in InitCluster just before hv parameter
checking. Since the code to create a list of directories is already repeated
twice in the code, and this would be the third time, we abstract it into
a utils.EnsureDirs function and we call that one from ganeti-noded,
ganeti-masterd and bootstrap.

Reviewed-by: iustinp
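
A helper in the spirit of utils.EnsureDirs might look like the sketch below. The name `ensure_dirs` and the list-of-(path, mode) argument are assumptions for illustration and may differ from the real function:

```python
import os


def ensure_dirs(dirs):
    """Create each (path, mode) directory if missing and set its mode."""
    for path, mode in dirs:
        if not os.path.isdir(path):
            os.makedirs(path)
        # chmod explicitly, since the mode given at creation time is
        # masked by the process umask.
        os.chmod(path, mode)
```

Calling it twice with the same list is harmless, which is what makes it usable from several daemons and from bootstrap.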


Feb 24, 2009

Remove the extra_args parameter in instance start · 07813a9e

Iustin Pop authored 16 years ago

This patch removes the extra_args parameter and instead switches the
instance to the HV_KERNEL_ARGS hypervisor option.

This is a big change, but it's a needed cleanup: this extra parameter on
all RPC calls is not generic and we also need to have a persistent value
here.

Reviewed-by: imsnah


Feb 16, 2009

watcher: fix checking of boot IDs · 3448aa22

Iustin Pop authored 16 years ago

The recent change (commit 2151) to the watcher to make it handle offline
nodes also saves the offline attribute to the state file, but this is
not needed and also breaks the checking of the boot ID. This patch
simply removes it, restoring the correct behaviour.

Reviewed-by: imsnah


watcher: autoarchive old jobs · f07521e5

Iustin Pop authored 16 years ago

This patch adds auto-archiving of jobs older than 6 hours to the
watcher.

Reviewed-by: imsnah
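
The age filter behind such auto-archiving can be illustrated as follows. The `(job_id, end_timestamp)` input shape and measuring age from the job's end time are assumptions made for the example:

```python
import time

MAX_AGE_SECONDS = 6 * 60 * 60  # six hours, as stated in the commit message


def jobs_to_archive(jobs, now=None):
    """Return IDs of jobs that ended more than six hours before `now`.

    `jobs` is a list of (job_id, end_timestamp) pairs (assumed shape).
    """
    if now is None:
        now = time.time()
    return [job_id for job_id, ended in jobs if now - ended > MAX_AGE_SECONDS]
```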


Feb 13, 2009

RAPI: fixes related to write mode · 6e99c5a0

Iustin Pop authored 16 years ago

This patch fixes many small issues related to write functions:
  - update the documentation w.r.t. how to add users
  - update the instance add function for latest API
  - add instance delete
  - fix addition of tags
  - update some error messages

Reviewed-by: imsnah


RAPI: format error messages as JSON · 1f8588f6

Iustin Pop authored 16 years ago

This patch changes the format of the HTTP error messages from text/html, which
is hard to parse from RAPI clients, to JSON which can be automatically parsed.

The error message is an object which always contains three keys:
  - code, an integer with the error code
  - message, a short description
  - explain, holding (if available) a description of the error

In order to implement this, there is a bit of change to the http server
and executor classes. I've tested, and the error handling still works
(though less optimally, with no error message) in case the error
formatting itself raises an exception.

Reviewed-by: imsnah
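
For illustration, an error body with the three keys listed above could be built like this. The helper name `format_error` and the use of null for a missing explanation are assumptions, not the RAPI implementation:

```python
import json


def format_error(code, message, explain=None):
    """Build a JSON error body with the keys code, message and explain."""
    return json.dumps({"code": code, "message": message, "explain": explain})


body = format_error(404, "Not Found", "Nothing matches the given URI")
print(body)

# A client can parse the body instead of scraping HTML.
error = json.loads(body)
print(error["code"], error["message"])
```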


Make RAPI return 502/504 errors for luxi errors · 77e1d753

Iustin Pop authored 16 years ago

This changes the RAPI error codes for luxi errors; a timeout error is
now reported properly as 504, while any other luxi error is reported as
502.

It would be good to convert even more errors into proper return codes in
the future.

Reviewed-by: imsnah
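
The mapping described above can be sketched with two stand-in exception classes. The class names and the 500 fallback are hypothetical; only the 504/502 split comes from the message:

```python
class LuxiError(Exception):
    """Stand-in for a generic luxi error (hypothetical class)."""


class LuxiTimeout(LuxiError):
    """Stand-in for a luxi timeout error (hypothetical class)."""


def http_status_for(err):
    """Map an exception to the HTTP status code RAPI would return."""
    # Check the more specific class first, since it subclasses LuxiError.
    if isinstance(err, LuxiTimeout):
        return 504  # Gateway Timeout
    if isinstance(err, LuxiError):
        return 502  # Bad Gateway
    return 500  # assumed fallback for everything else
```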


Fix ganeti-rapi startup with missing certificate · 6bb258a7

Iustin Pop authored 16 years ago

This patch displays a nicer error message compared to the default
stacktrace.

Reviewed-by: imsnah


Feb 12, 2009

master daemon: allow skipping the voting process · 5de4474d

Iustin Pop authored 16 years ago

This patch introduces a 'force' mode for the master daemon startup where
the voting process is not done, but the user has to manually confirm the
startup (before forking, of course).

Reviewed-by: imsnah


Switch the instance_shutdown rpc to (status, data) · 1fae010f

Iustin Pop authored 16 years ago

This patch changes the return type from this RPC call to include status
information and renames the backend method to match the RPC call name.

The patch is a little bigger than the reboot one, since this call is
used in more than one place. However, all the points of call have the
same usage pattern, so the patch is trivial.

Reviewed-by: ultrotter
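
The (status, data) convention can be illustrated with a small sketch. Both functions and the message strings are hypothetical and only show the calling pattern, not the backend code:

```python
def instance_shutdown(shutdown):
    """Run `shutdown` and report the outcome as a (status, data) pair."""
    try:
        shutdown()
    except Exception as err:
        return (False, "shutdown failed: %s" % err)
    return (True, "instance shut down")


def check_result(result):
    """Unpack a (status, data) pair, raising if the call failed."""
    status, data = result
    if not status:
        raise RuntimeError(data)
    return data
```

Because every call site unpacks the pair the same way, converting an RPC to this format is mostly mechanical, which matches the remark that the patch is trivial despite its size.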


Switch the instance_reboot rpc to (status, data) · 489fcbe9

Iustin Pop authored 16 years ago

This small patch changes the return type from this RPC call to include
status information and renames the backend method to match the RPC call
name.

Reviewed-by: ultrotter


Feb 11, 2009

ganeti-noded: Create LOCK_DIR if missing · f2ffd244

Guido Trotter authored 16 years ago

We need this directory for locks, so if for any reason it's not there
we'll create it. The permissions are the standard /var/lock permissions.

Reviewed-by: iustinp


Feb 09, 2009

Uniformize some function names in backend.py · 821d1bd1

Iustin Pop authored 16 years ago

Currently, the names of the functions in backend.py that are actually
RPC procedures and are called from ganeti-noded do not correspond to
the RPC names. This makes it hard to actually see which functions are
exported and which functions are internal to backend.

This patch renames all blockdevice-related functions in backend.py to match
the name of the RPC call (without the ‘call’ or ‘perspective’ prefix).
This should make it easier to grep for a given function called in
cmdlib, without having to open and check in ganeti-noded what backend
function it corresponds to.

The patch also does two minor extra cleanups (rename a variable and
change a logging level).

Reviewed-by: ultrotter


rpc.call_blockdev_find: convert to (status, data) · 23829f6f

Iustin Pop authored 16 years ago

This patch converts the call_blockdev_find - which searches for block
devices and returns their status - to the (status, data) format. We also
modify the backend function name to match the rpc call.

Reviewed-by: ultrotter


Fix handling OS errors in AddOSToInstance · 1268d6fd

Iustin Pop authored 16 years ago

This patch fixes the error handling in the add OS to instance function
with regard to invalid OSes. Previously, we didn't handle any such
errors, with the end result that the user would have to look in the node
daemon log.

The patch also renames the function to match the RPC call name.

Reviewed-by: ultrotter


Feb 05, 2009

rapi: fix SSL mode and use SSL by default · 2ed6a7d6

Iustin Pop authored 16 years ago

This patch fixes the SSL mode (by actually constructing SSL parameters
from the command line options) and enables SSL by default; the old “-S”
option which enabled SSL is now changed to “--no-ssl”. The certificate
and key are by default pointing to the Ganeti auto-generated certificate
for rapi.

Reviewed-by: imsnah


Feb 04, 2009

rapi: fix authentication and queries · 85414b69

Iustin Pop authored 16 years ago

For queries, we don't want to require authentication. We fix this by adding a
GetAuthRealm override in the rapi daemon.

We also fix a method name.

Reviewed-by: imsnah


Add one new luxi query: cluster info · 66baeccc

Iustin Pop authored 16 years ago

This is the last query that RAPI executes via opcodes and is purely
static (config values only). As such, we can convert it safely to a
query instead of a job.

Reviewed-by: imsnah


Implement lockless query operations · ec79568d

Iustin Pop authored 16 years ago

This patch adds the framework for, and enables, lockless OpQueryInstances. This
means that instances can be shown in ERROR_up or ERROR_down state, even though
this is not an error (but just an in-progress job).

The framework is implemented as follows:
  - the OpQueryInstances, OpQueryNodes and OpQueryExports opcodes take
    an additional “use_locking” flag which will denote whether to lock
    or not; this patch only implements this for LUQueryInstances
  - the luxi query functions take an additional argument use_locking
    which is passed to the master daemon, and then passed to the above
    opcodes
  - cli.py exports a new SYNC_OPT command line option which sets
    this flag to True
  - except for gnt-instance list, which uses this option, and for
    name-only queries (e.g. QueryNodes(fields=["names"])), all other
    callers set this flag to True
  - RAPI also sets the flag to True

The patch was tested with a continuous (0.2s sleep in-between)
gnt-instance list during a burnin, and no problems were observed.

Reviewed-by: ultrotter


Jan 21, 2009

Fix some more pylint errors · c979d253

Iustin Pop authored 16 years ago

Two are real errors (invalid names) and one is a style error (overriding
a name from the outer scope).

Reviewed-by: ultrotter


Add calls in the intra-node migration protocol · 6906a9d8

Guido Trotter authored 16 years ago

Currently the hypervisor is expected to do all the migration from the
source side. With this patch we also add the option of passing some
information to the target side, and starting some operation there.

As a bonus, a function to clean up any started operation is included.

Reviewed-by: iustinp


Jan 20, 2009

Update the logging output of job processing · d21d09d6

Iustin Pop authored 16 years ago

(this is related to the master daemon log)

Currently it's not possible to follow (in the non-debug runs) the
logical execution thread of jobs. This is due to the fact that we don't
log the thread name (so we lose the association of log messages to jobs)
and we don't log the start/stop of job and opcode execution.

This patch adds a new parameter to utils.SetupLogging that enables
thread name logging, and promotes some log entries from debug to info.
With this applied, it's easier to understand which log messages relate
to which jobs/opcodes.

The patch also moves the "INFO client closed connection" entry to debug
level, since it's not a very informative log entry.

Reviewed-by: ultrotter
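
Thread name logging of this kind can be reproduced with the standard logging module, as in the sketch below. The `setup_logging` function, its `log_threads` parameter and the logger name are hypothetical and do not reflect the real utils.SetupLogging signature:

```python
import logging
import sys
import threading


def setup_logging(stream, log_threads=False):
    """Configure a demo logger, optionally including the thread name."""
    fmt = "%(asctime)s %(levelname)s %(message)s"
    if log_threads:
        fmt = "%(asctime)s %(threadName)s %(levelname)s %(message)s"
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(fmt))
    logger = logging.getLogger("jobs-demo")
    logger.handlers = [handler]  # replace any handler from an earlier call
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = setup_logging(sys.stderr, log_threads=True)
worker = threading.Thread(target=logger.info, args=("starting job 42",),
                          name="JobWorker1")
worker.start()
worker.join()
```

With the thread name in every record, filtering the log for one worker name yields the entries that belong to the job that worker is processing.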


Jan 13, 2009

Forward-port DrbdNetReconfig · 6b93ec9d

Iustin Pop authored 16 years ago

This is a modified forward-port of DrbdNetReconfig and its associated
RPCs. In Ganeti 2.0, these functions will be used for two things:
  - live migration (as in 1.2)
  - and for other network reconfiguration tasks, since DRBD8.Attach()
    doesn't do them anymore

Because of the Attach() changes, we can now implement the
AttachNet/DisconnectNet functions as independent entities, and we don't
need the cache anymore.

Note these functions are copies of the latest 1.2 code, and not
cherry-picks of the (many) patches that went into 1.2.

Reviewed-by: ultrotter


Small typo in ganeti-watcher · 4bffa7f7

Iustin Pop authored 16 years ago

Reviewed-by: imsnah


Jan 09, 2009

Rework the daemonization sequence · 7d88772a

Iustin Pop authored 16 years ago

The current fork+close fds sequence has deficiencies which are hard to
work around:
  - logging can start before we fork (e.g. if we need to emit
    messages related to master checking), and thus use FDs which we
    can't track nicely
  - the queue locks the queue file, and again this fd needs to be kept
    open which is hard from the main loop (and this error is currently
    hidden by the fact that we don't log it)

Given the above, it's much simpler, in case we will fork later, to close
file descriptors right at the beginning of the program, and in Daemonize
only close/reopen the stdin/out/err fds.

In addition, we also close() the handlers we remove in SetupLogging so
that the cleanup is more thorough.

Reviewed-by: imsnah


Jan 08, 2009

Add an instance_migratable rpc call · 56e7640c

Iustin Pop authored 16 years ago

This is a forward-port of commit 1194 on the 1.2 branch:

  This call will check whether an instance is up on its primary, and that
  it has been started with symlinks. We currently have no on-secondary
  checks, nor any hypervisor specific call.

  Reviewed-by: iustinp

The difference from the original patch is that we don't include the
cmdlib changes, since those will come as a copy from the 1.2 cmdlib.py,
and not as individual patches.

Original-Author: ultrotter
