  1. Jul 26, 2009
  2. Jul 25, 2009
    • Collapse daemon's main function · 04ccf5e9
      Guido Trotter authored
      
      With three ganeti daemons, and one or two more coming, the daemons'
      main functions had become mostly cut&pasted code. This collapses most
      of it into a daemon.GenericMain function. Some more code could be
      shared between the two http-based daemons, but since the new daemons
      won't be http-based we won't do that right now.

      As a bonus, functionality for overriding the network port on the
      command line is added for all network-based daemons.
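
      As a loose illustration of the idea (the names and the signature below
      are assumptions, not the actual daemon.GenericMain API), each daemon's
      main can shrink to a thin wrapper around a shared helper:

          # Illustrative sketch only -- not the real daemon.GenericMain API.
          import optparse
          import sys

          def GenericMain(daemon_name, default_port, check_fn, exec_fn):
            """Parse common options, run daemon checks, then the main loop."""
            parser = optparse.OptionParser(prog=daemon_name)
            # Bonus: every network-based daemon gets a --port override.
            parser.add_option("-p", "--port", type="int", default=default_port,
                              help="network port (default: %default)")
            options, args = parser.parse_args()
            if not check_fn(options, args):   # daemon-specific sanity checks
              sys.exit(1)
            exec_fn(options, args)            # daemon-specific main loop

          # A daemon's own main then collapses to a few lines:
          def CheckNoded(options, args):
            return True

          def ExecNoded(options, args):
            print("serving on port %s" % options.port)

          if __name__ == "__main__":
            GenericMain("ganeti-noded", 1811, CheckNoded, ExecNoded)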
      
      Signed-off-by: default avatarGuido Trotter <ultrotter@google.com>
      04ccf5e9
  3. Jul 24, 2009
  4. Jul 23, 2009
  5. Jul 22, 2009
  6. Jul 19, 2009
    • Add a luxi call for multi-job submit · 56d8ff91
      Iustin Pop authored
      
      As a workaround for the job submit timeouts that we have, this patch
      adds a new luxi call for multi-job submit; the advantage is that all
      the jobs are added to the queue first, and only afterwards can the
      workers start processing them.
      
      This is definitely faster than per-job submit, where the submission of
      new jobs competes with the workers processing jobs.
      
      On a pure no-op OpDelay opcode (not on master, not on nodes), we have:
        - 100 jobs:
          - individual: submit time ~21s, processing time ~21s
          - multiple:   submit time 7-9s, processing time ~22s
        - 250 jobs:
          - individual: submit time ~56s, processing time ~57s
                        run 2:      ~54s                  ~55s
          - multiple:   submit time ~20s, processing time ~51s
                        run 2:      ~17s                  ~52s
      
      which shows that we indeed gain on the client side, and maybe even on
      the total processing time for a high number of jobs. For just 10 or so
      jobs I expect the difference to be just noise.
      
      This will probably require increasing the timeout a little when
      submitting too many jobs - 250 jobs at ~20 seconds is close to the
      current rw timeout of 60s.
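
      As a hedged illustration of the client side (assuming the luxi client
      exposes the new call roughly as below; the exact names and fields may
      differ), submitting a batch becomes a single round trip:

          # Sketch of client-side usage of the new multi-job submit call.
          # Assumes luxi.Client has a SubmitManyJobs() method; illustrative.
          from ganeti import luxi, opcodes

          client = luxi.Client()

          # Build 100 one-opcode no-op jobs; each job is a list of opcodes.
          jobs = [[opcodes.OpTestDelay(duration=0, on_master=False,
                                       on_nodes=[])]
                  for _ in range(100)]

          # One call enqueues everything, so the workers only start once the
          # whole batch is in the queue instead of competing with submission.
          job_ids = client.SubmitManyJobs(jobs)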
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
      (cherry picked from commit 2971c913)
  7. Jul 14, 2009
    • ganeti-masterd: avoid SimpleConfigReader · b2890442
      Guido Trotter authored
      
      SimpleStore is a lot less heavyweight than SimpleConfigReader, and to
      just get the master name we can use it instead. This is currently the
      only usage of SimpleConfigReader, but we're not going to delete the
      class, as new usages will come in for ganeti-confd (in 2.1). Using it
      there, though, will make the class even heavier to load, so it makes
      sense to convert this simple usage.
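
      A hedged before/after sketch (module and method names are assumed from
      Ganeti 2.0 and shown for illustration only):

          # Illustrative: get the master name without the heavier reader.
          from ganeti import ssconf

          # Before: load the full config reader just for the master name.
          # master_name = ssconf.SimpleConfigReader().GetMasterNode()

          # After: SimpleStore only reads the small ssconf files, which is
          # all we need here.
          master_name = ssconf.SimpleStore().GetMasterNode()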
      
      Signed-off-by: Guido Trotter <ultrotter@google.com>
  8. Jul 08, 2009
  9. Jul 07, 2009
  10. Jun 29, 2009
  11. Jun 15, 2009
  12. Jun 09, 2009
    • rpc: Add a simple failure reporting framework · 2cc6781a
      Iustin Pop authored
      
      This patch adds a simple failure reporting tool, similar to bdev's
      _ThrowError. In backend, we move towards the new-style RPC results (of
      type (status, payload)) and thus functions which use this style can very
      easily log and return the error message using this new function.
      
      The exception is declared here and not in errors.py since it's local to
      the node-daemon/backend combination.
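
      A minimal sketch of the pattern (the helper and exception names here
      are assumptions for illustration, not necessarily the exact ones
      added):

          # Sketch of failure reporting with (status, payload) RPC results.
          import logging

          class RPCFail(Exception):
            """Local exception used to report a failure to the node daemon."""

          def _Fail(msg, *args):
            """Log an error and abort the current backend function with it."""
            if args:
              msg = msg % args
            logging.error(msg)
            raise RPCFail(msg)

          def SomeBackendFunction(path):
            try:
              data = open(path).read()
            except EnvironmentError as err:
              _Fail("Can't read %s: %s", path, err)
            return (True, data)                 # new-style (status, payload)

          # The node daemon can then turn RPCFail into a (False, msg) result:
          def RunBackendCall(fn, *args):
            try:
              return fn(*args)
            except RPCFail as err:
              return (False, str(err))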
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
  13. May 27, 2009
    • Add a node powercycle command · f5118ade
      Iustin Pop authored
      
      This (somewhat big) patch adds support for remotely rebooting the nodes
      via whatever support the hypervisor has for such a concept.
      
      For KVM/fake (and containers in the future) this just uses sysrq plus a
      ‘reboot’ call if the sysrq method fails. For Xen, it first tries the
      above, and then a Xen-hypervisor reboot (we try sysrq first since that
      just requires opening a file handle, whereas a xen reboot means
      launching an external utility).
      
      The user interface is:
      
          # gnt-node powercycle node5
          Are you sure you want to hard powercycle node node5?
          y/[n]/?: y
          Reboot scheduled in 5 seconds
      
      The node hopefully reboots after sending the reply. In case the clock
      is broken, “time.sleep(5)” might take ages (but then I suspect SSL
      negotiation wouldn't work).
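
      A rough sketch of the sysrq-then-fallback idea (the /proc paths are the
      standard Linux magic-sysrq interface; the function shape and the
      fallback command are illustrative, not Ganeti's hypervisor API):

          # Illustrative powercycle: try magic sysrq (just a file write),
          # fall back to an external reboot utility if that fails.
          import subprocess
          import time

          def PowercycleNode(delay=5):
            time.sleep(delay)   # let the RPC reply go out first
            try:
              with open("/proc/sys/kernel/sysrq", "w") as f:
                f.write("1\n")                 # enable sysrq
              with open("/proc/sysrq-trigger", "w") as f:
                f.write("b\n")                 # 'b' = reboot immediately
            except EnvironmentError:
              subprocess.call(["reboot", "-n", "-f"])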
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
  14. May 25, 2009
    • watcher: automatically restart noded/rapi · c4f0219c
      Iustin Pop authored
      
      This patch makes the watcher automatically restart the node and rapi
      daemons, if they are not running (as per the PID file).
      
      This is not an exhaustive test; a better one would be a TCP connect to
      the port, and an even better one a simple protocol ping (e.g. GET / for
      rapi and an rpc_call_alive for noded), but since we don't know how
      they've been started we can't implement that today. rapi would need to
      write its SSL/port settings to a file, and noded something similar, so
      that we know how to connect.
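
      A sketch of the PID-file liveness check described above (file locations
      and the restart command are assumptions for illustration):

          # Restart a daemon if its PID file doesn't point at a live process.
          import os
          import subprocess

          def _IsAlive(pidfile):
            try:
              pid = int(open(pidfile).read().strip())
              os.kill(pid, 0)        # signal 0 only checks process existence
              return True
            except (OSError, ValueError):
              return False

          def EnsureDaemon(command, pidfile):
            if not _IsAlive(pidfile):
              subprocess.call(command)   # the daemons re-daemonize themselves

          if __name__ == "__main__":
            for cmd, pidfile in [(["ganeti-noded"],
                                  "/var/run/ganeti/ganeti-noded.pid"),
                                 (["ganeti-rapi"],
                                  "/var/run/ganeti/ganeti-rapi.pid")]:
              EnsureDaemon(cmd, pidfile)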
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>
    • watcher: handle full and drained queue cases · 24edc6d4
      Iustin Pop authored
      
      Currently the watcher is broken when the queue is full, thus not
      fulfilling its job as a queue cleaner. It also doesn't handle the
      queue drained status nicely.
      
      This patch does a few changes:
        - first archive jobs, and only afterwards submit new ones; this fixes
          the case where the queue is already full and there are jobs suited
          for archiving (but not the case where the jobs are all too young to
          be archived)
        - handle the job queue full and drained cases nicely: instead of
          tracebacks, log such cases cleanly
        - reverse the initial value and special cases for update_file; we now
          whitelist instead of blacklist cases, since we have many more
          blacklist cases than vice versa, and we set the flag to True only
          after a successful run
      
      The last change, especially, is a significant one: now errors during the
      watcher run will not update the status file, and thus they won't be lost
      again in the logs.
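
      In sketch form (the control flow described above, with assumed
      exception and helper names):

          # Sketch only: archive first, submit after, whitelist update_file.
          import logging

          class JobQueueFull(Exception): pass
          class JobQueueDrainError(Exception): pass

          def ArchiveJobs(): pass          # archive old jobs first ...
          def SubmitVerifyJob(): pass      # ... then add new watcher jobs
          def WriteWatcherStatusFile(): pass

          update_file = False              # set to True only on success
          try:
            ArchiveJobs()
            SubmitVerifyJob()
            update_file = True
          except JobQueueFull:
            logging.error("Job queue full, will retry on the next run")
          except JobQueueDrainError:
            logging.error("Job queue drained, not submitting jobs")

          if update_file:
            WriteWatcherStatusFile()       # errors no longer refresh the file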
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Michael Hanselmann <hansmi@google.com>
  15. May 21, 2009
    • Add a luxi call for multi-job submit · 2971c913
      Iustin Pop authored
      
      As a workaround for the job submit timeouts that we have, this patch
      adds a new luxi call for multi-job submit; the advantage is that all
      the jobs are added to the queue first, and only afterwards can the
      workers start processing them.
      
      This is definitely faster than per-job submit, where the submission of
      new jobs competes with the workers processing jobs.
      
      On a pure no-op OpDelay opcode (not on master, not on nodes), we have:
        - 100 jobs:
          - individual: submit time ~21s, processing time ~21s
          - multiple:   submit time 7-9s, processing time ~22s
        - 250 jobs:
          - individual: submit time ~56s, processing time ~57s
                        run 2:      ~54s                  ~55s
          - multiple:   submit time ~20s, processing time ~51s
                        run 2:      ~17s                  ~52s
      
      which shows that we indeed gain on the client side, and maybe even on
      the total processing time for a high number of jobs. For just 10 or so
      jobs I expect the difference to be just noise.
      
      This will probably require increasing the timeout a little when
      submitting too many jobs - 250 jobs at ~20 seconds is close to the
      current rw timeout of 60s.
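
      Conceptually (a generic producer/consumer sketch, not Ganeti's actual
      job queue code), the gain comes from taking the queue lock once for the
      whole batch instead of once per job:

          # Generic sketch of batch enqueue vs. per-job enqueue.
          import threading
          from collections import deque

          _lock = threading.Lock()
          _has_work = threading.Condition(_lock)
          _jobs = deque()

          def submit_many(jobs):
            # Append the whole batch under one lock acquisition, then wake
            # the workers; per-job submission would instead fight the workers
            # for this lock on every single job.
            with _lock:
              _jobs.extend(jobs)
              _has_work.notify_all()

          def worker():
            while True:
              with _lock:
                while not _jobs:
                  _has_work.wait()
                job = _jobs.popleft()
              job()   # process outside the lock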
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
  16. May 20, 2009
  17. May 19, 2009
    • watcher: try to restart the master if down · 7dfb83c2
      Iustin Pop authored
      
      Bugs in either our code or in associated libraries can bring the master daemon
      down, and this (due to the 2.0 architecture) stops all work on the cluster.
      
      Since the watcher already does periodic checks on the cluster, we modify
      it to try to start the master automatically in case of failures to
      connect. This will be tried only once per cycle.
      
      Also, in this case, we modify the code so that the watcher status file
      is not updated - its timestamp will thus reflect the time of the last
      successful connection to the master.
      
      Side note: the except errors.ConfigurationError part could be cleaned
      up, since in 2.0 we don't usually get that directly, and if we do it's
      an error and we shouldn't touch the file anyway; but that is not an
      rc5 change.
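
      The described behaviour boils down to roughly the following (a sketch
      with assumed helper names, run at most once per watcher cycle):

          # Sketch: try to connect, start the master once, retry, give up.
          import logging

          def ConnectToMaster():
            raise OSError("connection refused")   # stand-in for luxi setup

          def StartMasterDaemon():
            logging.info("trying to start ganeti-masterd")

          def TouchWatcherStateFile():
            pass

          try:
            client = ConnectToMaster()
          except OSError:
            StartMasterDaemon()                   # only once per cycle
            try:
              client = ConnectToMaster()
            except OSError:
              logging.error("master daemon still unreachable, giving up")
              # Deliberately do NOT touch the watcher state file here, so
              # its timestamp keeps the last successful connection time.
              raise SystemExit(1)

          TouchWatcherStateFile()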
      
      Signed-off-by: Iustin Pop <iustin@google.com>
  18. May 06, 2009
  19. May 05, 2009
  20. May 04, 2009
  21. Apr 06, 2009
    • Disable synchronous (locking) queries · 77921a95
      Iustin Pop authored
      This patch raises an error in the master daemon in case the user
      requests a locking query; accordingly, all clients were modified to
      send only lockless queries. This is a short-term fix; for a proper fix,
      the clients should be modified to submit a job when the user requests a
      locking query.

      The other approach would be to ignore the flag passed by the client;
      this would be worse, as clients wouldn't even get an error.
      
      The possible impact of this is twofold:
        - some commands might not have been converted, and will thus fail;
          this can be remedied easily
        - the consistency of commands is lost; e.g. node failover will not
          lock the node *while we get the node info*, so we could miss some
          data; this is again in the vein of the atomic operations which are
          missing in the current query-and-act model of the gnt-* scripts
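
      The master-side check described above amounts to something like this
      sketch (handler and exception names are assumptions for illustration):

          # Sketch: reject locking queries on the master daemon side.
          class QueryLockingError(Exception):
            """Raised when a client requests a (disabled) locking query."""

          def HandleQueryNodes(names, fields, use_locking):
            if use_locking:
              # Locking queries are disabled for now; clients must send
              # lockless queries (or, as the proper fix, submit a job).
              raise QueryLockingError("Sync queries are not allowed")
            return _QueryWithoutLocks(names, fields)

          def _QueryWithoutLocks(names, fields):
            return []   # placeholder for the real lockless query path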
      
      Reviewed-by: imsnah, ultrotter
    • Fix the output of watcher on non-master nodes · 2c404217
      Iustin Pop authored
      Currently the watcher spews error messages on non-master nodes. This
      cleans it up.
      
      Reviewed-by: imsnah