Commits · 600eeb319f30463e7feea89306dabe794c0d4a03 · itminedu / snf-ganeti

Jun 22, 2011

Add network query LUXI method · 600eeb31

Apollon Oikonomopoulos authored 13 years ago


Signed-off-by: Apollon Oikonomopoulos <apollon@noc.grnet.gr>

600eeb31

May 02, 2011

luxi: do not handle KeyboardInterrupt · d143f2c6

Iustin Pop authored 13 years ago


With the current code, it's possible to mistake a ^C for a protocol
error:

node1# gnt-job info 221691
[press ^C]
Unhandled protocol error while talking to the master daemon:
Error while deserializing response:

(and note empty error message).
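
A minimal sketch of the distinction, assuming the deserialization sat inside a catch-all handler. `parse_response` and `ProtocolError` are illustrative names here, not the actual Ganeti code:

```python
import json


class ProtocolError(Exception):
    """Illustrative stand-in for a LUXI protocol error."""


def parse_response(raw, loader=json.loads):
    """Deserialize a response, wrapping only genuine failures.

    "except Exception" does not catch KeyboardInterrupt, which derives
    from BaseException, so a ^C propagates unchanged instead of being
    reported as a protocol error.
    """
    try:
        return loader(raw)
    except Exception as err:
        raise ProtocolError("Error while deserializing response: %s" % err)
```

A bare `except:` in the same place would also trap the interrupt, which would be consistent with the empty error message shown above.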

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

d143f2c6

Jan 07, 2011

luxi.Client: Add function to close connection · 2a917701

Michael Hanselmann authored 14 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

2a917701

Jan 06, 2011

Convert “gnt-debug locks” to query2 · 24d16f76

Michael Hanselmann authored 14 years ago


Locks can now be queried using “Query(what="lock", …)” over LUXI.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

24d16f76

Dec 20, 2010

Fix timeout handling in LUXI client · 28e3e216

Michael Hanselmann authored 14 years ago


If the socket can't be read in time, it raises “socket.timeout”, for
which there is special handling code. Unfortunately the exception blocks
were in the wrong order, so the “socket.error” handler caught it first.
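
The ordering problem can be sketched as follows. `RecvTimeout`, `TransportError` and `recv_checked` are hypothetical names used only for illustration:

```python
import socket


class RecvTimeout(Exception):
    """Hypothetical error raised when a read times out."""


class TransportError(Exception):
    """Hypothetical error raised for any other socket failure."""


def recv_checked(sock, size=4096):
    """Read from sock, mapping socket exceptions to distinct errors."""
    try:
        return sock.recv(size)
    # socket.timeout is a subclass of socket.error, so its handler must
    # come first; otherwise the generic handler below would catch it.
    except socket.timeout:
        raise RecvTimeout("timed out while receiving")
    except socket.error as err:
        raise TransportError("receive failed: %s" % err)
```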

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

28e3e216

Dec 13, 2010

LUXI: Add Query and QueryFields functions · 28b71a76

Michael Hanselmann authored 14 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

28b71a76

Dec 01, 2010

Querying node groups: add luxi.REQ_QUERY_GROUPS · a79ef2a5

Adeodato Simo authored 14 years ago


This also updates masterd.py.

Signed-off-by: Adeodato Simo <dato@google.com>
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

a79ef2a5

Nov 01, 2010

luxi: disable two lint errors · 2317945a

Guido Trotter authored 14 years ago


This is already disabled for the same type of request a couple of lines
above. The new code was introduced in e986f20c but didn't have the
disables.

Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

2317945a

Oct 28, 2010

Add support and checks for version in LUXI · e986f20c

Michael Hanselmann authored 14 years ago


A new constant, LUXI_VERSION, is used to verify the peer's version. The
version is optional, so old(er) clients and servers talking to peers not
supporting it won't break. Example with mismatching library:

$ gnt-instance list
Unhandled Ganeti error: LUXI version mismatch, server 2020000, request
1010000
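
A rough sketch of an optional version check of this kind. The field name `"version"`, the `VersionMismatch` class and `check_version` are assumptions for illustration; the numeric value is taken from the example output above:

```python
LUXI_VERSION = 2020000  # illustrative value, from the example above

KEY_VERSION = "version"  # assumed name of the optional message field


class VersionMismatch(Exception):
    """Hypothetical error for a peer speaking another LUXI version."""


def check_version(message, own_version=LUXI_VERSION):
    """Verify the peer's version if, and only if, the peer sent one."""
    peer_version = message.get(KEY_VERSION)
    if peer_version is not None and peer_version != own_version:
        raise VersionMismatch(
            "LUXI version mismatch, server %s, request %s"
            % (own_version, peer_version))
    return message
```

Treating a missing field as acceptable is what keeps peers without version support working.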

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

e986f20c

luxi.ProtocolError: Derive from errors.LuxiError · 7a8bda3f

Michael Hanselmann authored 14 years ago


This allows LUXI errors to be encoded and serialized.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

7a8bda3f

Aug 24, 2010

Add simple lock monitor · 19b9ba9a

Michael Hanselmann authored 14 years ago


This patch adds an initial implementation of a lock monitor, accessible
to the user through “gnt-debug locks”. It currently shows all resource
locks: BGL, nodes and instances. Config and job queue locks could be
shown too, but wouldn't be of much help. The current owner(s) and mode
are also shown.

Showing pending acquires will require further changes on the SharedLock
internals and is not yet implemented.

Example output:
$ gnt-debug locks -o name,mode,owner
Name            Mode      Owner
BGL/BGL         shared    JobQueue19/Job147
instances/inst1 exclusive JobQueue19/Job147
instances/inst2 -         -
instances/inst3 -         -
instances/inst4 -         -
nodes/node1     exclusive JobQueue19/Job147
nodes/node2     exclusive JobQueue19/Job147

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

19b9ba9a

Jul 28, 2010

luxi: convert permission errors into exception · 5a1c22fe

Iustin Pop authored 14 years ago


This patch adds handling of permission errors so that we don't show
tracebacks when a non-root user runs a gnt-* command. Since in the
future we'll have different permissions, we need to handle this in RAPI
too.

It also fixes a typo in a RAPI error message and in the docstrings of
LUXI errors.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

5a1c22fe

May 18, 2010

Abstract the LUXI eom into a constant · 25942a6c

Guido Trotter authored 14 years ago


Currently the EOM terminator is hardcoded on the server side, and is
customizable in the Transport object (with the default being the same as
the value found in the server), but not in the luxi client.

With this patch we move the value to constants, and remove the "fake"
customizability, which would just break client/server communication. If
we ever need to have a luxi transport with a different terminator it's
easy enough to add it back.
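
A small sketch of terminator-based framing with a single shared constant. The byte value chosen for `LUXI_EOM` is an assumption (the commit message does not state it), and `frame` and `split_messages` are hypothetical helpers:

```python
LUXI_EOM = b"\x03"  # assumed terminator value, shared by client and server


def frame(payload):
    """Append the terminator to an outgoing message."""
    if LUXI_EOM in payload:
        raise ValueError("payload must not contain the terminator")
    return payload + LUXI_EOM


def split_messages(buf):
    """Split a receive buffer into complete messages and a remainder."""
    parts = buf.split(LUXI_EOM)
    return parts[:-1], parts[-1]
```

Because both sides read the same constant, they cannot disagree about where a message ends.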

Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

25942a6c

May 11, 2010

RAPI: Allow waiting for job changes · 793a8f7c

Michael Hanselmann authored 14 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

793a8f7c

Feb 22, 2010

Show message when job is waiting in queue or for locks · f4484122

Michael Hanselmann authored 15 years ago


Jobs submitted via the standard command line utilities didn't give any
indication that anything was happening while they were waiting in the job
queue (e.g. due to other jobs using all worker threads) or acquiring
locks. This could be very confusing for people not familiar with Ganeti's
architecture. Now they'll show a message after the first WaitForJobChanges
timeout.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

f4484122

Handle EAGAIN in LUXI client · cb462b06

Michael Hanselmann authored 15 years ago


If too many clients try to connect to the master at the same time, some of
them might fail if the master doesn't accept the connections fast enough.
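
One way to handle this, sketched with hypothetical names and defaults, is to retry the connection attempt while it fails with EAGAIN:

```python
import errno
import socket
import time


def connect_with_retry(connect, attempts=10, delay=0.1, sleep=time.sleep):
    """Call connect(), retrying while it fails with EAGAIN.

    connect is any zero-argument callable, for example
    lambda: sock.connect(path). All parameter names and defaults here
    are illustrative.
    """
    for attempt in range(attempts):
        try:
            return connect()
        except socket.error as err:
            # Re-raise anything that is not EAGAIN, and give up after
            # the last attempt.
            if err.errno != errno.EAGAIN or attempt == attempts - 1:
                raise
            sleep(delay)
```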

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

cb462b06

Jan 22, 2010

Factorize LUXI parsing and handling code · 231db3a5

Michael Hanselmann authored 15 years ago


Also fix a typo in http/__init__.py and add unittests
for the LUXI parsing and formatting functions.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

231db3a5

Simplify LUXI exceptions · 797506fc

Michael Hanselmann authored 15 years ago


Having only one exception hierarchy makes catching them simpler.
Previously, ProtocolError derived directly from Exception, but with this
patch it'll also be in the hierarchy defined by the ganeti.errors module.
Separating encoding and decoding errors is not necessary at this point
as they're never handled separately, and merging them removes a few
lines from the code.
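
As a sketch of the idea, with `GenericError` standing in for the root of the ganeti.errors hierarchy (an assumed name) and `call_safely` as a hypothetical caller:

```python
class GenericError(Exception):
    """Stand-in for the root of the ganeti.errors hierarchy."""


class ProtocolError(GenericError):
    """Single error for both encoding and decoding failures."""


def call_safely(func):
    """Run func, reporting any error from the shared hierarchy."""
    try:
        return func()
    except GenericError as err:
        return "error: %s" % err
```

A caller needs only one `except` clause to cover protocol errors along with every other error in the hierarchy.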

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

797506fc

Jan 05, 2010

Introduce a Luxi call for GetTags · 7699c3af

Iustin Pop authored 15 years ago


This switches the cli scripts from submitting jobs to get the tags to
using queries, which (since the tags query is a cheap one) should be
much faster.

The tags queries are already done without locks (in the generic query
paths for instances/nodes/cluster), so this shouldn't break tags query
via gnt-* list-tags.

On a small cluster, the runtime of gnt-cluster/gnt-instance list-tags
more than halves; on a big cluster (with many MCs) I expect it to be
more than 5 times faster. The speed of the tag retrieval is not the main
gain; eliminating a job when a simple query is enough is.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

7699c3af

Jan 04, 2010

Add targeted pylint disables · 7260cfbe

Iustin Pop authored 15 years ago


This patch should have only:

- pylint disables
- docstring changes
- whitespace changes

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Olivier Tharan <olive@google.com>

7260cfbe

Oct 13, 2009

luxi: Pass socket path directly to exception, not in tuple · 63d96e4c

Michael Hanselmann authored 15 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

63d96e4c

Sep 25, 2009

Move the luxi error handling into errors.py · a6607331

Iustin Pop authored 15 years ago


Currently the luxi error handling is hardcoded as special encoding on
the masterd-side and special decoding on the client side. This patch
moves it to errors.py such that other parts of the code can reuse the
same encoding.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
(cherry picked from commit 6956e9cd)

a6607331

Aug 27, 2009

Move the luxi error handling into errors.py · 6956e9cd

Iustin Pop authored 15 years ago


Currently the luxi error handling is hardcoded as special encoding on
the masterd-side and special decoding on the client side. This patch
moves it to errors.py such that other parts of the code can reuse the
same encoding.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

6956e9cd

Aug 26, 2009

Add file to pause watcher for a certain duration · 05e50653

Michael Hanselmann authored 15 years ago


This can be used during maintenance work.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

05e50653

Jul 19, 2009

Add a luxi call for multi-job submit · 56d8ff91

Iustin Pop authored 15 years ago


As a workaround for the job submit timeouts that we have, this patch
adds a new luxi call for multi-job submit; the advantage is that all the
jobs are added to the queue first, and only afterwards can the workers
start processing them.

This is definitely faster than per-job submit, where the submission of
new jobs competes with the workers processing jobs.

On a pure no-op OpDelay opcode (not on master, not on nodes), we have:
  - 100 jobs:
    - individual: submit time ~21s, processing time ~21s
    - multiple:   submit time 7-9s, processing time ~22s
  - 250 jobs:
    - individual: submit time ~56s, processing time ~57s
                  run 2:      ~54s                  ~55s
    - multiple:   submit time ~20s, processing time ~51s
                  run 2:      ~17s                  ~52s

which shows that we indeed gain on the client side, and maybe even on
the total processing time for a high number of jobs. For just 10 or so I
expect the difference to be just noise.

This will probably require increasing the timeout a little when
submitting too many jobs - 250 jobs at ~20 seconds is close to the
current rw timeout of 60s.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
(cherry picked from commit 2971c913)

56d8ff91

Jul 07, 2009

Fix pylint warnings · 7c4d6c7b

Michael Hanselmann authored 15 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

7c4d6c7b

Fix problem with EAGAIN on socket connection in clients · 6096ee13

Michael Hanselmann authored 15 years ago

If a user used ^Z to stop the program, poll() in socket.recv would return
EAGAIN due to SIGSTOP. This patch changes luxi.Transport.Recv to ignore EAGAIN.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

6096ee13

Fix some typos · 5bbd3f7f

Michael Hanselmann authored 15 years ago


Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

5bbd3f7f

May 21, 2009

Add a luxi call for multi-job submit · 2971c913

Iustin Pop authored 15 years ago


As a workaround for the job submit timeouts that we have, this patch
adds a new luxi call for multi-job submit; the advantage is that all the
jobs are added to the queue first, and only afterwards can the workers
start processing them.

This is definitely faster than per-job submit, where the submission of
new jobs competes with the workers processing jobs.

On a pure no-op OpDelay opcode (not on master, not on nodes), we have:
  - 100 jobs:
    - individual: submit time ~21s, processing time ~21s
    - multiple:   submit time 7-9s, processing time ~22s
  - 250 jobs:
    - individual: submit time ~56s, processing time ~57s
                  run 2:      ~54s                  ~55s
    - multiple:   submit time ~20s, processing time ~51s
                  run 2:      ~17s                  ~52s

which shows that we indeed gain on the client side, and maybe even on
the total processing time for a high number of jobs. For just 10 or so I
expect the difference to be just noise.

This will probably require increasing the timeout a little when
submitting too many jobs - 250 jobs at ~20 seconds is close to the
current rw timeout of 60s.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

2971c913

Feb 04, 2009

Add one new luxi query: cluster info · 66baeccc

Iustin Pop authored 16 years ago

This is the last query that RAPI executes via opcodes and is purely
static (config values only). As such, we can convert it safely to a
query instead of a job.

Reviewed-by: imsnah

66baeccc

Implement lockless query operations · ec79568d

Iustin Pop authored 16 years ago

This patch adds the framework for, and enables lockless OpQueryInstances. This
means that instances will be shown in ERROR_up or ERROR_down state, even though
this is not an error (but just an in-progress job).

The framework is implemented as follows:
  - the OpQueryInstances, OpQueryNodes and OpQueryExports opcodes take
    an additional “use_locking” flag which will denote whether to lock
    or not; this patch only implements this for LUQueryInstances
  - the luxi query functions take an additional argument use_locking
    which is passed to the master daemon, and then passed to the above
    opcodes
  - cli.py exports a new SYNC_OPT command line option which sets this
    flag to True
  - except for gnt-instance list, which uses this option, and for
    name-only queries (e.g. QueryNodes(fields=["names"])), all other
    callers set this flag to True
  - RAPI also sets the flag to True

The patch was tested with a continuous (0.2s sleep in-between)
gnt-instance list during a burnin, and no problems were observed.

Reviewed-by: ultrotter

ec79568d

Jan 22, 2009

luxi: close and reopen the socket on errors · 8d5b316c

Iustin Pop authored 16 years ago

This is less of an actual issue for regular gnt-* clients, but it's
easily reproducible with burnin and possible with RAPI (depending on how
the program uses luxi.Client(s)).

In case of burnin, if we interrupt the client (^C) while it polls the
job, it will abort and raise an error. After that, burnin issues a
remove instance job, and at this point, we send the submit job (remove)
call but the first thing we read from the socket will be the response to
the previous poll job request, since that was queued already from the
master.

To solve this, whenever we detect an error in Transport.Call(), we close
that transport and create a new one, to start anew. The alternative
would be to introduce a sequence number to the protocol, but that would
be a design-level change and is not recommended at this stage.
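
The approach can be sketched like this. `ReconnectingClient` and the `transport_factory` callable are hypothetical, and the sketch creates the replacement transport lazily on the next call rather than immediately:

```python
class ReconnectingClient(object):
    """Sketch of a client that discards its transport after any error."""

    def __init__(self, transport_factory):
        # transport_factory is a hypothetical callable returning an
        # object with call(data) and close() methods.
        self._factory = transport_factory
        self._transport = None

    def call(self, data):
        if self._transport is None:
            self._transport = self._factory()
        try:
            return self._transport.call(data)
        except BaseException:
            # A stale response may still be queued on the old
            # connection, so drop it; the next call starts on a fresh
            # transport. BaseException also covers ^C.
            self._transport.close()
            self._transport = None
            raise
```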

Reviewed-by: imsnah

8d5b316c

Jan 20, 2009

Fix a typo in luxi's docstring · 7577196d

Guido Trotter authored 16 years ago

Reviewed-by: iustinp

7577196d

Dec 18, 2008

Prevent RPC timeout on auto-archiving jobs · f8ad5591

Michael Hanselmann authored 16 years ago

With a large job queue, auto-archiving jobs can take a very long time,
causing timeouts on the luxi RPC layer. With this change, auto-
archive returns after half of the RPC timeout has passed. The user
will see how many jobs are left unchecked.
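
A minimal sketch of a time-bounded archive loop of this kind. `archive_until_deadline`, `archive_one` and the injectable `clock` are hypothetical names used for illustration:

```python
import time


def archive_until_deadline(jobs, archive_one, rpc_timeout, clock=time.time):
    """Archive jobs until half of rpc_timeout has elapsed.

    Returns (archived_count, jobs_left_unchecked). archive_one is a
    hypothetical callable that archives a single job.
    """
    deadline = clock() + rpc_timeout / 2.0
    archived = 0
    for index, job in enumerate(jobs):
        if clock() >= deadline:
            return archived, len(jobs) - index
        archive_one(job)
        archived += 1
    return archived, 0
```

Returning the number of unchecked jobs lets the caller report how much work remains.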

Reviewed-by: ultrotter

f8ad5591

Oct 16, 2008

Add an interface for the drain flag changes/query · 3ccafd0e

Iustin Pop authored 16 years ago

This adds the set/reset in the jqueue and luxi modules, and a way to
query it in OpQueryConfigValues, and also the command line interface for
it:
$ gnt-cluster queue info
The drain flag is unset
$ gnt-cluster queue drain
$ gnt-cluster queue info
The drain flag is set
$ gnt-cluster queue undrain
$ gnt-cluster queue info
The drain flag is unset

The setting is made via luxi and not an opcode because opcodes can't be
executed when the queue is drained. The query, however, is not done via
luxi, since in the future the flag might become a cluster property as
opposed to a node one.

Reviewed-by: imsnah

3ccafd0e

Oct 15, 2008

Implement transport of ganeti errors across luxi · 6797ec29

Iustin Pop authored 16 years ago

This patch adds a generic method to identify the ganeti error given its
class name, and implements this across the luxi protocol.
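
A rough sketch of looking up an error by its class name. The helper names and the explicit `ERROR_CLASSES` registry are assumptions for illustration, and the two classes are local stand-ins rather than imports from Ganeti:

```python
class GenericError(Exception):
    """Stand-in for the base Ganeti error class."""


class OpPrereqError(GenericError):
    """Example subclass, used to demonstrate lookup by name."""


# Explicit name-to-class registry used by this sketch.
ERROR_CLASSES = {cls.__name__: cls for cls in (GenericError, OpPrereqError)}


def encode_exception(err):
    """Turn an exception into a serializable (class name, args) pair."""
    return (err.__class__.__name__, list(err.args))


def get_error_class(name):
    """Return the registered error class called name, or None."""
    return ERROR_CLASSES.get(name)


def maybe_raise(encoded):
    """Re-raise an encoded error on the receiving side, if recognised."""
    name, args = encoded
    cls = get_error_class(name)
    if cls is not None:
        raise cls(*args)
```

Names that are not registered are ignored, so arbitrary classes cannot be instantiated from data received over the wire.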

Reviewed-by: imsnah

6797ec29

Oct 06, 2008

Implement job auto-archiving · 07cd723a

Iustin Pop authored 16 years ago

This patch adds a new luxi call that implements auto-archiving of jobs
older than a certain age (or -1 for all completed jobs), and the gnt-job
command that makes use of this (with 'all' for -1).

Reviewed-by: imsnah

07cd723a

Oct 01, 2008

Add new query to get cluster config values · ae5849b5

Michael Hanselmann authored 16 years ago

This can be used to retrieve certain cluster config values from
within clients.

OpDumpClusterConfig was not used anywhere, hence I'm just reusing
it. The way ConfigWriter.DumpConfig returned the configuration
was not thread-safe, anyway (no deepcopy).

Reviewed-by: iustinp

ae5849b5

Aug 29, 2008

Make WaitForJobChanges deal with long jobs · 5c735209

Iustin Pop authored 16 years ago

This patch alters the WaitForJobChanges luxi-RPC call to have a
configurable timeout, so that the call behaves nicely with long jobs
that have no update.

We do this by adding a timeout parameter in the RPC call, and returning
a special constant when the timeout is reached without an update. The
luxi client will repeatedly call the WaitForJobChanges until it gets a
real change. The timeout is hardcoded as half the RWTO value.
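
The client-side loop described above can be sketched as follows. `JOB_NOTCHANGED`, `wait_once` and the 60-second default are illustrative assumptions:

```python
JOB_NOTCHANGED = "nochange"  # hypothetical sentinel: timeout, no update


def wait_for_job_change(wait_once, job_id, rw_timeout=60):
    """Poll until the server reports a real change.

    wait_once(job_id, timeout) is a hypothetical single RPC call that
    returns JOB_NOTCHANGED when the timeout expires without an update.
    """
    timeout = rw_timeout / 2.0  # half the read/write timeout
    while True:
        result = wait_once(job_id, timeout)
        if result != JOB_NOTCHANGED:
            return result
```

Each individual call returns well within the read/write timeout, so a long-running job no longer causes the RPC itself to time out.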

The patch also removes an unused variable (new_state) from the
WaitForJobChanges method.

Reviewed-by: imsnah,ultrotter

5c735209

Aug 28, 2008

Fix error message when masterd is not listening · 082c5adb

Michael Hanselmann authored 16 years ago

Reported by Iustin.

Reviewed-by: iustinp

082c5adb