  1. Feb 22, 2010
  2. Jan 22, 2010
  3. Jan 05, 2010
    • Introduce a Luxi call for GetTags · 7699c3af
      Iustin Pop authored
      
      This changes tag retrieval in the cli scripts from submitting jobs
      to plain queries, which (since the tags query is a cheap one) should
      be much faster.
      
      The tags queries are already done without locks (in the generic query
      paths for instances/nodes/cluster), so this shouldn't break tags query
      via gnt-* list-tags.
      
      On a small cluster, the runtime of gnt-cluster/gnt-instance list-tags
      more than halves; on a big cluster (with many MCs, i.e. master
      candidates) I expect it to be more than 5 times faster. The main gain
      is not the speed of the tags get itself, but eliminating a whole job
      when a simple query is enough.
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: René Nussbaumer <rn@google.com>
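
      Roughly, the client-side difference looks like the sketch below. The
      client class and its method names are illustrative stand-ins, not the
      real ganeti luxi API:

        class FakeLuxiClient:
            """Stand-in for a luxi-style client; names are assumptions."""
            def SubmitJob(self, ops):
                # models the old path: enqueue a job and hand back an id to poll
                return 1234
            def WaitForJob(self, job_id):
                # the result only becomes available once a worker has run the job
                return [["tag1", "tag2"]]
            def QueryTags(self, kind, name):
                # models the new path: a single direct query, no job involved
                return ["tag1", "tag2"]

        cl = FakeLuxiClient()

        # Before: a full job round-trip just to read tags.
        job_id = cl.SubmitJob([("OP_TAGS_GET", "cluster", "")])
        old_tags = cl.WaitForJob(job_id)[0]

        # After: one cheap, lockless query.
        new_tags = cl.QueryTags("cluster", "")
        assert set(old_tags) == set(new_tags)
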
  4. Jan 04, 2010
  5. Oct 13, 2009
  6. Sep 25, 2009
  7. Aug 27, 2009
  8. Aug 26, 2009
  9. Jul 19, 2009
    • Add a luxi call for multi-job submit · 56d8ff91
      Iustin Pop authored
      
      As a workaround for the job submit timeouts that we have, this patch
      adds a new luxi call for multi-job submit; the advantage is that all
      the jobs are added to the queue first, and only afterwards can the
      workers start processing them.
      
      This is definitely faster than per-job submit, where the submission of
      new jobs competes with the workers processing jobs.
      
      On a pure no-op OpDelay opcode (not on master, not on nodes), we have:
        - 100 jobs:
          - individual: submit time ~21s, processing time ~21s
          - multiple:   submit time 7-9s, processing time ~22s
        - 250 jobs:
          - individual: submit time ~56s, processing time ~57s
                        (run 2: ~54s and ~55s respectively)
          - multiple:   submit time ~20s, processing time ~51s
                        (run 2: ~17s and ~52s respectively)
      
      which shows that we indeed gain on the client side, and maybe even on
      the total processing time for a high number of jobs. For just 10 or so
      jobs I expect the difference to be mere noise.
      
      This will probably require increasing the timeout a little when
      submitting too many jobs - 250 jobs at ~20 seconds is close to the
      current rw timeout of 60s.
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
      (cherry picked from commit 2971c913)
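
      The client-side shape of the change, again with an illustrative
      stand-in client rather than the real luxi module (call names are
      assumptions):

        class FakeLuxiClient:
            """Stand-in client contrasting per-job and batched submission."""
            def __init__(self):
                self.queue = []
            def SubmitJob(self, ops):
                # in the real protocol this is one RPC per job, and each call
                # competes with workers already processing earlier jobs
                self.queue.append(ops)
                return len(self.queue)
            def SubmitManyJobs(self, jobs):
                # in the real protocol this is a single RPC; workers only
                # start once the whole batch has been queued
                return [self.SubmitJob(ops) for ops in jobs]

        cl = FakeLuxiClient()
        jobs = [[("OP_TEST_DELAY", {"duration": 0.0})] for _ in range(250)]

        old_ids = [cl.SubmitJob(ops) for ops in jobs]   # 250 round-trips
        new_ids = cl.SubmitManyJobs(jobs)               # one round-trip
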
  10. Jul 07, 2009
  11. May 21, 2009
    • Add a luxi call for multi-job submit · 2971c913
      Iustin Pop authored
      
      As a workaround for the job submit timeouts that we have, this patch
      adds a new luxi call for multi-job submit; the advantage is that all
      the jobs are added to the queue first, and only afterwards can the
      workers start processing them.
      
      This is definitely faster than per-job submit, where the submission of
      new jobs competes with the workers processing jobs.
      
      On a pure no-op OpDelay opcode (not on master, not on nodes), we have:
        - 100 jobs:
          - individual: submit time ~21s, processing time ~21s
          - multiple:   submit time 7-9s, processing time ~22s
        - 250 jobs:
          - individual: submit time ~56s, processing time ~57s
                        (run 2: ~54s and ~55s respectively)
          - multiple:   submit time ~20s, processing time ~51s
                        (run 2: ~17s and ~52s respectively)
      
      which shows that we indeed gain on the client side, and maybe even on
      the total processing time for a high number of jobs. For just 10 or so
      jobs I expect the difference to be mere noise.
      
      This will probably require increasing the timeout a little when
      submitting too many jobs - 250 jobs at ~20 seconds is close to the
      current rw timeout of 60s.
      
      Signed-off-by: Iustin Pop <iustin@google.com>
      Reviewed-by: Guido Trotter <ultrotter@google.com>
  12. Feb 04, 2009
    • Add one new luxi query: cluster info · 66baeccc
      Iustin Pop authored
      This is the last query that RAPI executes via opcodes, and it is
      purely static (config values only). As such, we can safely convert it
      to a query instead of a job.
      
      Reviewed-by: imsnah
    • Implement lockless query operations · ec79568d
      Iustin Pop authored
      This patch adds the framework for, and enables, lockless
      OpQueryInstances. This means that instances may be shown in ERROR_up
      or ERROR_down state even though this is not an error, but just an
      in-progress job.
      
      The framework is implemented as follows:
        - the OpQueryInstances, OpQueryNodes and OpQueryExports opcodes take
          an additional “use_locking” flag which will denote whether to lock
          or not; this patch only implements this for LUQueryInstances
        - the luxi query functions take an additional argument use_locking
          which is passed to the master daemon, and then passed to the above
          opcodes
        - cli.py exports a new SYNC_OPT command line option which implements
          setting this flag to True
        - except for gnt-instance list, which uses this option, and for
          name-only queries (e.g. QueryNodes(fields=["names"])), all other
          callers set this flag to True (see the sketch after this entry)
        - RAPI also sets the flag to True
      
      The patch was tested with a continuous (0.2s sleep in-between)
      gnt-instance list during a burnin, and no problems were observed.
      
      Reviewed-by: ultrotter
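
      A minimal sketch of the flag's intent (function and field names below
      are illustrative, not the real opcode/luxi API): display-only callers
      can tolerate slightly stale data and skip locking, while callers that
      act on the result keep use_locking=True.

        def query_instances(fields, names=None, use_locking=True):
            """Stand-in for a QueryInstances-style luxi call."""
            # Without locks the answer may show an instance as
            # ERROR_up/ERROR_down while a job is still moving it;
            # that is informational, not fatal.
            return [{"name": n, "requested": fields, "locked": use_locking}
                    for n in (names or ["inst1.example.com"])]

        # gnt-instance list: purely informational, so lockless by default
        # (the SYNC_OPT option would flip this back to a locked query).
        listing = query_instances(["name", "status"], use_locking=False)

        # A caller that acts on the answer keeps the locked, consistent view.
        consistent = query_instances(["name", "pnode"],
                                     names=["inst1.example.com"],
                                     use_locking=True)
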
  13. Jan 22, 2009
    • luxi: close and reopen the socket on errors · 8d5b316c
      Iustin Pop authored
      This is less of an actual issue for regular gnt-* clients, but it's
      easily reproducible with burnin and possible with RAPI (depending on how
      the program uses luxi.Client(s)).
      
      In case of burnin, if we interrupt the client (^C) while it polls the
      job, it will abort and raise an error. After that, burnin issues a
      remove instance job, and at this point we send the submit job
      (remove) call, but the first thing we read from the socket is the
      response to the previous poll request, since that response was
      already queued by the master.
      
      To solve this, whenever we detect an error in Transport.Call(), we
      close that transport and create a new one, starting from a clean
      connection. The alternative would be to introduce a sequence number
      in the protocol, but that would be a design-level change and is not
      recommended at this stage.
      
      Reviewed-by: imsnah
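
      A minimal model of the fix (not the real Transport class): any error
      inside Call() discards the connection, so a stray response left over
      from an interrupted call can never be read as the answer to the next
      request.

        class Transport:
            def __init__(self, connection_factory):
                self._factory = connection_factory
                self._conn = connection_factory()

            def Call(self, request):
                try:
                    self._conn.send(request)
                    return self._conn.recv()
                except Exception:
                    # On any failure, drop the socket and run the next call
                    # on a fresh connection instead of a possibly desynced one.
                    try:
                        self._conn.close()
                    finally:
                        self._conn = self._factory()
                    raise

        class _FakeConn:
            """Tiny fake connection so the sketch runs standalone."""
            def send(self, data): pass
            def recv(self): return b'{"success": true, "result": null}'
            def close(self): pass

        t = Transport(_FakeConn)
        reply = t.Call(b'{"method": "QueryClusterInfo", "args": []}')
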
  14. Jan 20, 2009
  15. Dec 18, 2008
    • Prevent RPC timeout on auto-archiving jobs · f8ad5591
      Michael Hanselmann authored
      With a large job queue, auto-archiving jobs can take a very long time,
      causing timeouts on the luxi RPC layer. With this change,
      auto-archive returns after half of the RPC timeout has passed. The
      user will see how many jobs are left unchecked.
      
      Reviewed-by: ultrotter
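
      Roughly, the time-bounded pass looks like the sketch below (the 60s
      timeout value is taken from the other commits in this log; the helper
      names are made up):

        import time

        RPC_TIMEOUT = 60.0   # assumed luxi read/write timeout

        def auto_archive(job_ids, archive_one, budget=RPC_TIMEOUT / 2):
            """Archive jobs until the time budget runs out.

            Returns (archived, unchecked) so the caller can tell the user
            how many jobs were left unexamined.
            """
            deadline = time.time() + budget
            archived = 0
            for idx, job_id in enumerate(job_ids):
                if time.time() > deadline:
                    return archived, len(job_ids) - idx
                if archive_one(job_id):
                    archived += 1
            return archived, 0

        done, left = auto_archive(range(100000), lambda _job_id: True)
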
  16. Oct 16, 2008
    • Add an interface for the drain flag changes/query · 3ccafd0e
      Iustin Pop authored
      This adds the set/reset in the jqueue and luxi modules, a way to
      query it in OpQueryConfigValues, and the command line interface for
      it:
      $ gnt-cluster queue info
      The drain flag is unset
      $ gnt-cluster queue drain
      $ gnt-cluster queue info
      The drain flag is set
      $ gnt-cluster queue undrain
      $ gnt-cluster queue info
      The drain flag is unset
      
      The setting is done via luxi and not via an opcode because opcodes
      can't be executed while the queue is drained; the query, on the other
      hand, is not done via luxi, since in the future the flag might become
      a cluster property as opposed to a node one.
      
      Reviewed-by: imsnah
  17. Oct 15, 2008
  18. Oct 06, 2008
    • Implement job auto-archiving · 07cd723a
      Iustin Pop authored
      This patch adds a new luxi call that implements auto-archiving of jobs
      older than a certain age (or -1 for all completed jobs), and the gnt-job
      command that makes use of this (with 'all' for -1).
      
      Reviewed-by: imsnah
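
      The age rule itself is simple; a sketch (illustrative only, not the
      real jqueue code), with the age given in seconds and -1 meaning
      "every finished job":

        import time

        FINISHED_STATUSES = {"success", "error", "canceled"}

        def should_archive(status, end_timestamp, age, now=None):
            """Return True if a finished job is old enough (or age == -1)."""
            if status not in FINISHED_STATUSES:
                return False          # never archive queued/running jobs
            if age == -1:
                return True           # the 'all' case from the command line
            now = time.time() if now is None else now
            return (now - end_timestamp) > age

        # e.g. "archive jobs older than six hours" maps to age=21600,
        # while 'all' maps to age=-1
        assert should_archive("success", end_timestamp=0, age=-1)
        assert not should_archive("running", end_timestamp=0, age=-1)
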
  19. Oct 01, 2008
    • Add new query to get cluster config values · ae5849b5
      Michael Hanselmann authored
      This can be used to retrieve certain cluster config values from
      within clients.
      
      OpDumpClusterConfig was not used anywhere, hence I'm just reusing
      it. The way ConfigWriter.DumpConfig returned the configuration
      was not thread-safe, anyway (no deepcopy).
      
      Reviewed-by: iustinp
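
      A small sketch of the thread-safety point (illustrative class, not the
      real ConfigWriter): hand callers copies of the requested values, never
      references into the live configuration that another thread may be
      mutating.

        import copy
        import threading

        class ConfigStore:
            def __init__(self, values):
                self._lock = threading.Lock()
                self._values = values

            def QueryConfigValues(self, fields):
                with self._lock:
                    # deepcopy so callers never share mutable state with
                    # the live configuration
                    return [copy.deepcopy(self._values.get(f)) for f in fields]

        cfg = ConfigStore({"cluster_name": "cluster1", "master_node": "node1"})
        name, master = cfg.QueryConfigValues(["cluster_name", "master_node"])
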
  20. Aug 29, 2008
    • Make WaitForJobChanges deal with long jobs · 5c735209
      Iustin Pop authored
      This patch alters the WaitForJobChanges luxi-RPC call to have a
      configurable timeout, so that the call behaves nicely with long jobs
      that have no update.
      
      We do this by adding a timeout parameter to the RPC call and
      returning a special constant when the timeout is reached without an
      update. The luxi client will repeatedly call WaitForJobChanges until
      it gets a real change. The timeout is hardcoded as half the RWTO
      value.
      
      The patch also removes an unused variable (new_state) from the
      WaitForJobChanges method.
      
      Reviewed-by: imsnah,ultrotter
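
      The client-side loop then has roughly this shape (the sentinel value
      and the timeout constant below are assumptions made for the sketch):

        TIMEOUT_MARKER = "timeout"   # assumed sentinel for "nothing changed"
        WFJC_TIMEOUT = 30.0          # e.g. half of a 60s read/write timeout

        def wait_for_job_change(call, job_id, previous_state):
            """Keep re-issuing the RPC until a real change arrives."""
            while True:
                result = call(job_id, previous_state, WFJC_TIMEOUT)
                if result != TIMEOUT_MARKER:
                    return result    # an actual status/log update
                # else: no news within the timeout; ask again instead of failing

        # Usage with a fake call that reports a change on the second attempt:
        answers = iter([TIMEOUT_MARKER, ("running", [])])
        status, log_entries = wait_for_job_change(lambda *a: next(answers), 42, None)
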
  21. Aug 28, 2008
  22. Aug 27, 2008
    • Make sure that client programs get all messages · 6c5a7090
      Michael Hanselmann authored
      This is a large patch, but I can't figure out how to split it without
      breaking stuff. The old way of getting messages by always getting the
      last one didn't bring all messages to the client if they were added
      too fast, thereby making commands like “gnt-cluster verify” less than
      useful. These changes introduce a serial number per log entry to
      keep track of which messages a client has already received. They
      also remove the per-opcode log lock to make reading log entries
      thread safe.
      
      Reviewed-by: ultrotter
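
      A toy model of the serial idea (not the real jqueue structures): each
      log entry carries an increasing serial, and the client asks for
      everything after the last serial it has seen, so bursts of messages
      can no longer be skipped.

        job_log = [(1, "node1: connection ok"),
                   (2, "node2: connection ok"),
                   (3, "cluster verify: 0 errors")]

        def messages_after(entries, last_serial):
            """Return all (serial, message) pairs newer than last_serial."""
            return [(serial, msg) for serial, msg in entries
                    if serial > last_serial]

        seen = 0
        while True:
            fresh = messages_after(job_log, seen)
            if not fresh:
                break
            for serial, msg in fresh:
                print(msg)
                seen = serial
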
  23. Aug 11, 2008
  24. Aug 08, 2008
  25. Aug 06, 2008
  26. Jul 30, 2008
    • Fix pylint-detected issues · 38206f3c
      Iustin Pop authored
      This is mostly:
        - whitespace fixes (space at EOL in some files, not all; broken
          indentation, etc.)
        - variable names shadowing others (one is a real bug in there)
        - too-long lines
        - cleanup of most unused imports (not all)
      
      Reviewed-by: ultrotter
  27. Jul 09, 2008
  28. Jul 08, 2008
  29. Jun 21, 2008
    • Implement handling of luxi errors in cli.py · 03a8dbdc
      Iustin Pop authored
      Currently the generic handling of ganeti errors in cli.py (GenericMain
      and FormatError) only handles the core ganeti errors, and not the client
      protocol errors (which live in a separate hierarchy).
      
      This patch adds handling of luxi errors too, and also adds another luxi
      error for the case when the master is not running. This gives us a nice:
      
        gnta1:~# gnt-node list
        Cannot communicate with the master daemon.
        Is it running and listening on '/var/run/ganeti-master.sock'?
      
      error message instead of a traceback.
      
      Reviewed-by: amishchenko
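
      The error-formatting side is essentially a type check plus a friendly
      message; a sketch with a stand-in exception class (the real luxi error
      hierarchy differs):

        class NoMasterError(Exception):
            """Stand-in for a 'master daemon not reachable' luxi error."""

        def format_error(err):
            if isinstance(err, NoMasterError):
                return ("Cannot communicate with the master daemon.\n"
                        "Is it running and listening on '%s'?" % err.args[0])
            return str(err)   # fall back to generic handling

        print(format_error(NoMasterError("/var/run/ganeti-master.sock")))
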
  30. Apr 10, 2008
    • Change client protocol to raise exception on failures · b77acb3e
      Iustin Pop authored
      Currently the luxi.client.SubmitJob and Query methods return the unserialized
      result without processing it at all. This patch changes this by adding a
      'RequestException' error that is raised if the query itself or the
      submission of the job failed, and (if not) returning only the 'result'
      field from the message.
      
      The patch also processes the result of a query when we queried for
      jobs, as the 'op_list' field in the result contains serialized
      opcodes and we need them de-serialized.
      
      Reviewed-by: ultrotter
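
      A minimal model of the new response handling (the envelope field names
      are assumptions): raise if the call failed, otherwise hand back only
      the payload.

        import json

        class RequestException(Exception):
            """Raised when the master reports a failed query or submission."""

        def unwrap(raw_response):
            msg = json.loads(raw_response)
            if not msg.get("success"):
                raise RequestException(msg.get("result"))
            return msg["result"]   # callers only ever see the payload

        tags = unwrap('{"success": true, "result": ["prod", "web"]}')
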
  31. Apr 07, 2008
    • Move some checks from cli.py to luxi.py · a14a17fc
      Iustin Pop authored
      The idea of cli.py and luxi.py is that all protocol checks should be in
      luxi, and cli.py should just offer some helpful shortcuts for the
      command line scripts.
      
      This patch removes the result checks from cli and adds some other checks
      to luxi. It no longer checks success/failure, since it's not yet
      clear how that should be handled - probably via exceptions.
      
      Reviewed-by: ultrotter
  32. Apr 01, 2008
    • Add submit function to lib/cli.py · ceab32dd
      Iustin Pop authored
      This patch adds functions to lib/cli.py that submit jobs or queries
      over the unix socket interface. They will be used by the scripts
      instead of the SubmitOpCode function.
      
      Reviewed-by: ultrotter
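
      Such a helper is little more than "open the socket client, wrap the
      opcodes into a job, return the job id"; a sketch with a stand-in
      client (not the real lib/luxi.py):

        def SubmitJobHelper(opcodes, client_factory):
            """Illustrative cli-level helper: send opcodes as one job."""
            cl = client_factory()          # e.g. connects to the master socket
            return cl.SubmitJob(opcodes)   # caller then polls the job id

        class _FakeClient:
            def SubmitJob(self, ops):
                return 7                   # pretend job id

        job_id = SubmitJobHelper([{"OP_ID": "OP_NODE_QUERY"}], _FakeClient)
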