- Oct 17, 2008
Guido Trotter authored
- remove now unused osdev and swapdev arguments from backend, noded, rpc, cmdlib
- convert docstrings to epydoc

Reviewed-by: iustinp
Oleksiy Mishchenko authored
Reviewed-by: imsnah
- Oct 16, 2008
Michael Hanselmann authored
Requests are no longer logged to a separate file. Reviewed-by: amishchenko
Iustin Pop authored
In order to account for future improvements to master failover, we move the actual data gathering capabilities from ganeti-masterd into bootstrap.py, and leave only the verification in masterd. The verification procedure is changed to retry multiple times (up to one minute) in case most nodes do not respond, and the algorithm is changed to require at least half (but not half+1) of the votes, since our own vote should also count (we vote for ourselves).

Examples for a consistent (config-wise) cluster:
- 5 node cluster, 2 nodes down: still start
- 4 node cluster, 2 nodes down: retry for one minute, then abort

Reviewed-by: ultrotter
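To make the voting rule concrete, here is a rough Python sketch consistent with the description and the two examples above; the function names, the retry parameters, and the exact tie-breaking are illustrative, not Ganeti's actual code:

```python
import time

def gather_votes(nodes, query_master_fn):
    """Each node votes for whoever it believes the master is; a node
    that does not answer counts as a vote for None."""
    votes = {}
    for node in nodes:
        master = query_master_fn(node)   # None if the node didn't answer
        votes[master] = votes.get(master, 0) + 1
    return votes

def check_agreement(my_name, nodes, query_master_fn, retries=6, delay=10.0):
    for _ in range(retries):
        votes = gather_votes(nodes, query_master_fn)
        mine = votes.get(my_name, 0)
        if votes.get(None, 0) < mine:
            # enough answers to decide: we need at least half of all
            # votes (not half + 1), since our own vote counts too
            return mine >= len(nodes) - mine
        time.sleep(delay)   # most nodes silent: keep retrying
    return False            # still no quorum after ~a minute: abort
```

With this sketch, a 5-node cluster with 2 nodes down yields 3 votes for us against 2 silent nodes and starts; a 4-node cluster with 2 nodes down never escapes the retry loop and aborts, matching the examples.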
Iustin Pop authored
This adds the set/reset in the jqueue and luxi modules, a way to query it in OpQueryConfigValues, and the command line interface for it:

  $ gnt-cluster queue info
  The drain flag is unset
  $ gnt-cluster queue drain
  $ gnt-cluster queue info
  The drain flag is set
  $ gnt-cluster queue undrain
  $ gnt-cluster queue info
  The drain flag is unset

The setting is done via luxi and not via an opcode because opcodes can't be executed while the queue is drained; the querying is not done via luxi because in the future the flag might become a cluster property rather than a node one. Reviewed-by: imsnah
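To make the drain semantics concrete, here is a toy model (not Ganeti's implementation; the marker-file path and function names are invented): the luxi call merely toggles a flag, and job submission refuses to proceed while it is set.

```python
import os

DRAIN_MARKER = "/tmp/example-queue-drain"  # illustrative path

def set_drain_flag(drain):
    """What the luxi call conceptually does: toggle the flag."""
    if drain:
        open(DRAIN_MARKER, "w").close()
    elif os.path.exists(DRAIN_MARKER):
        os.unlink(DRAIN_MARKER)

def submit_job(job):
    """Opcode submission refuses to run while the flag is set."""
    if os.path.exists(DRAIN_MARKER):
        raise RuntimeError("job queue is drained")
    return job  # stand-in for the real enqueue
```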
- Oct 15, 2008
Iustin Pop authored
A new multi-node call is added that sets/resets the drain flag. Reviewed-by: imsnah
Iustin Pop authored
This patch adds a generic method to identify the ganeti error given its class name, and implements this across the luxi protocol. Reviewed-by: imsnah
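The general pattern described here can be sketched as follows (a minimal illustration with made-up class names, not the actual luxi code): errors are serialized as a (class name, arguments) pair and re-instantiated on the other side from a registry of known classes, falling back to a generic error.

```python
class GenericError(Exception):
    """Base of the illustrative error hierarchy."""

class OpPrereqError(GenericError):
    """Example subclass."""

_ERROR_CLASSES = dict((cls.__name__, cls)
                      for cls in (GenericError, OpPrereqError))

def encode_error(err):
    """Serialize an error as (class name, args) for the wire."""
    return (err.__class__.__name__, list(err.args))

def decode_error(name, args):
    """Re-instantiate from the class name, falling back to the base."""
    return _ERROR_CLASSES.get(name, GenericError)(*args)
```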
Michael Hanselmann authored
Reviewed-by: ultrotter
- Oct 14, 2008
Iustin Pop authored
The newly-added node-specific ValidateParams hypervisor method is exported over RPC, using the semi-standard (success, message) return value. Multi-node call, so that we call on both primary and secondary at once. Reviewed-by: ultrotter
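The (success, message) convention might look like this on the node side; a sketch, where validate_hv_params is a made-up wrapper and the exception-to-tuple mapping is assumed rather than copied from the source:

```python
def validate_hv_params(hypervisor, hvparams):
    """Wrap a hypervisor's ValidateParams-style check into the
    (success, message) pair used as the RPC return value."""
    try:
        hypervisor.ValidateParams(hvparams)  # assumed to raise on bad params
    except Exception as err:
        return (False, str(err))
    return (True, None)
```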
- Oct 13, 2008
Iustin Pop authored
This fixes:
- a whitespace change (double blank lines between methods)
- duplication of call_upload_file, introduced by mistake in rev 1795, which went undetected because of the many changes in that rev (only diff -b shows it clearly)
- call_instance_info didn't pass the hypervisor name parameter, but the backend requires it

Reviewed-by: ultrotter
- Oct 12, 2008
Iustin Pop authored
Currently, we check whether we have a given IP address (i.e. it's alive on one of our interfaces) by manually calling TcpPing(source=localhost). This works, but having it spread all over the code makes it hard to change the implementation. The patch abstracts this into a separate utils.OwnIpAddress(addr) function. We add an rpc call for it, which we use instead of the (single use of) call_node_tcp_ping. We leave node_tcp_ping in, as it still seems useful; eventually it should be removed in a separate patch. Reviewed-by: imsnah
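For illustration, a simple stand-in with the same contract as utils.OwnIpAddress(addr) can be written by trying to bind a socket to the address; note that Ganeti's version is described as using TcpPing with a localhost source, so this bind-based variant only demonstrates the semantics:

```python
import socket

def own_ip_address(addr):
    """Return True if addr is configured on one of our interfaces."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind((addr, 0))  # port 0: any free port
        return True
    except socket.error:
        return False
    finally:
        s.close()
```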
- Oct 10, 2008
Michael Hanselmann authored
Reviewed-by: iustinp
Iustin Pop authored
This big patch changes the call model used in inter-node rpc from standalone function calls in the rpc module to calls via an RpcRunner class that holds all the methods. This can be used in the future to enable smarter processing in the RPC layer itself (some quick examples: not setting the DiskID from cmdlib code, but only once in each rpc call, etc.). There are a few RPC calls that are made outside of the LU code; these are left as staticmethods, so they can be used without a class instance (which requires a ConfigWriter instance). Reviewed-by: imsnah
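Structurally, the change amounts to something like the following sketch; the method names and the transport stub are illustrative, not the real module:

```python
class RpcRunner(object):
    """Sketch of the new shape: rpc calls as methods sharing state."""

    def __init__(self, cfg):
        self._cfg = cfg  # e.g. a ConfigWriter, available to every call

    def _call(self, node, procedure, args):
        # stand-in for the real network transport
        return (node, procedure, args)

    def call_instance_info(self, node, instance_name, hvname):
        # per-call preparation using self._cfg (e.g. setting disk IDs)
        # could live here instead of being repeated in every LU
        return self._call(node, "instance_info", [instance_name, hvname])

    @staticmethod
    def call_node_start_master(node):
        # calls made outside LU code stay static: no ConfigWriter needed
        return (node, "node_start_master", [])
```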
- Oct 08, 2008
Iustin Pop authored
This (big) patch moves the hypervisor type from the cluster level to the instance level; the cluster attribute remains as the default hypervisor, and will be renamed accordingly in a later patch. The cluster also gains the ‘enable_hypervisors’ attribute, and instances can be created with any of the enabled ones (no provision yet for changing that attribute). The many, many changes in the rpc/backend layer are due to the fact that all backend code used to read the hypervisor from the local copy of the config; now we have to send it (either in the instance object or as a separate parameter) for each function. The node list will by default show the node free/total memory for the default hypervisor; a new flag should be added to it for selecting another hypervisor. The instance list has a new field, hypervisor, that shows the instance's hypervisor. Cluster verify runs for all enabled hypervisor types. The new FIXMEs are related to IAllocator, since now the node total/free/used memory counts are wrong (we can't reliably compute the free memory). Reviewed-by: imsnah
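A toy data model of the new layout (purely illustrative, not Ganeti's objects module): the cluster keeps a default and a list of enabled types, while each instance carries its own hypervisor.

```python
class Cluster(object):
    def __init__(self, default_hypervisor, enabled_hypervisors):
        self.default_hypervisor = default_hypervisor
        self.enabled_hypervisors = enabled_hypervisors

class Instance(object):
    def __init__(self, name, cluster, hypervisor=None):
        if hypervisor is None:
            hypervisor = cluster.default_hypervisor  # cluster-wide default
        if hypervisor not in cluster.enabled_hypervisors:
            raise ValueError("hypervisor %s not enabled" % hypervisor)
        self.name = name
        self.hypervisor = hypervisor  # now a per-instance attribute
```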
- Oct 07, 2008
Iustin Pop authored
Currently the call_instance_migrate call only passes the instance name; we need to pass the whole object for the hypervisor_type changes (all the other individual instance rpc calls already pass the instance object). Reviewed-by: imsnah
Iustin Pop authored
Background: when we have multiple jobs in the queue (more than just a few), many of the jobs (up to the number of threads) will be in state 'running', although many of them could actually be blocked waiting for some locks. This is not good, as one cannot easily see what is happening.

The patch extends the opcode/job possible statuses with another one, waiting, which shows that the LU is in the lock-acquisition phase. The mechanism for doing so is simple: we initialize (in the job queue) the opcode with OP_STATUS_WAITLOCK, and when the processor is ready to give control to the LU's Exec, it calls a notifier back into the _JobQueueWorker that sets the opcode status to OP_STATUS_RUNNING (with the proper queue locking). Because this mechanism does not save the job, all opcodes on disk will be in status WAITLOCK rather than RUNNING, so we also change the load sequence to consider WAITLOCK as RUNNING.

With the patch applied, creating five instances in parallel (via burnin) on a five-node cluster shows that only two are executing while three are waiting for locks. Reviewed-by: imsnah
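The notifier mechanism can be sketched like this (status constants and function names are illustrative): the opcode starts in the waiting status, and a callback from the processor flips it to running right before Exec is entered.

```python
OP_STATUS_WAITLOCK = "waiting"
OP_STATUS_RUNNING = "running"

class OpCode(object):
    def __init__(self):
        self.status = OP_STATUS_WAITLOCK  # initial state in the queue

def process_opcode(op, exec_fn, notify_running):
    # ... lock acquisition happens here; op.status is still "waiting" ...
    notify_running()   # the queue worker sets the status, under its lock
    return exec_fn()

op = OpCode()

def _mark_running():
    op.status = OP_STATUS_RUNNING

process_opcode(op, lambda: "result", _mark_running)
```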
- Oct 06, 2008
Iustin Pop authored
This patch adds a new luxi call that implements auto-archiving of jobs older than a certain age (or -1 for all completed jobs), and the gnt-job command that makes use of this (with 'all' for -1). Reviewed-by: imsnah
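The age filter described (-1 meaning all completed jobs) reduces to a small predicate; a sketch, with the job's end time assumed to be a Unix timestamp:

```python
import time

def should_archive(job_end_time, age, now=None):
    """-1 archives every completed job; otherwise only jobs that
    finished more than `age` seconds ago. job_end_time is None for
    jobs that have not completed."""
    if now is None:
        now = time.time()
    if job_end_time is None:
        return False
    return age == -1 or (now - job_end_time) > age
```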
Iustin Pop authored
Currently there are three functions in backend that need the cluster name in order to instantiate an SshRunner. The patch changes these to get the cluster name from the master in the rpc call; once the multi-hypervisor change is implemented, very few places in which we need the SCR will remain in the backend. Reviewed-by: killerfoxi, imsnah
- Oct 01, 2008
Michael Hanselmann authored
Use simpleconfig instead of ssconf. Reviewed-by: iustinp
Michael Hanselmann authored
Use RPC calls instead of ssconf. Reviewed-by: iustinp
Michael Hanselmann authored
Replace ssconf with utility functions. Reviewed-by: iustinp
Michael Hanselmann authored
This can be used to retrieve certain cluster config values from within clients. OpDumpClusterConfig was not used anywhere, hence I'm just reusing it. The way ConfigWriter.DumpConfig returned the configuration was not thread-safe, anyway (no deepcopy). Reviewed-by: iustinp
Iustin Pop authored
The watcher didn't handle down nodes; fix this by ignoring (in the secondary node reboot checks) any node that doesn't return a boot id. Reviewed-by: imsnah
Iustin Pop authored
The watcher was using conflicting attributes of the instance:
- it queried the admin_/oper_state fields, which are booleans
- but it compared those to the status field (which is a text field)

The code was changed to query the aggregated 'status' field, as that will also return indication of node problems, and we can use this single field for all decisions. We still ask for the admin_state field, as that is needed for the activate-disks check (in secondary node restart).

The patch also touches the watcher in some other ways:
- log exceptions more nicely
- convert a method to @staticmethod
- remove unused imports

Reviewed-by: imsnah
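A toy illustration of the class of bug fixed here (field names and status values are made up, not exact Ganeti constants): a boolean compared against a string can never match.

```python
instance = {"admin_state": True, "status": "ERROR_down"}  # invented values

# broken: a boolean compared with a string is always False
would_never_match = (instance["admin_state"] == "ERROR_down")

# fixed: base all decisions on the aggregated text status
needs_attention = instance["status"].startswith("ERROR_")
```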
Iustin Pop authored
The watcher had one last use of ganeti commands as opposed to sending requests via luxi. The patch changes this to use the cli functions. The patch also has two other changes:
- fix the docstring for OpVerifyDisks (found while converting this)
- enable stderr logging on the watcher when “-d” is passed

Reviewed-by: imsnah
- Sep 09, 2008
Michael Hanselmann authored
Reviewed-by: iustinp
Iustin Pop authored
This is an initial version of the master startup checks. It's a very rudimentary change; however, in normal usage (an old master is started while the rest of the cluster is functioning normally) it will succeed in preventing wrong startups. Reviewed-by: imsnah
Iustin Pop authored
We create a multi-node call so that querying all nodes for agreement will be fast. Reviewed-by: imsnah
Michael Hanselmann authored
This helps to prevent complete deadlocks. Reviewed-by: iustinp
- Sep 05, 2008
Michael Hanselmann authored
Only one process should modify the queue at the same time. Reviewed-by: iustinp
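One conventional way to enforce a single queue writer is an exclusive, non-blocking lock on a lock file; this is a sketch of the general technique, since the description does not say which mechanism is used here:

```python
import fcntl

def acquire_queue_lock(path):
    """Take an exclusive, non-blocking lock; fail if someone holds it."""
    fd = open(path, "w")
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except IOError:
        fd.close()
        raise RuntimeError("another process owns the job queue")
    return fd  # keep the file open; closing it releases the lock
```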
- Aug 29, 2008
Iustin Pop authored
This patch alters the WaitForJobChanges luxi-RPC call to have a configurable timeout, so that the call behaves nicely with long jobs that have no updates. We do this by adding a timeout parameter to the RPC call and returning a special constant when the timeout is reached without an update. The luxi client will repeatedly call WaitForJobChanges until it gets a real change. The timeout is hardcoded as half the RWTO value. The patch also removes an unused variable (new_state) from the WaitForJobChanges method. Reviewed-by: imsnah,ultrotter
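The resulting client loop is simple; a sketch, with the sentinel name invented for illustration:

```python
JOB_NOTCHANGED = "nochange"   # made-up timeout marker

def wait_for_job_changes(call_fn, job_id, fields, prev_state):
    """Keep calling until the server reports a real change; each
    call_fn invocation blocks server-side for at most the timeout."""
    while True:
        result = call_fn(job_id, fields, prev_state)
        if result != JOB_NOTCHANGED:
            return result
```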
- Aug 27, 2008
Michael Hanselmann authored
This is a large patch, but I can't figure out how to split it without breaking stuff. The old way of getting messages, by always fetching the last one, didn't bring all messages to the client if they were added too fast, thereby making commands like “gnt-cluster verify” less than useful. These changes introduce a serial number per log entry to keep track of which messages a client has already received. They also remove the per-opcode log lock to make reading log entries thread safe. Reviewed-by: ultrotter
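A minimal sketch of the serial-number idea (illustrative names, not the actual jqueue code): every entry gets a monotonically increasing serial, and clients ask only for entries newer than the last serial they saw, so fast-arriving messages are never skipped.

```python
class OpLog(object):
    def __init__(self):
        self._serial = 0
        self._entries = []   # list of (serial, message)

    def append(self, message):
        self._serial += 1
        self._entries.append((self._serial, message))

    def entries_after(self, last_seen):
        """Everything the client has not yet received."""
        return [e for e in self._entries if e[0] > last_seen]
```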
- Aug 18, 2008
Michael Hanselmann authored
By using this Linux-specific way we don't have to care about removing the socket file when quitting or starting (after an unclean shutdown). For a more detailed description, see the comment in the patch. Reviewed-by: schreiberal
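The Linux-specific mechanism alluded to is, presumably, the abstract namespace for Unix sockets: an address starting with a NUL byte exists only in kernel memory, so there is never a socket file to unlink. A minimal sketch (Linux only; the socket name is made up):

```python
import socket

sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
sock.bind("\0example-master-socket")  # leading NUL byte = abstract namespace
sock.listen(5)
# ... accept connections here; no socket file ever exists on disk ...
sock.close()
```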
- Aug 11, 2008
Michael Hanselmann authored
This way clients can react faster to status or message changes and don't have to poll anymore. Reviewed-by: ultrotter
- Aug 08, 2008
Michael Hanselmann authored
Reviewed-by: iustinp
Michael Hanselmann authored
This will be used to archive jobs. Reviewed-by: iustinp
Michael Hanselmann authored
The lock will also be needed by another function. Reviewed-by: iustinp
Michael Hanselmann authored
Reviewed-by: iustinp
Michael Hanselmann authored
Reviewed-by: iustinp
Michael Hanselmann authored
jobqueue_update: Uploads a job queue file's content to a node. The most common operation is to upload something that we already have in a string. Unlike the upload_file function, the file is not read again when distributing changes; the content has to be passed as a string.

jobqueue_purge: Removes all queue-related files from a node.

Reviewed-by: iustinp
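The contrast with upload_file can be sketched as follows (illustrative signatures; QUEUE_DIR is a made-up path): the content arrives as a string and is written verbatim, never re-read from disk.

```python
import glob
import os

QUEUE_DIR = "/tmp/example-queue"  # illustrative

def jobqueue_update(file_name, content):
    """Node-side handler: write the given content verbatim; unlike
    upload_file, nothing is re-read from disk when distributing."""
    with open(file_name, "w") as f:
        f.write(content)

def jobqueue_purge():
    """Node-side handler: remove all queue-related files."""
    for path in glob.glob(os.path.join(QUEUE_DIR, "*")):
        os.unlink(path)
```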