- Oct 17, 2008
Guido Trotter authored
- remove now unused osdev and swapdev arguments from backend, noded, rpc, cmdlib
- convert docstrings to epydoc

Reviewed-by: iustinp
Oleksiy Mishchenko authored
Reviewed-by: imsnah
- Oct 16, 2008
Michael Hanselmann authored
Requests are no longer logged to a separate file. Reviewed-by: amishchenko
Iustin Pop authored
In order to account for future improvements to master failover, we move the actual data gathering capabilities from ganeti-masterd into bootstrap.py, and leave only the verification in masterd. The verification procedure is changed to retry multiple times (up to one minute) in case most nodes do not respond, and the algorithm is changed to require at least half (but not half+1) of the votes, since our own vote should also count (we vote for ourselves).

Examples for a consistent (config-wise) cluster:
- 5 node cluster, 2 nodes down: still start
- 4 node cluster, 2 nodes down: retry for one minute, then abort

Reviewed-by: ultrotter
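To make the voting rule concrete, here is a rough Python sketch consistent with the description and the two examples above; the function names, the retry parameters, and the exact tie-breaking are illustrative, not Ganeti's actual code:

```python
import time

def gather_votes(nodes, query_master_fn):
    """Each node votes for whoever it believes the master is; a node
    that does not answer counts as a vote for None."""
    votes = {}
    for node in nodes:
        master = query_master_fn(node)   # None if the node didn't answer
        votes[master] = votes.get(master, 0) + 1
    return votes

def check_agreement(my_name, nodes, query_master_fn, retries=6, delay=10.0):
    for _ in range(retries):
        votes = gather_votes(nodes, query_master_fn)
        mine = votes.get(my_name, 0)
        if votes.get(None, 0) < mine:
            # enough answers to decide: we need at least half of all
            # votes (not half + 1), since our own vote counts too
            return mine >= len(nodes) - mine
        time.sleep(delay)   # most nodes silent: keep retrying
    return False            # still no quorum after ~a minute: abort
```

With this sketch, a 5-node cluster with 2 nodes down yields 3 votes for us against 2 silent nodes and starts; a 4-node cluster with 2 nodes down never escapes the retry loop and aborts, matching the examples.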
Iustin Pop authored
This adds the set/reset in the jqueue and luxi modules, a way to query it in OpQueryConfigValues, and the command line interface for it:

  $ gnt-cluster queue info
  The drain flag is unset
  $ gnt-cluster queue drain
  $ gnt-cluster queue info
  The drain flag is set
  $ gnt-cluster queue undrain
  $ gnt-cluster queue info
  The drain flag is unset

The setting is done via luxi and not via an opcode because opcodes can't be executed while the queue is drained; the querying is not done via luxi because in the future the flag might become a cluster property rather than a node one. Reviewed-by: imsnah
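To make the drain semantics concrete, here is a toy model (not Ganeti's implementation; the marker-file path and function names are invented): the luxi call merely toggles a flag, and job submission refuses to proceed while it is set.

```python
import os

DRAIN_MARKER = "/tmp/example-queue-drain"  # illustrative path

def set_drain_flag(drain):
    """What the luxi call conceptually does: toggle the flag."""
    if drain:
        open(DRAIN_MARKER, "w").close()
    elif os.path.exists(DRAIN_MARKER):
        os.unlink(DRAIN_MARKER)

def submit_job(job):
    """Opcode submission refuses to run while the flag is set."""
    if os.path.exists(DRAIN_MARKER):
        raise RuntimeError("job queue is drained")
    return job  # stand-in for the real enqueue
```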
- Oct 15, 2008
Iustin Pop authored
A new multi-node call is added that sets/resets the drain flag. Reviewed-by: imsnah
Iustin Pop authored
This patch adds a generic method to identify the ganeti error given its class name, and implements this across the luxi protocol. Reviewed-by: imsnah
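The general pattern described here can be sketched as follows (a minimal illustration with made-up class names, not the actual luxi code): errors are serialized as a (class name, arguments) pair and re-instantiated on the other side from a registry of known classes, falling back to a generic error.

```python
class GenericError(Exception):
    """Base of the illustrative error hierarchy."""

class OpPrereqError(GenericError):
    """Example subclass."""

_ERROR_CLASSES = dict((cls.__name__, cls)
                      for cls in (GenericError, OpPrereqError))

def encode_error(err):
    """Serialize an error as (class name, args) for the wire."""
    return (err.__class__.__name__, list(err.args))

def decode_error(name, args):
    """Re-instantiate from the class name, falling back to the base."""
    return _ERROR_CLASSES.get(name, GenericError)(*args)
```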
Michael Hanselmann authored
Reviewed-by: ultrotter
- Oct 14, 2008
Iustin Pop authored
The newly-added node-specific ValidateParams hypervisor method is exported over RPC, using the semi-standard (success, message) return value. Multi-node call, so that we call on both primary and secondary at once. Reviewed-by: ultrotter
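The (success, message) convention might look like this on the node side; a sketch, where validate_hv_params is a made-up wrapper and the exception-to-tuple mapping is assumed rather than copied from the source:

```python
def validate_hv_params(hypervisor, hvparams):
    """Wrap a hypervisor's ValidateParams-style check into the
    (success, message) pair used as the RPC return value."""
    try:
        hypervisor.ValidateParams(hvparams)  # assumed to raise on bad params
    except Exception as err:
        return (False, str(err))
    return (True, None)
```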
- Oct 13, 2008
Iustin Pop authored
This fixes:
- a whitespace change (double blank lines between methods)
- duplication of call_upload_file, introduced by mistake in rev 1795, which went undetected because of the many changes in that rev (only diff -b shows it clearly)
- call_instance_info didn't pass the hypervisor name parameter, but the backend requires it

Reviewed-by: ultrotter
- Oct 12, 2008
Iustin Pop authored
Currently, we check whether we have a given IP address (i.e. it's alive on one of our interfaces) by manually calling TcpPing(source=localhost). This works, but having it spread all over the code makes it hard to change the implementation. The patch abstracts this into a separate utils.OwnIpAddress(addr) function. We add an rpc call for it, which we use instead of the (single use of) call_node_tcp_ping. We leave node_tcp_ping in, as it still seems useful; eventually it should be removed in a separate patch. Reviewed-by: imsnah
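For illustration, a simple stand-in with the same contract as utils.OwnIpAddress(addr) can be written by trying to bind a socket to the address; note that Ganeti's version is described as using TcpPing with a localhost source, so this bind-based variant only demonstrates the semantics:

```python
import socket

def own_ip_address(addr):
    """Return True if addr is configured on one of our interfaces."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind((addr, 0))  # port 0: any free port
        return True
    except socket.error:
        return False
    finally:
        s.close()
```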
- Oct 10, 2008
Michael Hanselmann authored
Reviewed-by: iustinp
Iustin Pop authored
This big patch changes the call model used in inter-node rpc from standalone function calls in the rpc module to calls via an RpcRunner class that holds all the methods. This can be used in the future to enable smarter processing in the RPC layer itself (some quick examples: not setting the DiskID from cmdlib code, but only once in each rpc call, etc.). There are a few RPC calls that are made outside of the LU code; these are left as staticmethods, so they can be used without a class instance (which requires a ConfigWriter instance). Reviewed-by: imsnah
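Structurally, the change amounts to something like the following sketch; the method names and the transport stub are illustrative, not the real module:

```python
class RpcRunner(object):
    """Sketch of the new shape: rpc calls as methods sharing state."""

    def __init__(self, cfg):
        self._cfg = cfg  # e.g. a ConfigWriter, available to every call

    def _call(self, node, procedure, args):
        # stand-in for the real network transport
        return (node, procedure, args)

    def call_instance_info(self, node, instance_name, hvname):
        # per-call preparation using self._cfg (e.g. setting disk IDs)
        # could live here instead of being repeated in every LU
        return self._call(node, "instance_info", [instance_name, hvname])

    @staticmethod
    def call_node_start_master(node):
        # calls made outside LU code stay static: no ConfigWriter needed
        return (node, "node_start_master", [])
```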
- Oct 08, 2008
Iustin Pop authored
This (big) patch moves the hypervisor type from the cluster level to the instance level; the cluster attribute remains as the default hypervisor, and will be renamed accordingly in a later patch. The cluster also gains the ‘enable_hypervisors’ attribute, and instances can be created with any of the enabled ones (no provision yet for changing that attribute). The many, many changes in the rpc/backend layer are due to the fact that all backend code used to read the hypervisor from the local copy of the config; now we have to send it (either in the instance object or as a separate parameter) for each function. The node list will by default show the node free/total memory for the default hypervisor; a new flag should be added to it for selecting another hypervisor. The instance list has a new field, hypervisor, that shows the instance's hypervisor. Cluster verify runs for all enabled hypervisor types. The new FIXMEs are related to IAllocator, since now the node total/free/used memory counts are wrong (we can't reliably compute the free memory). Reviewed-by: imsnah
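A toy data model of the new layout (purely illustrative, not Ganeti's objects module): the cluster keeps a default and a list of enabled types, while each instance carries its own hypervisor.

```python
class Cluster(object):
    def __init__(self, default_hypervisor, enabled_hypervisors):
        self.default_hypervisor = default_hypervisor
        self.enabled_hypervisors = enabled_hypervisors

class Instance(object):
    def __init__(self, name, cluster, hypervisor=None):
        if hypervisor is None:
            hypervisor = cluster.default_hypervisor  # cluster-wide default
        if hypervisor not in cluster.enabled_hypervisors:
            raise ValueError("hypervisor %s not enabled" % hypervisor)
        self.name = name
        self.hypervisor = hypervisor  # now a per-instance attribute
```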
- Oct 07, 2008
Iustin Pop authored
Currently the call_instance_migrate call only passes the instance name; we need to pass the whole object for the hypervisor_type changes (all the other individual instance rpc calls already pass the instance object). Reviewed-by: imsnah
Iustin Pop authored
Background: when we have multiple jobs in the queue (more than just a few), many of the jobs (up to the number of threads) will be in state 'running', although many of them could actually be blocked waiting for some locks. This is not good, as one cannot easily see what is happening.

The patch extends the opcode/job possible statuses with another one, waiting, which shows that the LU is in the lock-acquisition phase. The mechanism for doing so is simple: we initialize (in the job queue) the opcode with OP_STATUS_WAITLOCK, and when the processor is ready to give control to the LU's Exec, it calls a notifier back into the _JobQueueWorker that sets the opcode status to OP_STATUS_RUNNING (with the proper queue locking). Because this mechanism does not save the job, all opcodes on disk will be in status WAITLOCK rather than RUNNING, so we also change the load sequence to consider WAITLOCK as RUNNING.

With the patch applied, creating five instances in parallel (via burnin) on a five-node cluster shows that only two are executing while three are waiting for locks. Reviewed-by: imsnah
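The notifier mechanism can be sketched like this (status constants and function names are illustrative): the opcode starts in the waiting status, and a callback from the processor flips it to running right before Exec is entered.

```python
OP_STATUS_WAITLOCK = "waiting"
OP_STATUS_RUNNING = "running"

class OpCode(object):
    def __init__(self):
        self.status = OP_STATUS_WAITLOCK  # initial state in the queue

def process_opcode(op, exec_fn, notify_running):
    # ... lock acquisition happens here; op.status is still "waiting" ...
    notify_running()   # the queue worker sets the status, under its lock
    return exec_fn()

op = OpCode()

def _mark_running():
    op.status = OP_STATUS_RUNNING

process_opcode(op, lambda: "result", _mark_running)
```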
- Oct 06, 2008
Iustin Pop authored
This patch adds a new luxi call that implements auto-archiving of jobs older than a certain age (or -1 for all completed jobs), and the gnt-job command that makes use of this (with 'all' for -1). Reviewed-by: imsnah
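The age filter described (-1 meaning all completed jobs) reduces to a small predicate; a sketch, with the job's end time assumed to be a Unix timestamp:

```python
import time

def should_archive(job_end_time, age, now=None):
    """-1 archives every completed job; otherwise only jobs that
    finished more than `age` seconds ago. job_end_time is None for
    jobs that have not completed."""
    if now is None:
        now = time.time()
    if job_end_time is None:
        return False
    return age == -1 or (now - job_end_time) > age
```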
Iustin Pop authored
Currently there are three functions in backend that need the cluster name in order to instantiate an SshRunner. The patch changes these to get the cluster name from the master in the rpc call; once the multi-hypervisor change is implemented, very few places in which we need the SCR will remain in the backend. Reviewed-by: killerfoxi, imsnah
- Oct 01, 2008
Michael Hanselmann authored
Use simpleconfig instead of ssconf. Reviewed-by: iustinp
Michael Hanselmann authored
Use RPC calls instead of ssconf. Reviewed-by: iustinp
Michael Hanselmann authored
Replace ssconf with utility functions. Reviewed-by: iustinp
Michael Hanselmann authored
This can be used to retrieve certain cluster config values from within clients. OpDumpClusterConfig was not used anywhere, hence I'm just reusing it. The way ConfigWriter.DumpConfig returned the configuration was not thread-safe, anyway (no deepcopy). Reviewed-by: iustinp
Iustin Pop authored
The watcher didn't handle down nodes; fix this by ignoring (in the secondary node reboot checks) any node that doesn't return a boot id. Reviewed-by: imsnah
Iustin Pop authored
The watcher was using conflicting attributes of the instance:
- it queried the admin_/oper_state fields, which are booleans
- but it compared those to the status field (which is a text field)

The code was changed to query the aggregated 'status' field, as that will also return indication of node problems, and we can use this single field for all decisions. We still ask for the admin_state field, as that is needed for the activate-disks check (in secondary node restart).

The patch also touches the watcher in some other ways:
- log exceptions more nicely
- convert a method to @staticmethod
- remove unused imports

Reviewed-by: imsnah
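A toy illustration of the class of bug fixed here (field names and status values are made up, not exact Ganeti constants): a boolean compared against a string can never match.

```python
instance = {"admin_state": True, "status": "ERROR_down"}  # invented values

# broken: a boolean compared with a string is always False
would_never_match = (instance["admin_state"] == "ERROR_down")

# fixed: base all decisions on the aggregated text status
needs_attention = instance["status"].startswith("ERROR_")
```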
Iustin Pop authored
The watcher had one last use of ganeti commands as opposed to sending requests via luxi. The patch changes this to use the cli functions. The patch also has two other changes:
- fix the docstring for OpVerifyDisks (found while converting this)
- enable stderr logging on the watcher when “-d” is passed

Reviewed-by: imsnah
- Sep 09, 2008
Michael Hanselmann authored
Reviewed-by: iustinp
Iustin Pop authored
This is an initial version of the master startup checks. It's a very rudimentary change; however, in normal usage (an old master is started while the rest of the cluster is functioning normally) it will succeed in preventing wrong startups. Reviewed-by: imsnah
Iustin Pop authored
We create a multi-node call so that querying all nodes for agreement will be fast. Reviewed-by: imsnah
Michael Hanselmann authored
This helps to prevent complete deadlocks. Reviewed-by: iustinp
- Sep 05, 2008
Michael Hanselmann authored
Only one process should modify the queue at the same time. Reviewed-by: iustinp
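One conventional way to enforce a single queue writer is an exclusive, non-blocking lock on a lock file; this is a sketch of the general technique, since the description does not say which mechanism is used here:

```python
import fcntl

def acquire_queue_lock(path):
    """Take an exclusive, non-blocking lock; fail if someone holds it."""
    fd = open(path, "w")
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except IOError:
        fd.close()
        raise RuntimeError("another process owns the job queue")
    return fd  # keep the file open; closing it releases the lock
```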
- Aug 29, 2008
Iustin Pop authored
This patch alters the WaitForJobChanges luxi-RPC call to have a configurable timeout, so that the call behaves nicely with long jobs that have no updates. We do this by adding a timeout parameter to the RPC call and returning a special constant when the timeout is reached without an update. The luxi client will repeatedly call WaitForJobChanges until it gets a real change. The timeout is hardcoded as half the RWTO value. The patch also removes an unused variable (new_state) from the WaitForJobChanges method. Reviewed-by: imsnah,ultrotter
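The resulting client loop is simple; a sketch, with the sentinel name invented for illustration:

```python
JOB_NOTCHANGED = "nochange"   # made-up timeout marker

def wait_for_job_changes(call_fn, job_id, fields, prev_state):
    """Keep calling until the server reports a real change; each
    call_fn invocation blocks server-side for at most the timeout."""
    while True:
        result = call_fn(job_id, fields, prev_state)
        if result != JOB_NOTCHANGED:
            return result
```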
- Aug 27, 2008
Michael Hanselmann authored
This is a large patch, but I can't figure out how to split it without breaking stuff. The old way of getting messages, by always fetching the last one, didn't bring all messages to the client if they were added too fast, thereby making commands like “gnt-cluster verify” less than useful. These changes introduce a serial number per log entry to keep track of which messages a client has already received. They also remove the per-opcode log lock to make reading log entries thread safe. Reviewed-by: ultrotter
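A minimal sketch of the serial-number idea (illustrative names, not the actual jqueue code): every entry gets a monotonically increasing serial, and clients ask only for entries newer than the last serial they saw, so fast-arriving messages are never skipped.

```python
class OpLog(object):
    def __init__(self):
        self._serial = 0
        self._entries = []   # list of (serial, message)

    def append(self, message):
        self._serial += 1
        self._entries.append((self._serial, message))

    def entries_after(self, last_seen):
        """Everything the client has not yet received."""
        return [e for e in self._entries if e[0] > last_seen]
```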
- Aug 18, 2008
Michael Hanselmann authored
By using this Linux-specific way we don't have to care about removing the socket file when quitting or starting (after an unclean shutdown). For a more detailed description, see the comment in the patch. Reviewed-by: schreiberal
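The Linux-specific mechanism alluded to is, presumably, the abstract namespace for Unix sockets: an address starting with a NUL byte exists only in kernel memory, so there is never a socket file to unlink. A minimal sketch (Linux only; the socket name is made up):

```python
import socket

sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
sock.bind("\0example-master-socket")  # leading NUL byte = abstract namespace
sock.listen(5)
# ... accept connections here; no socket file ever exists on disk ...
sock.close()
```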
- Aug 11, 2008
Michael Hanselmann authored
This way clients can react faster to status or message changes and don't have to poll anymore. Reviewed-by: ultrotter
- Aug 08, 2008
Michael Hanselmann authored
Reviewed-by: iustinp
Michael Hanselmann authored
This will be used to archive jobs. Reviewed-by: iustinp
Michael Hanselmann authored
The lock will also be needed by another function. Reviewed-by: iustinp
Michael Hanselmann authored
Reviewed-by: iustinp
Michael Hanselmann authored
Reviewed-by: iustinp
Michael Hanselmann authored
jobqueue_update: Uploads a job queue file's content to a node. The most common operation is to upload something that we already have in a string. Unlike the upload_file function, the file is not read again when distributing changes; the content has to be passed as a string.

jobqueue_purge: Removes all queue-related files from a node.

Reviewed-by: iustinp
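The contrast with upload_file can be sketched as follows (illustrative signatures; QUEUE_DIR is a made-up path): the content arrives as a string and is written verbatim, never re-read from disk.

```python
import glob
import os

QUEUE_DIR = "/tmp/example-queue"  # illustrative

def jobqueue_update(file_name, content):
    """Node-side handler: write the given content verbatim; unlike
    upload_file, nothing is re-read from disk when distributing."""
    with open(file_name, "w") as f:
        f.write(content)

def jobqueue_purge():
    """Node-side handler: remove all queue-related files."""
    for path in glob.glob(os.path.join(QUEUE_DIR, "*")):
        os.unlink(path)
```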