- Aug 10, 2009
Guido Trotter authored
Signed-off-by: Guido Trotter <ultrotter@google.com>
-
- Jul 29, 2009
Guido Trotter authored
When the parameter is set to True and start_daemons is also True, ganeti-masterd will be started with the new --no-voting --yes-do-it options. This option is set to True only on masterfailover, when no_voting is used. This changes the behavior from 2.0, where we didn't start the master daemon at all when this option was used. The manpage is also updated to remove the 2.0-only behavior.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
- Jul 25, 2009
Guido Trotter authored
With three ganeti daemons, and one or two more coming, the daemons' main functions had accumulated a lot of cut&paste code. This collapses most of it into a daemon.GenericMain function. Some more code could be shared between the two http-based daemons, but since the new daemons won't be http-based we won't do that right now. As a bonus, a facility for overriding the network port on the command line is added for all network-based daemons.
Signed-off-by: Guido Trotter <ultrotter@google.com>
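A minimal sketch of such a generic entry point, assuming a check/exec function pair per daemon (the names and signature here are illustrative, not the actual daemon.py API):

```python
import optparse

def GenericMain(daemon_name, default_port, check_fn, exec_fn):
  """Shared main() skeleton for a ganeti daemon.

  check_fn validates the environment before we daemonize;
  exec_fn runs the daemon's main loop.
  """
  parser = optparse.OptionParser(prog=daemon_name)
  parser.add_option("-f", "--foreground", action="store_true",
                    dest="foreground", help="run in the foreground")
  parser.add_option("-p", "--port", type="int", dest="port",
                    default=default_port,
                    help="override the network port to listen on")
  options, args = parser.parse_args()
  if args:
    parser.error("this daemon accepts no arguments")
  check_fn(options)  # e.g. config checks, must-run-on-master checks
  exec_fn(options)   # the daemon-specific main loop
```

Each daemon's script then shrinks to building its check/exec functions and calling GenericMain.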
-
- Jul 24, 2009
Guido Trotter authored
The <DAEMON>_PID constants were created to reference a daemon's pid file, but actually contain the daemon's name, because the various functions that work with pidfiles derive the filename from the daemon name themselves. This removes the constants and uses the actual daemon name constants in their place.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
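The pattern described, sketched (the helper name and run directory are assumptions for illustration):

```python
import os.path

RUN_GANETI_DIR = "/var/run/ganeti"  # assumed pidfile directory

NODED = "ganeti-noded"
MASTERD = "ganeti-masterd"

def DaemonPidFileName(daemon_name):
  """Derive a daemon's pidfile path from its name alone."""
  return os.path.join(RUN_GANETI_DIR, "%s.pid" % daemon_name)

# Callers can use the daemon-name constant directly, so a separate
# <DAEMON>_PID constant adds nothing:
pidfile = DaemonPidFileName(MASTERD)
```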
-
Guido Trotter authored
The original LOG_<DAEMON_NAME> constants for daemon logfiles are gone. In their place there is a DAEMONS_LOGFILES dict, indexed by daemon name. This is a minor change, with the objective of making the daemons' main() functions, which are very similar to one another, more uniform.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
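The shape of the change, sketched with assumed paths:

```python
LOG_DIR = "/var/log/ganeti/"  # assumed log directory

# One dict, indexed by daemon name, replaces the per-daemon
# LOG_<DAEMON_NAME> constants:
DAEMONS_LOGFILES = {
    "ganeti-noded": LOG_DIR + "node-daemon.log",
    "ganeti-masterd": LOG_DIR + "master-daemon.log",
    "ganeti-rapi": LOG_DIR + "rapi-daemon.log",
}

# A uniform main() can then look its logfile up by name:
logfile = DAEMONS_LOGFILES["ganeti-masterd"]
```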
-
- Jul 23, 2009
Guido Trotter authored
Various modules set a debug global in utils to True when called in debugging mode, but the utils module supports no such global.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
-
- Jul 19, 2009
Iustin Pop authored
As a workaround for the job submit timeouts that we have, this patch adds a new luxi call for multi-job submit; the advantage is that all the jobs are added to the queue first, and only after that can the workers start processing them. This is definitely faster than per-job submit, where the submission of new jobs competes with the workers processing jobs. On a pure no-op OpDelay opcode (not on master, not on nodes), we have:
- 100 jobs:
  - individual: submit time ~21s, processing time ~21s
  - multiple: submit time 7-9s, processing time ~22s
- 250 jobs:
  - individual: submit time ~56s, processing time ~57s (run 2: ~54s, ~55s)
  - multiple: submit time ~20s, processing time ~51s (run 2: ~17s, ~52s)
This shows that we indeed gain on the client side, and maybe even on the total processing time for a high number of jobs. For just 10 or so jobs I expect the difference to be just noise. This will probably require increasing the timeout a little when submitting too many jobs - 250 jobs at ~20 seconds is close to the current rw timeout of 60s.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
(cherry picked from commit 2971c913)
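A sketch of how a client would use the batched call; SubmitManyJobs is the natural name for it on the luxi client, but treat the exact signature as an assumption:

```python
from ganeti import luxi, opcodes

client = luxi.Client()

# Build all job definitions up front; each job is a list of opcodes.
jobs = [[opcodes.OpTestDelay(duration=0.0, on_master=False,
                             on_nodes=[])]
        for _ in range(250)]

# One luxi round trip enqueues everything; the workers only start
# competing with the submitter after all jobs are in the queue.
job_ids = client.SubmitManyJobs(jobs)
```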
-
- Jul 14, 2009
Guido Trotter authored
SimpleStore is a lot less heavyweight than SimpleConfigReader, and to just get the master name we can use that. This is currently the only usage of SimpleConfigReader, but we're not going to delete the class, as new usages will come in for ganeti-confd (in 2.1). Using it there, though, will make the class even heavier to load, so it makes sense to convert this simple usage now.
Signed-off-by: Guido Trotter <ultrotter@google.com>
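The conversion is essentially a one-liner; a sketch assuming the ssconf.SimpleStore class and its GetMasterNode accessor:

```python
from ganeti import ssconf

# SimpleStore only reads the small ssconf files, so getting the master
# name no longer pays the cost of parsing the full configuration:
master_name = ssconf.SimpleStore().GetMasterNode()
```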
-
- Jul 08, 2009
Guido Trotter authored
When the parameter is set to True and start_daemons is also True, ganeti-masterd will be started with the new --no-voting --yes-do-it options. This option is set to True only on masterfailover, when no_voting is used. This changes the behavior from 2.0, where we didn't start the master daemon at all when this option was used. The manpage is also updated to remove the 2.0-only behavior.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Guido Trotter authored
This will be used by ganeti-noded to start ganeti-masterd in a --no-voting masterfailover.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
- Jul 07, 2009
Michael Hanselmann authored
If a user used ^Z to stop the program, poll() in socket.recv would return EAGAIN due to SIGSTOP. This patch changes luxi.Transport.Recv to ignore EAGAIN.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
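The fix amounts to retrying the receive when errno says EAGAIN; a minimal sketch of the pattern, not the literal luxi.Transport.Recv code:

```python
import errno
import socket

def _RecvIgnoringEagain(sock, size):
  """recv() wrapper that retries after EAGAIN, e.g. following SIGSTOP."""
  while True:
    try:
      return sock.recv(size)
    except socket.error, err:
      if err.args[0] == errno.EAGAIN:
        continue  # spurious wakeup, just retry the call
      raise
```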
-
- Jun 15, 2009
Iustin Pop authored
This is used in multiple places outside cmdlib.py, so it's a more interesting patch.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
-
- May 21, 2009
Iustin Pop authored
As a workaround for the job submit timeouts that we have, this patch adds a new luxi call for multi-job submit; the advantage is that all the jobs are added to the queue first, and only after that can the workers start processing them. This is definitely faster than per-job submit, where the submission of new jobs competes with the workers processing jobs. On a pure no-op OpDelay opcode (not on master, not on nodes), we have:
- 100 jobs:
  - individual: submit time ~21s, processing time ~21s
  - multiple: submit time 7-9s, processing time ~22s
- 250 jobs:
  - individual: submit time ~56s, processing time ~57s (run 2: ~54s, ~55s)
  - multiple: submit time ~20s, processing time ~51s (run 2: ~17s, ~52s)
This shows that we indeed gain on the client side, and maybe even on the total processing time for a high number of jobs. For just 10 or so jobs I expect the difference to be just noise. This will probably require increasing the timeout a little when submitting too many jobs - 250 jobs at ~20 seconds is close to the current rw timeout of 60s.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
-
- May 04, 2009
Iustin Pop authored
Currently, lib/luxi.py uses lib/serializer.py for encoding/decoding messages, but the master daemon uses the simplejson module directly. This is wrong, as any non-trivial change to serializer.py will break the master daemon. The patch changes masterd to use exactly the same functions as luxi.py for encoding/decoding messages.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
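Conceptually, both ends now share one pair of helpers; a sketch using serializer.DumpJson/LoadJson, which is what I'd expect luxi.py to call (verify the exact names against the tree):

```python
from ganeti import serializer

def EncodeMessage(msg):
  # masterd and the luxi clients serialize through the same helper, so
  # any change to the wire encoding affects both ends together.
  return serializer.DumpJson(msg)

def DecodeMessage(data):
  return serializer.LoadJson(data)
```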
-
- Apr 06, 2009
Iustin Pop authored
This patch raises an error in the master daemon in case the user requests a locking query; accordingly, all clients were modified to send only lockless queries. This is a short-term fix; for a proper fix the clients should be modified to submit a job when the user requests a locking query. The other approach would be to ignore the flag passed by the client; this would be worse, as clients wouldn't even get an error. The possible impact of this is multiple:
- some commands could have been left unconverted, and thus fail; this can be remedied easily
- the consistency of commands is lost; e.g. node failover will not lock the node *while we get the node info*, so we could miss some data; this is again in the vein of the atomic operations which are missing in the current query-and-act model of the gnt-* scripts
Reviewed-by: imsnah, ultrotter
-
Iustin Pop authored
This patch will log data about queries, which are today completely invisible (at the default log level) in the master log file.
Reviewed-by: imsnah
-
- Feb 27, 2009
Guido Trotter authored
Some hypervisors (KVM) need RUN_GANETI_DIR to exist even at cluster init time. This patch creates it in InitCluster just before hv parameter checking. Since the code to make a list of directories is already repeated twice in the code, and this would be the third time, we abstract it into a utils.EnsureDirs function and call that one from ganeti-noded, ganeti-masterd and bootstrap.
Reviewed-by: iustinp
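A sketch of the shared helper, assuming it takes (path, mode) pairs:

```python
import errno
import os

def EnsureDirs(dirs):
  """Make sure each (path, mode) directory in dirs exists.

  Creates missing directories and enforces the requested mode, so
  ganeti-noded, ganeti-masterd and bootstrap can share one list
  instead of open-coding the same loop.
  """
  for path, mode in dirs:
    try:
      os.mkdir(path, mode)
    except OSError, err:
      if err.errno != errno.EEXIST:
        raise
    os.chmod(path, mode)  # enforce the mode on pre-existing dirs too
```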
-
- Feb 12, 2009
Iustin Pop authored
This patch introduces a 'force' mode for the master daemon startup, in which the voting process is not done but the user has to confirm the startup manually (before forking, of course).
Reviewed-by: imsnah
-
- Feb 04, 2009
Iustin Pop authored
This is the last query that RAPI executes via opcodes, and it is purely static (config values only). As such, we can safely convert it to a query instead of a job.
Reviewed-by: imsnah
-
Iustin Pop authored
This patch adds the framework for, and enables, lockless OpQueryInstances. This means that instances will be shown in ERROR_up or ERROR_down state even though this is not an error (but just an in-progress job). The framework is implemented as follows:
- the OpQueryInstances, OpQueryNodes and OpQueryExports opcodes take an additional "use_locking" flag which denotes whether to lock or not; this patch only implements this for LUQueryInstances
- the luxi query functions take an additional use_locking argument which is passed to the master daemon, and then on to the above opcodes
- cli.py exports a new SYNC_OPT command line option which sets this flag to true
- except for gnt-instance list, which uses this option, and for name-only queries (e.g. QueryNodes(fields=["names"])), all other callers set this flag to True
- RAPI also sets the flag to True
The patch was tested with a continuous (0.2s sleep in-between) gnt-instance list during a burnin, and no problems were observed.
Reviewed-by: ultrotter
-
- Jan 21, 2009
Iustin Pop authored
Two are real errors (invalid names) and one is a style error (overriding a name from an outer scope).
Reviewed-by: ultrotter
-
- Jan 20, 2009
Iustin Pop authored
(this is related to the master daemon log) Currently it's not possible to follow (in non-debug runs) the logical execution thread of jobs. This is due to the fact that we don't log the thread name (so we lose the association of log messages to jobs) and we don't log the start/stop of job and opcode execution. This patch adds a new parameter to utils.SetupLogging that enables thread name logging, and promotes some log entries from debug to info. With this applied, it's easier to understand which log messages relate to which jobs/opcodes. The patch also moves the "INFO client closed connection" entry to debug level, since it's not a very informative log entry.
Reviewed-by: ultrotter
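The standard logging module already tracks the thread name on every record, so the change is mostly a format-string switch; a sketch of the idea (the actual utils.SetupLogging signature is an assumption):

```python
import logging

def SetupLogging(logfile, debug=False, threadname=False):
  """Configure logging, optionally prefixing messages with the thread.

  With threadname=True each line carries %(threadName)s, which is what
  lets log messages be tied back to the job worker that emitted them.
  """
  if threadname:
    fmt = "%(asctime)s: %(threadName)s %(levelname)s %(message)s"
  else:
    fmt = "%(asctime)s: %(levelname)s %(message)s"
  if debug:
    level = logging.DEBUG
  else:
    level = logging.INFO
  handler = logging.FileHandler(logfile)
  handler.setFormatter(logging.Formatter(fmt))
  root = logging.getLogger("")
  root.setLevel(level)
  root.addHandler(handler)
```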
-
- Jan 09, 2009
Iustin Pop authored
The current fork+close-fds sequence has deficiencies which are hard to work around:
- logging can start before we fork (e.g. if we need to emit messages related to master checking), and thus use FDs which we can't track nicely
- the queue locks the queue file, and again this fd needs to be kept open, which is hard to do from the main loop (and this error is currently hidden by the fact that we don't log it)
Given the above, it's much simpler, in case we fork later, to close file descriptors right at the beginning of the program, and in Daemonize to only close/reopen the stdin/out/err fds. In addition, we also close() the handlers we remove in SetupLogging, so that the cleanup is more thorough.
Reviewed-by: imsnah
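Under this scheme Daemonize only has to deal with the standard streams, since everything else was closed at program start; an illustrative sketch, not the literal utils.Daemonize:

```python
import os

def Daemonize(logfile):
  """Fork into the background, redirecting the std streams.

  Assumes all other inherited file descriptors were closed at program
  startup, so only stdin/stdout/stderr are touched here.
  """
  if os.fork() != 0:
    os._exit(0)  # original parent exits
  os.setsid()
  if os.fork() != 0:
    os._exit(0)  # session leader exits, grandchild continues
  os.chdir("/")
  devnull = os.open(os.devnull, os.O_RDWR)
  logfd = os.open(logfile, os.O_WRONLY | os.O_APPEND | os.O_CREAT)
  os.dup2(devnull, 0)  # stdin from /dev/null
  os.dup2(logfd, 1)    # stdout to the logfile
  os.dup2(logfd, 2)    # stderr to the logfile
```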
-
- Jan 06, 2009
Iustin Pop authored
Two bad indentation cases and a missing variable.
Reviewed-by: imsnah
-
- Dec 18, 2008
Michael Hanselmann authored
With a large job queue, auto-archiving jobs can take a very long time, causing timeouts on the luxi RPC layer. With this change, auto-archive returns after half of the RPC timeout has passed. The user will see how many jobs are left unchecked.
Reviewed-by: ultrotter
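The mechanism is a simple time budget on the server side; a sketch with an assumed timeout constant and hypothetical queue helpers:

```python
import time

LUXI_RW_TIMEOUT = 60.0  # assumed luxi read/write timeout, in seconds

def AutoArchiveJobs(queue, age):
  """Archive eligible jobs, stopping after half the RPC timeout.

  Returns (archived, unchecked) so the client can tell the user how
  many jobs were not even examined.
  """
  deadline = time.time() + LUXI_RW_TIMEOUT / 2.0
  jobs = queue.GetEligibleJobs(age)  # hypothetical helper
  archived = 0
  for idx, job in enumerate(jobs):
    if time.time() > deadline:
      return archived, len(jobs) - idx
    queue.ArchiveJob(job)  # hypothetical helper
    archived += 1
  return archived, 0
```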
-
- Dec 11, 2008
Iustin Pop authored
This patch should fix all outstanding epydoc parsing errors; as such, we switch epydoc into verbose mode so that any new errors will be visible.
Reviewed-by: imsnah
-
- Dec 02, 2008
Iustin Pop authored
The ssconf files were not updated by the master failover. We need to push them, and since we already have RPC initialized, we can use the standard ConfigWriter to do so - this will take care of both the config file and the ssconf files.
Reviewed-by: imsnah
-
- Nov 26, 2008
Guido Trotter authored
Since we're not sure ganeti-noded has started yet, we need to create RUN_GANETI_DIR before SOCKET_DIR as well, with the proper permissions.
Reviewed-by: imsnah
-
- Nov 25, 2008
Guido Trotter authored
Before, the socket was in the abstract Linux namespace, where unfortunately we couldn't easily check the credentials of connecting clients from Python. Now we also have to remove the file at startup and on exit.
Reviewed-by: imsnah
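A sketch of a filesystem-based master socket with the remove-on-start behavior described (the path and backlog are illustrative):

```python
import errno
import os
import socket

MASTER_SOCKET = "/var/run/ganeti/socket/ganeti-master"  # assumed path

def OpenMasterSocket():
  """Bind the master socket on the filesystem, replacing stale ones."""
  # An abstract-namespace name would start with "\0"; a real path lets
  # us use filesystem permissions and check who is connecting.
  try:
    os.unlink(MASTER_SOCKET)  # clean up after an unclean shutdown
  except OSError, err:
    if err.errno != errno.ENOENT:
      raise
  sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
  sock.bind(MASTER_SOCKET)
  sock.listen(5)
  return sock
```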
-
Guido Trotter authored
If SOCKET_DIR doesn't exist we create it in the master daemon, before trying to put a socket inside it.
Reviewed-by: imsnah
-
- Nov 21, 2008
Michael Hanselmann authored
Removing the PID file should be the last thing done. This patch makes sure it's also removed when master.server_cleanup() throws an exception. Also initialize logging only after writing the PID file.
Reviewed-by: iustinp
-
Michael Hanselmann authored
ganeti-masterd: Add initialization and shutdown of the RPC pool. It needs to be shut down before forking.
ganeti.cli: Add a decorator function to initialize and shutdown the RPC pool.
ganeti.rpc: Add functions to initialize and shutdown the RPC pool. Throw an exception when used without proper initialization.
gnt-cluster, gnt-node: Use the decorator function to initialize and shutdown the RPC pool.
Reviewed-by: iustinp
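A sketch of the decorator half of this, assuming rpc.Init/rpc.Shutdown as the pool entry points:

```python
from ganeti import rpc

def RunWithRPC(fn):
  """Decorator bringing the RPC pool up around a CLI function."""
  def wrapper(*args, **kwargs):
    rpc.Init()  # assumed pool initialization
    try:
      return fn(*args, **kwargs)
    finally:
      rpc.Shutdown()  # assumed pool teardown, safe before forking
  return wrapper

@RunWithRPC
def AddNode(opts, args):
  # Any RPC issued here finds an initialized connection pool.
  pass
```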
-
- Oct 20, 2008
Iustin Pop authored
The two main multi-node job queue RPC calls (jobqueue_update, jobqueue_rename) are converted to address-based calls, in order to speed up queue changes. For this, we need to change the _nodes attribute on the jobqueue to be a dict {name: ip}, instead of a set.
Reviewed-by: imsnah
-
Iustin Pop authored
Since we now use only one function from the logger module (SetupLogging), we move it to utils.py (which is already imported by all users of this function), and we remove the logger module.
Reviewed-by: imsnah
-
- Oct 16, 2008
Iustin Pop authored
In order to account for future improvements to master failover, we move the actual data gathering capabilities from ganeti-masterd into bootstrap.py, leaving only the verification in masterd. The verification procedure is changed to retry multiple times (up to one minute) in case most nodes do not respond, and the algorithm is changed to require at least half (but not half+1) of the votes, since our own vote should count too (and we vote for ourselves). Example for a consistent (config-wise) cluster:
- 5 node cluster, 2 nodes down: still start
- 4 node cluster, 2 nodes down: retry for one minute, abort
Reviewed-by: ultrotter
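A sketch of one plausible reading of the two checks (retry while too few nodes answer, then require at least half of the votes); the real logic lives in bootstrap.py and masterd, so treat the names and thresholds as assumptions:

```python
def CheckMasterCandidacy(votes, num_nodes, my_name):
  """Decide whether we may start as master.

  votes maps node name -> who that node thinks is master (None if it
  did not answer); our own vote for ourselves is included.
  Returns "start", "retry" (ask again, up to ~one minute) or "abort".
  """
  answered = [v for v in votes.values() if v is not None]
  if 2 * len(answered) <= num_nodes:
    return "retry"  # most nodes unreachable; they may still be booting
  votes_for_me = len([v for v in answered if v == my_name])
  # at least half, not half+1: our own vote counts as well
  if 2 * votes_for_me >= num_nodes:
    return "start"
  return "abort"
```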
-
Iustin Pop authored
This adds the set/reset in the jqueue and luxi modules, a way to query it in OpQueryConfigValues, and also the command line interface for it:

    $ gnt-cluster queue info
    The drain flag is unset
    $ gnt-cluster queue drain
    $ gnt-cluster queue info
    The drain flag is set
    $ gnt-cluster queue undrain
    $ gnt-cluster queue info
    The drain flag is unset

The choice of making the setting via luxi and not an opcode is that opcodes can't be executed when drained; we don't query via luxi, though, since in the future this might become a cluster property as opposed to a node one.
Reviewed-by: imsnah
-
- Oct 15, 2008
Iustin Pop authored
This patch adds a generic method to identify the ganeti error given its class name, and implements this across the luxi protocol.
Reviewed-by: imsnah
-
- Oct 10, 2008
Iustin Pop authored
This big patch changes the call model used in inter-node RPC from standalone function calls in the rpc module to an RpcRunner class that holds all the methods. This can be used in the future to enable smarter processing in the RPC layer itself (some quick examples: not setting the DiskID from cmdlib code, but only once in each rpc call, etc.). There are a few RPC calls that are made outside of the LU code, and these calls are left as staticmethods, so they can be used without a class instance (which requires a ConfigWriter instance).
Reviewed-by: imsnah
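A sketch of the shape of such a class (illustrative method names; the real RpcRunner lives in ganeti/rpc.py):

```python
class RpcRunner(object):
  """Holds the inter-node RPC calls as methods.

  Having a class instead of module-level functions gives the RPC layer
  a place for shared preprocessing, e.g. setting disk IDs once per
  call instead of in every LU.
  """
  def __init__(self, cfg):
    self._cfg = cfg  # a ConfigWriter instance

  def call_instance_start(self, node, instance):
    # Instance methods can consult self._cfg before making the call.
    raise NotImplementedError

  @staticmethod
  def call_node_start_master(node):
    # Calls needed outside LU code stay static, usable without a
    # ConfigWriter instance.
    raise NotImplementedError
```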
-
- Oct 07, 2008
Iustin Pop authored
Background: when we have multiple jobs in the queue (more than just a few), many of the jobs (up to the number of threads) will be in state 'running', although many of them could actually be blocked, waiting for some locks. This is not good, as one cannot easily see what is happening. The patch extends the possible opcode/job statuses with another one, waiting, which shows that the LU is in the lock-acquisition phase. The mechanism for doing so is simple: we initialize (in the job queue) the opcode with OP_STATUS_WAITLOCK, and when the processor is ready to give control to the LU's Exec, it calls a notifier back into the _JobQueueWorker that sets the opcode status to OP_STATUS_RUNNING (with the proper queue locking). Because this mechanism does not save the job, all opcodes on disk will be in status WAITLOCK and not RUNNING anymore, so we also change the load sequence to consider WAITLOCK as RUNNING. With the patch applied, creating five instances in parallel (via burnin) on a five-node cluster shows that only two are executing while three are waiting for locks.
Reviewed-by: imsnah
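A simplified sketch of the handshake described above (the names follow the description, not necessarily the exact code, and the lu object's methods are hypothetical):

```python
OP_STATUS_WAITLOCK = "waiting"
OP_STATUS_RUNNING = "running"

class OpProcessor(object):
  """Runs one opcode, reporting when its locks have been acquired."""

  def __init__(self, notify_fn):
    # notify_fn is supplied by the _JobQueueWorker; it updates the
    # opcode status under the proper queue locking.
    self._notify_fn = notify_fn

  def ExecOpCode(self, lu):
    # The job queue initialized the opcode as OP_STATUS_WAITLOCK.
    lu.AcquireLocks()   # may block; status still shows "waiting"
    self._notify_fn()   # worker flips status to OP_STATUS_RUNNING
    return lu.Exec()
```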
-
- Oct 06, 2008
Iustin Pop authored
This patch adds a new luxi call that implements auto-archiving of jobs older than a certain age (or -1 for all completed jobs), and the gnt-job command that makes use of this (with 'all' for -1).
Reviewed-by: imsnah
-