- Jul 14, 2009
-
Guido Trotter authored
SimpleStore is a lot less heavyweight than SimpleConfigReader, and to just get the master name we can use that. This is the only usage of SimpleConfigReader currently, but we're not going to delete the class, as new usages will come in for ganeti-confd (in 2.1). Using it there, though, will make the class even heavier to load, so it makes sense to convert this simple usage now.
Signed-off-by: Guido Trotter <ultrotter@google.com>
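A minimal sketch of the lighter-weight lookup, assuming the ssconf.SimpleStore class and its GetMasterNode() accessor as present in the Ganeti tree; illustrative only, not the exact patched code:

    # Hedged sketch: fetch the master name via SimpleStore instead of
    # building a full SimpleConfigReader.
    from ganeti import ssconf

    def GetMasterName():
        # SimpleStore only reads the small ssconf files from the data dir
        return ssconf.SimpleStore().GetMasterNode()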
-
- Jul 08, 2009
-
Guido Trotter authored
When the parameter is set to True and start_daemons is also True, ganeti-masterd will be started with the new --no-voting --yes-do-it options. The new parameter is set to True only on masterfailover, when no_voting is used. This changes the behavior from 2.0, where we didn't start the master daemon at all when this option was used. The manpage is also updated to remove the 2.0-only note.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Guido Trotter authored
This will be used by ganeti-noded to start ganeti-masterd in a --no-voting masterfailover.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
- Jul 07, 2009
-
Michael Hanselmann authored
If a user used ^Z to stop the program, poll() in socket.recv would return EAGAIN due to SIGSTOP. This patch changes luxi.Transport.Recv to ignore EAGAIN.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
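A minimal sketch of the retry pattern described here, on a plain blocking socket; the helper name is hypothetical and not the actual luxi.Transport.Recv code:

    import errno
    import socket

    def RecvIgnoringEagain(sock, bufsize=4096):
        """Receive data, retrying when the call fails with EAGAIN."""
        while True:
            try:
                return sock.recv(bufsize)
            except socket.error as err:
                if err.args[0] == errno.EAGAIN:
                    continue  # e.g. after ^Z (SIGSTOP/SIGCONT), just retry
                raise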
-
- Jun 29, 2009
-
Iustin Pop authored
There are several volume-related RPC calls. This patch renames the 'volume_list' call to 'lv_list' to make its purpose clearer.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
-
- Jun 15, 2009
-
Iustin Pop authored
Since now all functions fail via _Fail, the "return True, …" is redundant, as all normal return paths have it; the True value can thus be added in the ganeti-noded handler. This means that all functions can now forget about the special result type and simply return normally, signalling all failures via _Fail(). Only a few functions must be handled specially (the recursive ones).
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
-
Iustin Pop authored
Since all RPC calls were converted, we can now:
- enforce the result type to (status, data)
- convert all unhandled exceptions to (False, str(err))
This makes sure that all unhandled errors are reported to RPC users.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
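A hedged sketch of how the noded dispatcher can enforce this contract; the function and its name are illustrative, not the actual daemon code:

    def HandleBackendCall(fn, *args):
        """Call a backend function, enforcing the (status, data) result type."""
        try:
            result = fn(*args)
        except Exception as err:
            # Unhandled errors are reported back to the RPC caller as a failure
            return (False, str(err))
        if not (isinstance(result, tuple) and len(result) == 2):
            # A backend function forgetting the new-style result is a bug
            return (False, "invalid result type from %s" % fn.__name__)
        return result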
-
Iustin Pop authored
Currently the OSes have special, customized error handling: the OS object can represent either a valid or an invalid OS, and the associated functions, instead of raising an exception or failing, create custom OS objects representing failed OSes. While this was good when no other RPC had failure handling, it's extremely different from how the other functions in backend.py signal failures. This patch reworks this completely:
- the OS object always represents valid OSes (the next patch will remove the valid/invalid field and associated constants)
- call_os_diagnose returns, instead of a list of OS objects, a list of (name, path, status, diagnose_msg); the status is then used in cmdlib to determine validity, and the status and diagnose_msg values are used in gnt-os for display
- call_os_get returns either a valid OS or an RPC remote failure (with the error message)
- the other functions in backend.py now just call backend.OSFromDisk(), which will either return a valid OS object or raise an exception
- the bulk of OSFromDisk was moved to _TryOSFromDisk, which returns (status, value) for the functions which don't want an exception raised
The gnt-os list and diagnose commands still work after this patch.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
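A hedged sketch of the split between the status-returning and exception-raising variants; the directory check is heavily simplified and the real code also validates API versions and the OS scripts:

    import os

    from ganeti import constants, objects

    def _TryOSFromDisk(name, base_dir=None):
        """Locate an OS definition, returning (status, payload) instead of raising."""
        search_path = [base_dir] if base_dir else constants.OS_SEARCH_PATH
        for dirname in search_path:
            os_dir = os.path.join(dirname, name)
            if os.path.isdir(os_dir):
                return True, objects.OS(name=name, path=os_dir)
        return False, "OS '%s' not found in search path %s" % (name, search_path)

    def OSFromDisk(name, base_dir=None):
        """Like _TryOSFromDisk, but raise on failure (for internal callers).

        The real implementation raises via the backend _Fail() helper; a plain
        exception keeps this sketch self-contained.
        """
        status, payload = _TryOSFromDisk(name, base_dir=base_dir)
        if not status:
            raise RuntimeError(payload)
        return payload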
-
Iustin Pop authored
This patch converts the job queue RPC calls to the new style result. It's done in a single patch as there are helper functions (in both jqueue and backend) that are used by multiple RPCs and need a synchronized change.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
-
Iustin Pop authored
This also removes custom post-processing from rpc.py; since this call has only one user, it was simple to move it back to the caller.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
-
Iustin Pop authored
This also cleans up its single use in cmdlib.py.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
-
Iustin Pop authored
This patch converts this RPC call to the new style result, and in the process also changes the meaning of QuitGanetiException's arguments and the node daemon's RPC call exception handler. The problem with the exception handler is that we used a two-stage one, and the inner one used to catch all exceptions (including this one), so in the logs we always had an exception logged instead of the normal 'leaving cluster' message. The patch also adds logging of the exception's arguments, so that we have a trail in the logs about the shutdown mode. The exception's arguments were reversed compared to the normal RPC result style; while that makes somewhat more sense for this exception, we change them so that they match the RPC result format.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
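A hedged illustration of the new argument order, which now mirrors the (status, payload) RPC result convention; the boolean and message here are made up, not the actual call sites:

    from ganeti import errors

    def RequestShutdown(clean):
        # Arguments now follow the RPC result order: (status, payload),
        # instead of the previous reversed order.
        raise errors.QuitGanetiException(clean, "Shutdown requested")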
-
Iustin Pop authored
This is used in multiple places outside cmdlib.py, so it's a more interesting patch.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
-
Iustin Pop authored
This should actually have a function in backend, but it's fine for now.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
-
Iustin Pop authored
Since backend.GetInstanceList() is used both as an RPC endpoint and as an internal function, it can't return (status, value). Instead, it returns only valid instance info and denotes failures via exceptions, while the ganeti-noded function adds the (True,) status. The patch also fixes a typo.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
-
Iustin Pop authored
This is a big change, because we need to clean up its users too. The call, and thus the LUVerifyDisks LU, used to differentiate between failure at the node level and failure at the LV level by returning different types in the RPC result. This is way too complicated for our needs. The patch changes to the new style result (an easy change), and then:
- changes LUVerifyDisks.Exec() to return a 3-element tuple instead of a 4-element one; we collapse the «nodes not reachable» and «nodes with LVM errors» cases into a single dict
- changes gnt-cluster to parse 3-element results and simplifies the per-error handling code
Note that the status is added in ganeti-noded and not in the function itself, as the function is used in other places too. This was tested with down nodes and broken VGs.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
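An illustrative sketch of the new result shape; only the first element (the collapsed node-error dict) follows directly from the description above, the meaning of the other two elements is my guess and all names are made up:

    def DemoVerifyDisksResult():
        # LUVerifyDisks.Exec() now returns three elements instead of four
        node_errors = {  # unreachable nodes and LVM failures, collapsed
            "node2.example.com": "volume group 'xenvg' missing",
        }
        offline_lv_instances = ["instance1.example.com"]   # guessed semantics
        missing_lvs = {"instance1.example.com": ["xenvg/disk0"]}  # guessed
        return node_errors, offline_lv_instances, missing_lvs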
-
Iustin Pop authored
This also removes some code from ganeti-noded and rpc.py, which should not do such processing of data but simply be glue code (or alternatively they could, if we had better infrastructure).
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
-
- Jun 09, 2009
-
Iustin Pop authored
This patch adds a simple failure reporting tool, similar to bdev's _ThrowError. In backend, we move towards the new-style RPC results (of type (status, payload)), and thus functions which use this style can very easily log and return the error message using this new function. The exception is declared here and not in errors.py since it's local to the node-daemon/backend combination.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
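A hedged sketch of such a helper, close in spirit to what backend.py gained here; the logging details and keyword handling are illustrative:

    import logging

    class RPCFail(Exception):
        """Error signalling an RPC failure from backend to ganeti-noded."""

    def _Fail(msg, *args, **kwargs):
        """Log an error and abort the current backend function via RPCFail."""
        if args:
            msg = msg % args
        if kwargs.get("log", True):
            logging.error(msg)
        raise RPCFail(msg)

    # Typical use inside a backend function:
    #   if not os.path.exists(path):
    #       _Fail("Path %s does not exist", path)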
-
- May 27, 2009
-
Iustin Pop authored
This (somewhat big) patch adds support for remotely rebooting nodes via whatever support the hypervisor has for such a concept. For KVM/fake (and containers in the future) this just uses sysrq plus a 'reboot' call if the sysrq method failed. For Xen, it first tries the above, and then a Xen-hypervisor reboot (we try sysrq first since that just requires opening a file handle, whereas a Xen reboot means launching an external utility). The user interface is:

    # gnt-node powercycle node5
    Are you sure you want to hard powercycle node node5? y/[n]/?: y
    Reboot scheduled in 5 seconds

The node hopefully reboots after sending the reply. In case the clock is broken, "time.sleep(5)" might take ages (but then I suspect SSL negotiation wouldn't work).
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
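A hedged sketch of the sysrq-then-reboot fallback (the Xen hypervisor-level reboot is omitted); writing 'b' to /proc/sysrq-trigger needs root and really does reboot the machine, so treat this purely as an illustration:

    import logging
    import subprocess
    import time

    def PowercycleNode(delay=5):
        """Hard-reboot the local node: sysrq first, plain 'reboot' as fallback."""
        time.sleep(delay)  # give the RPC reply a chance to reach the caller
        try:
            with open("/proc/sysrq-trigger", "w") as trigger:
                trigger.write("b")  # 'b' means immediate reboot, no sync/unmount
        except EnvironmentError:
            logging.exception("sysrq reboot failed, falling back to 'reboot'")
            subprocess.call(["reboot"])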
-
- May 25, 2009
-
Iustin Pop authored
This patch makes the watcher automatically restart the node and rapi daemons if they are not running (as per the PID file). This is not an exhaustive test; a better one would be a TCP connect to the port, and an even better one a simple protocol ping (e.g. GET / for rapi and an rpc_call_alive for noded), but since we don't know how they've been started we can't implement it today. rapi would need to write the SSL/port to a file, and noded something similar, so that we know how to connect.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
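A hedged sketch of the pidfile-based check; the utils helpers are the ones I believe the Ganeti tree provides, and restarting through the init script is an assumption:

    from ganeti import utils

    def EnsureDaemon(daemon):
        """Restart a daemon (e.g. 'ganeti-noded') if its pidfile says it's dead."""
        pid = utils.ReadPidFile(utils.DaemonPidFileName(daemon))
        if utils.IsProcessAlive(pid):
            return True
        result = utils.RunCmd(["/etc/init.d/ganeti", "restart"])  # path assumed
        return not result.failed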
-
Iustin Pop authored
Currently the watcher is broken when the queue is full, thus not fulfilling its job as a queue cleaner. It also doesn't handle the queue-drained status nicely. This patch makes a few changes:
- archive jobs first, and only submit new jobs afterwards; this fixes the case where the queue is already full and there are jobs suited for archiving (but not the case where the jobs are all too young to be archived)
- handle the job-queue-full and drained cases nicely: instead of tracebacks, log such cases cleanly
- reverse the initial value and special cases for update_file; we now whitelist instead of blacklist cases, since we have many more blacklist cases than vice versa, and we set the flag to True only after a successful run
The last change, especially, is a significant one: errors during the watcher run will no longer update the status file, and thus they won't be lost again in the logs.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
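A hedged sketch of the new ordering and error handling; JobQueueFull and JobQueueDrainError are, to the best of my knowledge, the exception classes raised in these cases:

    import logging

    from ganeti import errors

    def QueuePass(client, ops):
        """Archive old jobs first, then submit new work, tolerating a full queue."""
        try:
            client.AutoArchiveJobs(6 * 3600)  # archive before submitting
            client.SubmitJob(ops)
        except errors.JobQueueFull:
            logging.warning("Job queue is full, will retry on the next run")
        except errors.JobQueueDrainError:
            logging.warning("Job queue is drained, not submitting any jobs")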
-
- May 21, 2009
-
Iustin Pop authored
As a workaround for the job submit timeouts that we have, this patch adds a new luxi call for multi-job submit; the advantage is that all the jobs are added to the queue first, and only afterwards can the workers start processing them. This is definitely faster than per-job submit, where the submission of new jobs competes with the workers processing jobs. On a pure no-op OpDelay opcode (not on master, not on nodes), we have:
- 100 jobs:
  - individual: submit time ~21s, processing time ~21s
  - multiple: submit time 7-9s, processing time ~22s
- 250 jobs:
  - individual: submit time ~56s, processing time ~57s (run 2: ~54s, ~55s)
  - multiple: submit time ~20s, processing time ~51s (run 2: ~17s, ~52s)
This shows that we indeed gain on the client side, and maybe even on the total processing time for a high number of jobs; for just 10 or so I expect the difference to be just noise. This will probably require increasing the timeout a little when submitting too many jobs: 250 jobs at ~20 seconds is close to the current rw timeout of 60s.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
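A hedged usage sketch of the new call from the client side; SubmitManyJobs takes a list of jobs, each job being a list of opcodes (the OpTestDelay parameters are illustrative):

    from ganeti import luxi, opcodes

    def SubmitNoopJobs(count=100):
        """Submit many trivial jobs in a single luxi round trip."""
        client = luxi.Client()
        jobs = [[opcodes.OpTestDelay(duration=0, on_master=False, on_nodes=[])]
                for _ in range(count)]
        return client.SubmitManyJobs(jobs)  # one job id per submitted job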
-
- May 20, 2009
-
Iustin Pop authored
This patch modifies the watcher to keep an on-disk file with the instance status; this can be used from outside of ganeti to react to instances being down (when the watcher cannot restart them).
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
-
- May 19, 2009
-
Iustin Pop authored
Bugs in either our code or in associated libraries can bring the master daemon down, and this (due to the 2.0 architecture) stops all work on the cluster. Since the watcher already does periodic checks on the cluster, we modify it to try to start the master automatically in case of failures to connect to it. This will be tried only once per cycle. Also, in this case, we modify the code so that the watcher status file is not updated; its timestamp will thus reflect the time of the last successful connection to the master. Side note: the 'except errors.ConfigurationError' part could be cleaned up, since in 2.0 we don't usually get that directly, and if we do it's an error and we shouldn't touch the file anyway; but that is not an rc5 change.
Signed-off-by: Iustin Pop <iustin@google.com>
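A hedged sketch of the once-per-cycle retry; starting the master through the init script is an assumption, and the real watcher differs in the details:

    import logging

    from ganeti import luxi, utils

    def OpenMasterClient():
        """Connect to the master daemon, trying to start it once if that fails."""
        try:
            return luxi.Client()
        except Exception:
            logging.warning("Cannot connect to master daemon, trying to start it")
        if utils.RunCmd(["/etc/init.d/ganeti", "start"]).failed:  # path assumed
            return None
        return luxi.Client()  # second and last attempt this cycle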
-
- May 06, 2009
-
Guido Trotter authored
Sometimes reinstalls are slightly different from new installs. For example, certain partitions may need to be preserved across reinstalls. In order to allow that on a per-OS basis, we pass in the INSTANCE_REINSTALL variable to inform the create script when a reinstall is happening.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
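A hedged sketch of the backend side of this contract; the helper and the other variables are simplified, only the INSTANCE_REINSTALL flag itself is taken from the description above:

    def BuildCreateScriptEnv(instance, reinstall):
        """Build (part of) the environment passed to an OS create script."""
        return {
            "INSTANCE_NAME": instance.name,
            "HYPERVISOR": instance.hypervisor,
            # "1" when an existing instance is being reinstalled, so the
            # create script can e.g. preserve data partitions
            "INSTANCE_REINSTALL": "1" if reinstall else "0",
        }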
-
- May 05, 2009
-
Guido Trotter authored
This allows ganeti-noded to bind to only one interface rather than all the ones on the machine. The default behaviour doesn't change.
Signed-off-by: Guido Trotter <ultrotter@google.com>
-
- May 04, 2009
-
Iustin Pop authored
Currently lib/luxi.py uses lib/serializer.py for encoding/decoding messages, but the master daemon uses the simplejson module directly. This is wrong, as any non-trivial change to serializer.py would break the master daemon. The patch changes masterd to use exactly the same functions as luxi.py for encoding/decoding messages.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
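A hedged sketch of the shared encode/decode path, using the serializer.DumpJson/LoadJson helpers that luxi relies on; the wrapper functions themselves are illustrative:

    from ganeti import serializer

    def EncodeReply(success, result):
        """Encode a master daemon reply the same way luxi.py does."""
        return serializer.DumpJson({"success": success, "result": result})

    def DecodeRequest(data):
        """Decode a luxi request into (method, args)."""
        request = serializer.LoadJson(data)
        return request["method"], request["args"]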
-
- Apr 06, 2009
-
Iustin Pop authored
This patch raises an error in the master daemon in case the user requests a locking query; accordingly, all clients were modified to send only lockless queries. This is a short-term fix; for a proper fix, the clients should be modified to submit a job when the user requests a locking query. The other approach would be to ignore the flag passed by the client; this would be worse, as clients wouldn't even get an error. The possible impact of this is multiple:
- some commands might not have been converted, and would thus fail; this can be remedied easily
- the consistency of commands is lost; e.g. node failover will not lock the node *while we get the node info*, so we could miss some data; this is again in the thread of atomic operations which are missing in the current model of query-and-act from the gnt-* scripts
Reviewed-by: imsnah, ultrotter
-
Iustin Pop authored
Currently the watcher spews error messages on non-master nodes. This cleans it up. Reviewed-by: imsnah
-
Iustin Pop authored
As per the mailing list discussion, this patch changes the watcher to use a single job (two opcodes) for getting the cluster state (node list and instance list); it will then compute the needed actions based on this data. The patch also archives this job and the verify-disks job. Reviewed-by: imsnah
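A hedged sketch of fetching both lists with one job; the opcode fields and the cli.PollJob helper are my assumptions about the 2.0-era API:

    from ganeti import cli, luxi, opcodes

    def GetClusterState():
        """Fetch node and instance data with a single two-opcode job."""
        client = luxi.Client()
        job_id = client.SubmitJob([
            opcodes.OpQueryNodes(output_fields=["name", "offline"], names=[]),
            opcodes.OpQueryInstances(output_fields=["name", "status"], names=[]),
        ])
        # One result per opcode, in submission order
        node_data, instance_data = cli.PollJob(job_id, cl=client)
        return node_data, instance_data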
-
Iustin Pop authored
This patch will log data about queries, which are today completely invisible (at the default log level) in the master log file. Reviewed-by: imsnah
-
- Mar 09, 2009
-
Iustin Pop authored
Currently, the watcher startup sequence does:
- open a luxi client
- get the instance list
- get the node boot ids
- open and lock the status file, and:
  - archive jobs
  - restart the down instances
  - check disks
This, of course, can lead to problems when a node is (genuinely or not) locked for more than (watcher interval * maximum query clients) time. At that point the master is completely unresponsive until the node is unlocked and all the watchers exit with an error due to the state file being locked by the first instance. This patch reworks the startup sequence to first open/lock the status file, and only then open a luxi client. This should prevent the above case. Reviewed-by: ultrotter
-
- Feb 27, 2009
-
Guido Trotter authored
Some hypervisors (KVM) need RUN_GANETI_DIR to exist even at cluster init time. This patch creates it in InitCluster, just before the hv parameter checking. Since the code to make a list of directories is already repeated twice in the code, and this would be the third time, we abstract it into a utils.EnsureDirs function and call that one from ganeti-noded, ganeti-masterd and bootstrap. Reviewed-by: iustinp
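A hedged sketch of such a helper; the (path, mode) list signature is my assumption about the shape of utils.EnsureDirs:

    import errno
    import os

    def EnsureDirs(dirs):
        """Make sure each directory exists with the wanted mode.

        dirs is a list of (path, mode) tuples, e.g. [("/var/run/ganeti", 0o755)].
        """
        for path, mode in dirs:
            try:
                os.mkdir(path, mode)
            except OSError as err:
                if err.errno != errno.EEXIST:
                    raise
            os.chmod(path, mode)  # enforce the mode even if the dir existed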
-
- Feb 24, 2009
-
Iustin Pop authored
This patch removes the extra_args parameter and instead switches the instance to the HV_KERNEL_ARGS hypervisor option. This is a big change, but it's a needed cleanup: this extra parameter on all RPC calls is not generic, and we also need to have a persistent value here. Reviewed-by: imsnah
-
- Feb 16, 2009
-
Iustin Pop authored
The recent change (commit 2151) to the watcher to make it handle offline nodes also saves the offline attribute to the state file, but this is not needed and also breaks the checking of the boot ID. This patch simply removes it, restoring the correct behaviour. Reviewed-by: imsnah
-
Iustin Pop authored
This patch adds auto-archiving of jobs older than 6 hours to the watcher. Reviewed-by: imsnah
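A hedged usage sketch, assuming the luxi client's AutoArchiveJobs call, which archives finished jobs older than the given age in seconds:

    from ganeti import luxi

    def ArchiveOldJobs(max_age=6 * 3600):
        """Ask the master to archive finished jobs older than max_age seconds."""
        luxi.Client().AutoArchiveJobs(max_age)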
-
- Feb 13, 2009
-
Iustin Pop authored
This patch fixes many small issues related to the write functions:
- update the documentation w.r.t. how to add users
- update the instance add function for the latest API
- add instance delete
- fix the addition of tags
- update some error messages
Reviewed-by: imsnah
-
Iustin Pop authored
This patch changes the format of the HTTP error messages from text/html, which is hard to parse from RAPI clients, to JSON, which can be parsed automatically. The error message is an object which always contains three keys:
- code, an integer with the error code
- message, a short description
- explain, holding (if available) a description of the error
In order to implement this, there is a bit of change to the HTTP server and executor classes. I've tested it, and the error handling still works (though less optimally, with no error message) in case the error formatting itself raises an exception. Reviewed-by: imsnah
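A hedged sketch of how such an error body could be built on the server side; the helper is illustrative and the real HTTP layer may serialise it differently:

    import json

    def FormatRapiError(code, message, explain=None):
        """Build the JSON error document with the three documented keys."""
        return json.dumps({
            "code": code,         # integer HTTP error code
            "message": message,   # short description
            "explain": explain,   # longer explanation, may be empty
        })

    # Example body for a missing resource:
    #   {"code": 404, "message": "Not Found", "explain": "instance foo unknown"}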
-
Iustin Pop authored
This changes the RAPI error codes for luxi errors; a timeout error is now reported properly as 504, while any other luxi error is reported as 502. It would be good to convert even more errors into proper return codes in the future. Reviewed-by: imsnah
-
Iustin Pop authored
This patch displays a nicer error message compared to the default stacktrace. Reviewed-by: imsnah
-