- Jul 30, 2010
-
-
Iustin Pop authored
Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
-
Iustin Pop authored
This patch fixes two issues with job archival.

First, LoadJobFromDisk can return None for a nonexistent job, and we shouldn't add None to the job list; we can't anyway, as this raises an exception:

  node1# gnt-job archive foo
  Unhandled protocol error while talking to the master daemon:
  Caught exception: cannot create weak reference to 'NoneType' object

Second, after fixing this, job archival of missing jobs will just continue silently, so we modify gnt-job archive to log jobs which were not archived and to return exit code 1 for any missing jobs.

Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
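The archival behaviour described in this commit could be sketched roughly as follows. This is only an illustration, not the Ganeti code: `load_job_from_disk`, `archive_jobs`, `main` and the dictionary-based job store are hypothetical stand-ins, and only the "skip None, report missing jobs, exit with 1" logic comes from the commit message.

```python
def load_job_from_disk(job_id, store):
    """Hypothetical loader: return the job, or None if no such job exists."""
    return store.get(job_id)


def archive_jobs(job_ids, store):
    """Return (archived_ids, missing_ids) for the requested job IDs."""
    archived, missing = [], []
    for job_id in job_ids:
        job = load_job_from_disk(job_id, store)
        if job is None:
            # Never add None to the job list; remember the ID for reporting.
            missing.append(job_id)
            continue
        archived.append(job_id)
    return archived, missing


def main(job_ids, store):
    """Return a shell-style exit code: 1 if any job could not be archived."""
    _, missing = archive_jobs(job_ids, store)
    for job_id in missing:
        print("Job %s could not be archived" % job_id)
    return 1 if missing else 0
```

Collecting the missing IDs instead of raising keeps one bad job ID from aborting the archival of the remaining jobs.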
-
- Jul 29, 2010
-
-
Iustin Pop authored
If we call burnin with only existing instances, then it will fail to create any of them, and thus in the removal phase it won't have anything to remove. Since calling luxi.SUBMIT_MULTIPLE_JOBS with an empty job set is an error (and will raise an exception), this creates a very strange error in burnin (which is unfortunately hidden by ExecJobSet()). As such, we modify CommitQueue to return immediately if it has an empty op queue. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
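A minimal sketch of the early-return guard described in this commit, assuming a queue object that batches operations before submitting them. The `JobQueue` class and `strict_submit` function are hypothetical; the latter only mimics a backend that rejects empty job sets.

```python
class JobQueue:
    """Hypothetical stand-in for a queue of pending operations."""

    def __init__(self, submit_fn):
        self._submit = submit_fn
        self.queued_ops = []

    def queue(self, op):
        self.queued_ops.append(op)

    def commit(self):
        """Submit all queued operations, or do nothing if there are none."""
        if not self.queued_ops:
            # Submitting an empty job set is an error, so return right away.
            return []
        ops, self.queued_ops = self.queued_ops, []
        return self._submit(ops)


def strict_submit(ops):
    """Mimics a backend that raises on an empty job set (assumption)."""
    if not ops:
        raise ValueError("empty job set")
    return ["result-of-%s" % op for op in ops]
```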
-
Iustin Pop authored
Currently, we require both --force and --force-multiple for skipping the confirmation on instance reinstalls. After offline conversations, this has been deemed to be excessive, and this patch changes the meaning of --force-multiple to be a “stronger” force, and not require both. So, to skip the prompts:

  - single instance reinstallation requires either --force or --force-multiple
  - multiple instance reinstallation requires --force-multiple

Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
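The rule stated in this commit can be written as a small decision function. This is a sketch under assumptions: `needs_confirmation` and its parameters are hypothetical names, and the actual gnt-instance code may issue more than one prompt.

```python
def needs_confirmation(instance_count, force=False, force_multiple=False):
    """Return True if the user should still be prompted before reinstalling."""
    if force_multiple:
        # The "stronger" force skips the prompt in every case.
        return False
    if instance_count <= 1 and force:
        # Plain force is enough for a single instance only.
        return False
    return True
```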
-
Iustin Pop authored
Currently, if a job execution raises a Ganeti-specific error (i.e. subclass of GenericError), then we encode it as (error class, [error args]). This matches the RAPI documentation. However, if we get a non-Ganeti error, then we encode it as simply str(err), a single string. This means that the opresult field is not according to the RAPI docs, and thus it's hard to reliably parse the job results.

This patch changes the encoding of a failed job (via failure) to always be an OpExecError, so that we always encode it properly. For the command line interface, the behaviour is the same, as any non-Ganeti errors get re-encoded as OpExecError anyway. For the RAPI clients, it only means that we always present the same type for results. The actual error value is the same, since the err.args is either way str(original_error); compare the original (doesn't contain the ValueError):

  "opresult": [ "invalid literal for int(): aa" ],

with:

  "opresult": [ [ "OpExecError", [ "invalid literal for int(): aa" ] ] ],

Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
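The uniform encoding described in this commit might look like the sketch below. The exception class names GenericError and OpExecError come from the commit message, but the class definitions and the `encode_exception` helper are simplified, hypothetical stand-ins rather than the Ganeti implementation.

```python
import json


class GenericError(Exception):
    """Stand-in for the base class of Ganeti-specific errors."""


class OpExecError(GenericError):
    """Stand-in for the error used to wrap operation failures."""


def encode_exception(err):
    """Encode any exception as (error class name, [error args])."""
    if not isinstance(err, GenericError):
        # Wrap foreign errors so the result always has the same shape.
        err = OpExecError(str(err))
    return (err.__class__.__name__, list(err.args))


def encode_opresult(err):
    """Serialize the encoded error the way a JSON client would see it."""
    return json.dumps(encode_exception(err))
```

With this shape, a client can always unpack the result as a pair of class name and argument list, whatever the original exception type was.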
-
Iustin Pop authored
This can be used from shell scripts to quickly check the status of the master node, before launching a series of jobs (and handling the failure of the jobs due to masterd or other issues). Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
-
Iustin Pop authored
Since we don't support upgrades from 1.2.4 without restarting the instance, the 'not restarted since 1.2.5' check/error is wrong/misleading. Since the live migration works anyway without the links (it recreates them during the disk reconfiguration anyway), we remove the check and we transform it into a warning (to the node daemon log only, unfortunately). For 2.3, we'll need to change the symlink creation from instance start time to disk activation time (but that requires more RPC changes). Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Guido Trotter <ultrotter@google.com>
-
Michael Hanselmann authored
During a discussion in July 2010 it was decided that we'll stabilize on /2. See message ID <20100716180012.GA9423@google.com> for reference. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
A lot of assertions are used in Ganeti's code. Some unittests even check whether AssertionError is raised in some cases. Explicitly ensuring assertions are evaluated makes sure those tests don't fail and assertions are checked. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
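For background on why this matters: CPython strips `assert` statements when run with `-O` (or with `PYTHONOPTIMIZE` set). The commit message does not say how the patch enforces evaluation, so the sketch below only demonstrates the underlying behaviour; the helper name `assertions_enabled` is hypothetical.

```python
import os
import subprocess
import sys


def assertions_enabled(extra_args=()):
    """Run a child interpreter and report whether 'assert False' fails."""
    env = dict(os.environ)
    # Make sure an inherited PYTHONOPTIMIZE does not influence the child.
    env.pop("PYTHONOPTIMIZE", None)
    cmd = [sys.executable, *extra_args, "-c", "assert False"]
    result = subprocess.run(cmd, env=env, stderr=subprocess.DEVNULL)
    # A non-zero exit status means the AssertionError was actually raised.
    return result.returncode != 0
```

Calling `assertions_enabled()` and `assertions_enabled(["-O"])` shows the difference between a normal and an optimized interpreter run.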
-
David Knowles authored
Signed-off-by:
David Knowles <dknowles@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
-
Michael Hanselmann authored
Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
By changing it to a normal parameter, which must be a sequence, we can start using keyword parameters. Before this patch all arguments to “AddTask(self, *args)” were passed as arguments to the worker's “RunTask” method. Priorities, which should be optional and will be implemented in a future patch, must be passed as a keyword parameter. This means “*args” can no longer be used as one can't combine *args and keyword parameters in a clean way:

  >>> def f(name=None, *args):
  ...   print "%r, %r" % (args, name)
  ...
  >>> f("p1", "p2", "p3", name="thename")
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  TypeError: f() got multiple values for keyword argument 'name'

Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
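The calling convention this commit moves to could be sketched as below, in modern Python. The class and method names are simplified and hypothetical, and the `priority` keyword is only a placeholder for the future patch mentioned above; it is stored but not used for ordering here.

```python
class WorkerPool:
    """Hypothetical, single-threaded stand-in for a worker pool."""

    def __init__(self):
        self._tasks = []

    def add_task(self, args, priority=None):
        """Queue a task; 'args' is one sequence, leaving room for keywords."""
        if not isinstance(args, (list, tuple)):
            # This sketch accepts only lists and tuples as the sequence.
            raise TypeError("args must be a list or tuple")
        self._tasks.append((priority, tuple(args)))

    def run_all(self, run_task):
        """Call run_task(*args) for each queued task, in insertion order."""
        results = [run_task(*args) for _, args in self._tasks]
        self._tasks = []
        return results
```

Because the task arguments travel as a single parameter, a call such as `pool.add_task(("p1", "p2"), priority=5)` is unambiguous.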
-
Michael Hanselmann authored
Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
Document that it should only be called from within RunTask and add an assertion for this. This means we can no longer use a method on the pool and hence remove WorkerPool.ShouldWorkerTerminate. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
HasRunningTask is never used except for an assertion, where we don't really need the lock. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
This way fewer private variables of the pool are accessed by the worker. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Iustin Pop <iustin@google.com>
-
Iustin Pop authored
This is related to issue 105. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
-
Iustin Pop authored
Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
-
- Jul 28, 2010
-
-
Iustin Pop authored
Most (all?) of our commands use a dash separator: replace-disks, verify-disks, add-tags, etc. “gnt-cluster masterfailover” is an old exception to this rule. The patch replaces it with master-failover, adds a compatibility alias, and updates the documentation for this change. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Guido Trotter <ultrotter@google.com>
-
Iustin Pop authored
Since the RAPI QA suite doesn't seem to offer easy testing of failed creations, I didn't add this to the QA. Pointers on how to do it are welcome. The patch also changes the 'os' argument to be required, since that is how the LU expects it, and without it we just fail later instead of directly at submission time. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Guido Trotter <ultrotter@google.com>
-
Iustin Pop authored
“find .” requires that “-path” arguments start with a dot, otherwise they are not matched. Additionally, we also include the QA files in the tags, for easier search while modifying the QA suite. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Guido Trotter <ultrotter@google.com>
-
Iustin Pop authored
Currently, if the cli.JobExecutor class is being used, and one of the jobs is archived before it can check its result, it will raise a stacktrace as _ChooseJob is not prepared to handle this case. This patch makes JobExecutor work better with lost jobs (it still reports them as 'failed', but it doesn't break and returns a proper error message), and modifies the generic FormatError to report the JobLost exception properly, instead of as "Unhandled Ganeti Exception". Since JobExecutor is hard to test properly, I only tested this manually, via a fake invocation. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Guido Trotter <ultrotter@google.com>
-
Iustin Pop authored
This patch adds handling of permission errors so that we don't show tracebacks when a non-root user runs a gnt-* command. Since in the future we'll have different permissions, we need to handle this in RAPI too. It also fixes a typo in RAPI error message and the docstrings of LUXI errors. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
-
Michael Hanselmann authored
The new name is then displayed by the clients. Signed-off-by:
Michael Hanselmann <hansmi@google.com> Reviewed-by:
Manuel Franceschini <livewire@google.com>
-
Manuel Franceschini authored
This patch fixes a bug when gnt-instance rename was invoked with --no-name-check. It renames the internal variables to be consistent with the ones in the equivalent instance add code. Furthermore, it checks whether an instance rename is invoked with --no-name-check but without --no-ip-check and throws an exception if so. Signed-off-by:
Manuel Franceschini <livewire@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
-
- Jul 26, 2010
-
-
Iustin Pop authored
Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
René Nussbaumer <rn@google.com>
-
Iustin Pop authored
This doesn't allow addition/removal of individual volumes, only wholesale replacement of the entire list. It can be improved later, if we ever get generic container parameters. The man page change replaces some tabs with spaces (hence the whitespace changes). Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
René Nussbaumer <rn@google.com>
-
Iustin Pop authored
Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
René Nussbaumer <rn@google.com>
-
Iustin Pop authored
This parameter, which is a list of regular expression patterns, will make cluster verify ignore any such LVs. It will not prevent creation or removal of such volumes by the backend code. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
René Nussbaumer <rn@google.com>
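How a list of regular expression patterns can be used to skip volumes during verification is sketched below. The function and parameter names are hypothetical, the volume names are made up, and matching from the start of the name with `re.match` is an assumption rather than documented behaviour.

```python
import re


def find_unknown_volumes(found_lvs, expected_lvs, reserved_patterns):
    """Return volumes that are neither expected nor matched by a pattern."""
    reserved = [re.compile(pattern) for pattern in reserved_patterns]
    unknown = []
    for lv in found_lvs:
        if lv in expected_lvs:
            continue
        if any(rx.match(lv) for rx in reserved):
            # Ignored by verification only; nothing stops it being created.
            continue
        unknown.append(lv)
    return unknown
```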
-
Iustin Pop authored
Currently, backend.StartMaster (the function behind this RPC call) will activate the master IP and then, if the start_daemons parameter is true, it will also activate the master role. While this works, it has two issues:

  - first, it will activate the master IP unconditionally, even if this node will not start the master daemon due to missing votes
  - second, the activation of the IP is done twice if start_daemons is true, because the master daemon does its own activation too

This behaviour seems to be unmodified since Summer 2008, so probably any rationale on why this is done in two places is forgotten.

The patch changes this so that the function does *either* IP activation or master role activation, but not both. So the IP will be activated only once (from the master daemon or from LURenameCluster), and it will only be done if the masterd got enough votes for startup.

I can see only one downside to this change: if masterd won't actually start (due to missing votes), RAPI will still start, and without the master IP activated. But this is no worse than before, when both RAPI was running and the IP was activated.

Note that the behaviour of StopMaster remains the same, as no one else does the IP removal.

Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
René Nussbaumer <rn@google.com>
-
Iustin Pop authored
Currently, the master IP activation is done in the Exec function. Since the original masterd process returns after forking, and Exec is run in the (grand)child process, this means that after 'ganeti-masterd' has returned there are still initialization tasks running. Normally this is not a problem, but in cases where one does quick master failovers, this creates a race condition which hits the QA scripts especially hard. To solve this, and make the startup process cleaner (the system is in steady state after the command has returned, even though masterd startup could still fail), we move the IP activation to Check(). This also allows error messages about the IP activation to be seen on the console. With this patch enabled, I can no longer reproduce the double-failover errors, which were occurring before in 4/5 cases. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
René Nussbaumer <rn@google.com>
-
Iustin Pop authored
This is needed because not just the cli scripts need this decorator, but the master daemon too (and it already duplicated the code once). In cli.py we just leave a stub, so that we don't have to modify all the scripts to import rpc.py. We then change the master daemon code to reuse this decorator, instead of duplicating it. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
René Nussbaumer <rn@google.com>
-
Iustin Pop authored
This patch implements a few changes to the instance handling. First, old instances which no longer exist on the cluster are removed from the state file, to keep things clean. Second, the instance restart counters are reset every 8 hours, since some error cases might be transient (e.g. networking issues, or machine temporarily down), and if the problem takes more than 5 restarts but is not permanent, watcher will not restart the instance. The value of 8 hours is, I think, both conservative (so as not to hammer the cluster too often with restarts) and fast enough to clear semi-transient problems. And last, if an instance is not restarted due to exhausted retries, a warning should be logged, otherwise it's hard to understand why watcher doesn't want to restart an ERROR_down instance. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
René Nussbaumer <rn@google.com>
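The three changes described in this commit could be sketched as follows. The 8-hour reset and the limit of 5 restarts come from the commit message; the constant names, the in-memory dictionary used as the state file and both function names are hypothetical.

```python
MAX_RESTARTS = 5            # restart attempts allowed per window
RETRY_EXPIRATION = 8 * 3600  # seconds after which counters are reset


def maintain_state(state, existing_instances, now):
    """Drop records for removed instances and expire old counters."""
    for name in list(state):
        if name not in existing_instances:
            del state[name]  # the instance no longer exists on the cluster
        elif now - state[name]["since"] >= RETRY_EXPIRATION:
            del state[name]  # the problem may have been transient; start over


def should_restart(state, name, now):
    """Return True and count the attempt, or warn once retries are exhausted."""
    entry = state.setdefault(name, {"count": 0, "since": now})
    if entry["count"] >= MAX_RESTARTS:
        print("Warning: not restarting %s, retries exhausted" % name)
        return False
    entry["count"] += 1
    return True
```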
-
- Jul 23, 2010
-
-
Iustin Pop authored
This patch adds handling of the new 'mode' parameter to the RAPI server, while keeping compatibility with the old mode. Note that in the old mode (when 'live' is being passed), the auto-mode doesn't work. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
René Nussbaumer <rn@google.com>
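One way to read the compatibility handling described in this commit is sketched below. The parameter names 'mode' and 'live' come from the commit message; the function name, the "live" and "non-live" string values, and the use of None for the automatic mode are assumptions made for illustration.

```python
def parse_migration_mode(query):
    """Map request parameters to a migration mode; None means automatic."""
    if "mode" in query:
        # New-style clients pass the mode directly.
        return query["mode"]
    if "live" in query:
        # Old-style boolean: it always forces a choice, so the automatic
        # mode cannot be selected this way.
        return "live" if query["live"] else "non-live"
    return None
```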
-
Iustin Pop authored
See the discussion on the previous patch about this. Basically, unless we want to add a new 'feature' marking for the live migration parameter, there is no simple way to handle this nicely in the client. Given that the client was/is marked as experimental, this patch simply replaces live with mode. This means that this client won't work with 2.1 clusters… Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
René Nussbaumer <rn@google.com>
-
Iustin Pop authored
This is breakage from the original 'live' parameter changes. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
René Nussbaumer <rn@google.com>
-
Iustin Pop authored
This is needed as now the parameter is no longer boolean, but tri-state. Signed-off-by:
Iustin Pop <iustin@google.com> Reviewed-by:
René Nussbaumer <rn@google.com>
-