- 25 May, 2009 1 commit
Iustin Pop authored
Currently the watcher is broken when the queue is full, and thus does not fulfil its job as a queue cleaner. It also doesn't handle the queue-drained status nicely. This patch makes a few changes:
- first archive jobs, and only afterwards submit new ones; this fixes the case where the queue is already full and there are jobs suited for archiving (but not the case where the jobs are all too young to be archived)
- handle the job-queue-full and drained cases gracefully: log them nicely instead of emitting tracebacks
- reverse the initial value and special cases for update_file; we now whitelist instead of blacklist cases, since we have many more blacklist cases than vice versa, and we set the flag to True only after a successful run

The last change, especially, is a significant one: errors during the watcher run will no longer update the status file, and thus they won't be lost in the logs again.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
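A minimal sketch of the whitelist pattern described here; the helper names (archive_jobs, check_cluster, write_status_file) are hypothetical stand-ins, not the real watcher API:

    def run_watcher(archive_jobs, check_cluster, write_status_file):
        # update_file starts False and is whitelisted to True only
        # after a fully successful run; any failure therefore leaves
        # the status file (and its timestamp) untouched.
        update_file = False
        try:
            archive_jobs()      # archive first, freeing queue slots
            check_cluster()     # only then submit new jobs
            update_file = True  # explicit success marker
        finally:
            if update_file:
                write_status_file()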
-
- 20 May, 2009 1 commit
Iustin Pop authored
This patch modifies the watcher to keep an on-disk file with the instance status; this can be used from outside of Ganeti to react to instances being down (when the watcher cannot restart them).
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
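A sketch of how such a status file could be written atomically, so an external tool never reads a half-written file; the path handling and line format are assumptions, not the actual Ganeti format:

    import os
    import tempfile

    def write_instance_status(path, statuses):
        """statuses: iterable of (instance_name, state) pairs."""
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
        try:
            with os.fdopen(fd, "w") as statefile:
                for name, state in sorted(statuses):
                    statefile.write("%s %s\n" % (name, state))
            os.rename(tmp, path)  # atomic replacement on POSIX
        except BaseException:
            os.unlink(tmp)
            raise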
-
- 19 May, 2009 1 commit
Iustin Pop authored
Bugs in either our code or in associated libraries can bring the master daemon down, and this (due to the 2.0 architecture) stops all work on the cluster. Since the watcher already does periodic checks on the cluster, we modify it to try to start the master automatically in case of failure to connect. This will be tried only once per cycle. Also, in this case, the watcher status file is not updated; its timestamp will thus reflect the time of the last successful connection to the master. Side note: the 'except errors.ConfigurationError' part could be cleaned up, since in 2.0 we don't usually get that directly, and if we do it's an error and we shouldn't touch the file anyway; but that is not an rc5 change.
Signed-off-by: Iustin Pop <iustin@google.com>
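A hedged sketch of the retry logic the message describes; NoMasterError and the daemon invocation are stand-ins for the real luxi error and startup command:

    import logging
    import subprocess

    class NoMasterError(Exception):
        """Stand-in for the error raised when the master is unreachable."""

    def get_client(connect, max_attempts=2):
        # Try to connect; on failure, attempt to start the master
        # daemon (only once per watcher cycle) and retry a single time.
        for attempt in range(max_attempts):
            try:
                return connect()
            except NoMasterError:
                if attempt + 1 == max_attempts:
                    raise  # the status file stays untouched upstream
                logging.warning("Master seems down, trying to start it")
                subprocess.call(["ganeti-masterd"])  # assumed invocation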
-
- 06 Apr, 2009 2 commits
Iustin Pop authored
Currently the watcher spews error messages on non-master nodes. This patch cleans that up. Reviewed-by: imsnah
-
Iustin Pop authored
As per the mailing list discussion, this patch changes the watcher to use a single job (two opcodes) for getting the cluster state (node list and instance list); it will then compute the needed actions based on this data. The patch also archives this job and the verify-disks job. Reviewed-by: imsnah
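A sketch of the single-job idea; every name here is hypothetical, not the real luxi API:

    def gather_cluster_state(client):
        # One job with both queries: a single queue slot yields the
        # node list and the instance list together.
        job_id = client.submit_job([
            ("query_nodes", ["name", "bootid", "offline"]),
            ("query_instances", ["name", "status"]),
        ])
        node_data, instance_data = client.wait_for_job(job_id)
        client.archive_job(job_id)  # the patch archives this job too
        return node_data, instance_data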
-
- 09 Mar, 2009 1 commit
Iustin Pop authored
Currently, the watcher startup sequence is:
- open a luxi client
- get the instance list
- get the node boot ids
- open and lock the status file, and then:
  - archive jobs
  - restart the down instances
  - check disks

This, of course, can lead to problems when a node is (genuinely or not) locked for more than (watcher interval * maximum query clients) time. At that point, the master is completely unresponsive until the node is unlocked and all the watchers exit with an error, due to the state file being locked by the first watcher instance. This patch reworks the startup sequence to first open/lock the status file and only then open a luxi client; this should prevent the above case. Reviewed-by: ultrotter
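A sketch of the reworked ordering: take the exclusive state-file lock before consuming one of the master's limited query connections (the path and the exit behaviour are assumptions):

    import fcntl

    def open_state_file(path="/var/lib/ganeti/watcher.data"):
        # Lock first; if another watcher holds the lock, bail out
        # immediately instead of tying up a luxi connection.
        statefile = open(path, "a+")
        try:
            fcntl.flock(statefile, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except IOError:
            statefile.close()
            raise SystemExit("another watcher instance owns the lock")
        return statefile  # only now is it safe to open a luxi client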
-
- 24 Feb, 2009 1 commit
Iustin Pop authored
This patch removes the extra_args parameter and instead switches the instance to the HV_KERNEL_ARGS hypervisor option. This is a big change, but it's a needed cleanup: this extra parameter on all RPC calls is not generic, and we also need a persistent value here. Reviewed-by: imsnah
-
- 16 Feb, 2009 2 commits
Iustin Pop authored
The recent change (commit 2151) making the watcher handle offline nodes also saves the offline attribute to the state file, but this is not needed and breaks the checking of the boot ID. This patch simply removes it, restoring the correct behaviour. Reviewed-by: imsnah
-
Iustin Pop authored
This patch adds auto-archiving of jobs older than 6 hours to the watcher. Reviewed-by: imsnah
-
- 04 Feb, 2009 1 commit
Iustin Pop authored
This patch adds the framework for, and enables, lockless OpQueryInstances. This means that instances will be shown in ERROR_up or ERROR_down state even though this is not an error (but just an in-progress job). The framework is implemented as follows:
- the OpQueryInstances, OpQueryNodes and OpQueryExports opcodes take an additional “use_locking” flag, which denotes whether to lock or not; this patch only implements this for LUQueryInstances
- the luxi query functions take an additional use_locking argument, which is passed to the master daemon and then to the above opcodes
- cli.py exports a new SYNC_OPT command line option which sets this flag to true
- except for gnt-instance list, which uses this option, and for name-only queries (e.g. QueryNodes(fields=["names"])), all other callers set this flag to True
- RAPI also sets the flag to True

The patch was tested with a continuous (0.2s sleep in-between) gnt-instance list during a burnin, and no problems were observed. Reviewed-by: ultrotter
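A sketch of how such a flag might thread through a query helper; this is a hypothetical wrapper, not the real luxi signature:

    def query_instances(client, fields, names=None, use_locking=False):
        # use_locking=False returns a fast, possibly slightly stale
        # view (instances may show as ERROR_up/ERROR_down mid-job);
        # use_locking=True waits for locks and is fully consistent.
        return client.query("instances", fields, names or [], use_locking)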
-
- 13 Jan, 2009 1 commit
Iustin Pop authored
Reviewed-by: imsnah
-
- 11 Dec, 2008 1 commit
Iustin Pop authored
This patch should fix all outstanding epydoc parsing errors; as such, we switch epydoc into verbose mode so that any new errors will be visible. Reviewed-by: imsnah
-
- 05 Dec, 2008 1 commit
Iustin Pop authored
This patch changes the LUQueryInstances to show a different state for offline nodes and also modifies the watcher to understand the offline state in its checks. Reviewed-by: ultrotter
-
- 20 Oct, 2008 1 commit
Iustin Pop authored
Since we now use only one function from the logger module (SetupLogging), we move it to utils.py (which is already imported by all users of this function) and remove the module. Reviewed-by: imsnah
-
- 01 Oct, 2008 4 commits
Michael Hanselmann authored
Use RPC calls instead of ssconf. Reviewed-by: iustinp
-
Iustin Pop authored
The watcher didn't handle down nodes; fix this by ignoring (in the secondary node reboot checks) any node that doesn't return a boot id. Reviewed-by: imsnah
-
Iustin Pop authored
The watcher was using conflicting attributes of the instance:
- it queried the admin_/oper_state, which are booleans
- but it compared those to the status (which is a text field)

The code was changed to query the aggregated 'status' field, as that also returns an indication of node problems, and we can use this single field for all decisions. We still ask for the admin_state field, as that is needed for the activate-disks check (on secondary node restart). The patch also touches the watcher in some other parts:
- log exceptions more nicely
- convert a method to @staticmethod
- remove unused imports

Reviewed-by: imsnah
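A sketch of deciding on the single aggregated status field; the state names match the 2.0-era text values, but the decision table itself is illustrative:

    def choose_action(status):
        if status == "ERROR_down":
            return "restart"   # should be running but is down
        if status in ("ERROR_nodedown", "ERROR_nodeoffline"):
            return "skip"      # node problem, the watcher cannot help
        return "nothing"       # running, ADMIN_down, etc.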
-
Iustin Pop authored
The watcher had one last use of ganeti commands as opposed to sending requests via luxi; this patch changes it to use the cli functions. The patch also has two other changes:
- fix the docstring for OpVerifyDisks (found while converting this)
- enable stderr logging in the watcher when “-d” is passed

Reviewed-by: imsnah
-
- 07 Aug, 2008 1 commit
Michael Hanselmann authored
Reviewed-by: iustinp
-
- 30 Jul, 2008 1 commit
Iustin Pop authored
The 'old-style' info, error and debug logs do not make much sense. This patch unifies the SetupLogging and SetupDaemon functions. As a result, all the commands log to a 'commands.log' file. The patch also changes the log setup to keep going if there's an error in setting up the file logging, as long as we're logging to stderr. Also, burnin now logs to its own file (burnin.log). Reviewed-by: ultrotter
-
- 10 Jul, 2008 1 commit
Michael Hanselmann authored
Reviewed-by: iustinp
-
- 04 Jul, 2008 1 commit
Iustin Pop authored
This patch fixes two bugs:
- the state file is not saved, because we use the method for checking for updated data
- in two places 'Error' was used instead of 'Exception', which breaks error handling

Additionally:
- the unused 're' import has been removed
- a variable named 'id', which collides with a builtin function, has been renamed

Note that comparing the serialized forms might create false negatives (due to the dicts being reordered), but that will just cause an extra write of the file, which is sub-optimal but harmless. Reviewed-by: ultrotter
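A sketch of the serialized-form comparison the note refers to; json here stands in for the simplejson module of that era:

    import json

    def save_if_changed(path, data, original_serialized):
        # Key reordering between dumps can yield a false "changed"
        # verdict; the only cost is one unnecessary write.
        serialized = json.dumps(data)
        if serialized != original_serialized:
            with open(path, "w") as statefile:
                statefile.write(serialized)
        return serialized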
-
- 03 Jul, 2008 1 commit
Iustin Pop authored
It's better for daemons if:
- they log only to one log file
- the log level is included
- for debug runs, the filename/line number is included

This patch moves the custom formatter from the watcher to the logging module and generalizes it; then it changes the master daemon to use this function instead of the generic logging (which might be deprecated anyway in the future). Reviewed-by: imsnah
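A minimal sketch of such a formatter using the stdlib logging module; the real Ganeti helper differs in detail:

    import logging

    def setup_daemon_logging(logfile, debug=False):
        fmt = "%(asctime)s: %(levelname)s"
        if debug:
            fmt += " %(module)s:%(lineno)s"  # file/line for debug runs
        fmt += " %(message)s"
        handler = logging.FileHandler(logfile)
        handler.setFormatter(logging.Formatter(fmt))
        root = logging.getLogger("")
        root.addHandler(handler)
        root.setLevel(logging.DEBUG if debug else logging.INFO)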
-
- 18 Jun, 2008 8 commits
Michael Hanselmann authored
Reviewed-by: iustinp
-
Michael Hanselmann authored
This is the safest way to detect changes and the amount of data is small, so keeping a copy around is cheap enough. Reviewed-by: iustinp
-
Michael Hanselmann authored
Cleanup: _data is private and should not be modified from outside of this class. Reviewed-by: iustinp
-
Michael Hanselmann authored
Reviewed-by: iustinp
-
Michael Hanselmann authored
- Lock it before renaming
- Code cleanup; close() automatically unlocks it

Reviewed-by: iustinp
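A sketch of the lock-then-rename pattern (paths hypothetical); closing the file object is what releases the flock:

    import fcntl
    import os

    def install_locked(tmp_path, final_path):
        statefile = open(tmp_path, "r+")
        fcntl.flock(statefile, fcntl.LOCK_EX)  # lock before renaming
        os.rename(tmp_path, final_path)
        return statefile  # caller keeps it open; close() unlocks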
-
Michael Hanselmann authored
Reviewed-by: iustinp
-
Michael Hanselmann authored
Reviewed-by: ultrotter
-
Michael Hanselmann authored
- Log timestamp for all messages
- Write everything to logfile and optionally to stderr
- Log messages are no longer buffered, allowing a user to see progress

Reviewed-by: ultrotter
-
- 13 May, 2008 2 commits
Iustin Pop authored
Currently the watcher first runs the instance startup and then the boot-id method of disk reactivation. However, regardless of whether a node has rebooted, if we just started an instance there is no need for its disks to be activated again, since the instance start has already done that (if at all possible). The patch modifies the watcher to remember all started instances and not run activate-disks for them. Reviewed-by: ultrotter
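A sketch of the bookkeeping described above; the helpers are hypothetical stand-ins for the watcher's restart and activate-disks calls:

    def watcher_cycle(down_instances, rebooted_node_instances,
                      try_restart, activate_disks):
        started = set()
        for name in down_instances:
            if try_restart(name):
                started.add(name)  # remember successful starts
        for name in rebooted_node_instances:
            if name in started:
                continue  # the start already activated its disks
            activate_disks(name)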
-
Iustin Pop authored
Currently the watcher does activate disks (via bootid mechanisms) even for admin_down instances. This patch logs and skips over these instances. Reviewed-by: ultrotter
-
- 12 Dec, 2007 1 commit
Iustin Pop authored
This patch modifies the watcher to run the ‘gnt-cluster verify-disks’ command and to log its output (if any). Reviewed-by: imsnah
-
- 03 Dec, 2007 1 commit
Michael Hanselmann authored
- When line wrapping is needed, move spaces to the next line.
- Remove embedded line breaks from error messages.

Reviewed-by: schreiberal
-
- 13 Nov, 2007 1 commit
Michael Hanselmann authored
- Use constants for keys.
- Fix a bug through which automatic instance restarts wouldn't be limited.

Reviewed-by: iustinp
-
- 10 Oct, 2007 2 commits
Michael Hanselmann authored
Reviewed-by: iustinp
-
Michael Hanselmann authored
- Change format of watcher state file to JSON.
- Move log path for watcher script to constants.py.

Reviewed-by: iustinp
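A sketch of a JSON-format state file; the field names are illustrative, not the real schema:

    import json

    def save_state(path, state):
        # e.g. state = {"node": {"node1": {"bootid": "abc-123"}},
        #               "instance": {"web1": {"restarts": 2}}}
        with open(path, "w") as statefile:
            json.dump(state, statefile, indent=2)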
-
- 21 Sep, 2007 1 commit
Iustin Pop authored
We currently require that hostnames are FQDNs, not short names (node1.example.com instead of node1). We can allow short names as long as:
- we always resolve the names as returned by socket.gethostname()
- we rely on having a working resolver

These issues are not as big as they may seem, as we only did gethostname() in a few places in order to check for the master; we already required a working resolver all over the code for the other node names (and thus requiring the same for the current node name is normal). The patch moves some resolver calls from within the execution path to the checking path (which can abort without any problems). It is important that after this patch is applied, no name resolving is called from the execution path (LU.Exec() or other code called from within those methods), as in this case we get much better code flow. This patch also changes the functions for doing name lookups and encapsulates all functionality in a single class. The final change is that, by requiring a working resolver at all times, we can change the 'return None' into an exception and thus we don't have to check manually each time; only some special cases will check (ganeti-daemon and ganeti-watcher, which are not covered by the generalized exception handling in cli.py). The code is cleaner this way. Reviewed-by: imsnah
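A sketch of encapsulating the lookups in one class that raises instead of returning None (class and error names illustrative):

    import socket

    class ResolverError(Exception):
        pass

    class HostInfo:
        """Resolve a host name once and keep the results."""
        def __init__(self, name=None):
            self.name = name or socket.gethostname()
            try:
                self.fqdn, self.aliases, self.ipaddrs = \
                    socket.gethostbyname_ex(self.name)
            except socket.error as err:
                raise ResolverError("cannot resolve %s: %s"
                                    % (self.name, err))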
-
- 14 Aug, 2007 1 commit
Iustin Pop authored
This changes the raising of exceptions from 'raise Exception, value' to 'raise Exception(value)', as the first form will be removed in Python 3000 and the second form is preferred now. The changes also involve a few cases of changing from raising standard exceptions to using our own. The new version also fixes many pylint-generated warnings, especially in ganeti-noded, where I changed many methods to @staticmethod. There is no functionality change (barring any bugs).
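The two spellings side by side, for reference; only the call form parses under Python 3:

    class MyError(Exception):
        pass

    def fail():
        # old form, removed in Python 3:
        #   raise MyError, "something went wrong"
        # preferred form, valid in both Python 2 and 3:
        raise MyError("something went wrong")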
-