- Apr 06, 2011
Iustin Pop authored
This has been observed to cause problems on real clusters via the following mechanism:
- a long job (e.g. a replace-disks) is keeping an exclusive lock on an instance
- the watcher starts and submits its query instances opcode, which wants shared locks for all instances
- after about an hour, the watcher job falls back to blocking acquire, after having acquired all other locks
- any instance opcode that wants an exclusive lock for an instance cannot start until the watcher has finished, even though there is no actual operation on that instance
In order to alleviate this problem, we simply increase the maximum timeout after which lock acquires fall back to blocking acquire or a priority increase. The timeout is computed such that we wait ~10 hours (instead of one) for this to happen, which should be within the maximum lifetime of a reasonable opcode on a healthy cluster. The timeout also means that priority increases will happen every half hour. We also increase the maximum wait interval to 15 seconds, otherwise we would have too many retries with the increased interval.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
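As a rough illustration of the timeout schedule described above (growing per-attempt timeouts capped at a maximum wait interval, with a fall back to a blocking acquire once the overall budget is spent), here is a minimal Python sketch; all constants and names are illustrative, not Ganeti's actual implementation:

    # Hypothetical constants; the real strategy and values differ.
    ATTEMPT_START = 1.0       # first per-attempt timeout, in seconds
    ATTEMPT_FACTOR = 1.05     # growth factor between attempts
    ATTEMPT_MAX = 15.0        # maximum wait interval per attempt
    TOTAL_BUDGET = 10 * 3600  # ~10 hours before switching to blocking acquire

    def attempt_timeouts():
        """Yields per-attempt timeouts until the total budget is spent."""
        total = 0.0
        current = ATTEMPT_START
        while total < TOTAL_BUDGET:
            timeout = min(current, ATTEMPT_MAX)
            yield timeout
            total += timeout
            current *= ATTEMPT_FACTOR

    def acquire_with_backoff(acquire_fn):
        """Tries timed acquires, then falls back to one blocking acquire."""
        for timeout in attempt_timeouts():
            if acquire_fn(timeout=timeout):
                return True
        return acquire_fn(timeout=None)  # blocking fallback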
- Apr 04, 2011
Iustin Pop authored
Before this, the output in the rapi daemon log was:
2011-04-04 03:09:51,026: ganeti-rapi pid=17447 INFO Reading users file at /var/lib/ganeti/rapi/users
2011-04-04 03:09:51,027: ganeti-rapi pid=17447 INFO ganeti-rapi daemon startup
This is confusing, as it might look like the read of the users file is part of the previous run. It happens because we log the 'daemon startup' message after prepare_fn, which can log things on its own. The patch simply moves the 'daemon startup' message to just before the prepare_fn call.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
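A simplified sketch of the ordering change, with hypothetical helper names (the real daemon startup code differs):

    import logging

    def run_daemon(daemon_name, prepare_fn, exec_fn):
        # Log the startup marker first, so anything prepare_fn logs (such as
        # reading the RAPI users file) clearly belongs to this run.
        logging.info("%s daemon startup", daemon_name)

        prep_result = prepare_fn() if prepare_fn else None
        exec_fn(prep_result)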
Iustin Pop authored
This changes the display from:
Mon Apr 4 02:29:46 2011 * Verifying N+1 Memory redundancy
Mon Apr 4 02:29:46 2011 - ERROR: node node2: not enough memory to accomodate instance failovers should node node1 fail
to:
Mon Apr 4 02:32:50 2011 * Verifying N+1 Memory redundancy
Mon Apr 4 02:32:50 2011 - ERROR: node node2: not enough memory to accomodate instance failovers should node node1 fail (33536MiB needed, 27910MiB available)
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
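A rough sketch of the kind of N+1 check behind this message; the data layout and function name are illustrative, not the actual cluster-verify code:

    def check_n_plus_one(failed_node, instances, node_free_memory):
        """If failed_node goes down, can each failover target absorb the
        instances that would move to it?"""
        # Sum the memory of instances whose primary is failed_node, grouped by
        # their secondary (failover target) node.
        needed = {}
        for inst in instances:
            if inst["pnode"] == failed_node and inst.get("snode"):
                snode = inst["snode"]
                needed[snode] = needed.get(snode, 0) + inst["memory"]

        errors = []
        for node, needed_mem in sorted(needed.items()):
            free = node_free_memory.get(node, 0)
            if needed_mem > free:
                errors.append("node %s: not enough memory to accommodate"
                              " instance failovers should node %s fail"
                              " (%dMiB needed, %dMiB available)"
                              % (node, failed_node, needed_mem, free))
        return errors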
- Mar 31, 2011
Iustin Pop authored
This is not needed for this function, and can interfere with debugging of ssh failures.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
- Mar 24, 2011
Michael Hanselmann authored
If the result of an opcode was a non-empty dictionary, it would be impossible to differentiate between input and result:
Input fields:
[…]
debug_level: 0
fields: cluster_name,master_node,volume_group_name
jobs: [[True, u'37922'], [True, u'37923'], [True, u'37924']]
Expected output:
Input fields:
[…]
debug_level: 0
fields: cluster_name,master_node,volume_group_name
Result:
jobs: [[True, u'37922'], [True, u'37923'], [True, u'37924']]
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
- Mar 17, 2011
Michael Hanselmann authored
When “ganeti-watcher” is called with an argument, it would hint at a non-existing “-f” parameter. With this patch the separate usage string is no longer necessary.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
- Mar 16, 2011
Michael Hanselmann authored
In some rare cases it can happen that a lock is re-created very soon after deletion, while the old instance hasn't been destructed yet. In such a case the code would detect a duplicate name and raise an exception. We have seen at least one case where this happened during the creation of many instances. It is not exactly clear how it came to be, but it appears to have occurred while different jobs fought for locks with short timeouts (in the case of instance creation, locks are added at this stage and removed shortly after if not all locks can be acquired). The issue is fixed by removing the check for duplicate names. To still guarantee a stable sort order for the lock information as shown by “gnt-debug locks”, a registration number is recorded for each lock in the monitor. A unittest is included to check for the situation.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
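A minimal sketch of the registration-number idea (illustrative only; the real lock monitor differs):

    import itertools
    import threading

    class LockMonitor(object):
        """Tracks locks for display; duplicate names are allowed, and a
        per-lock registration number keeps the listing order stable."""

        def __init__(self):
            self._lock = threading.Lock()
            self._counter = itertools.count(0)
            self._registered = {}  # registration number -> lock name

        def register(self, name):
            with self._lock:
                num = next(self._counter)
                self._registered[num] = name
                return num

        def unregister(self, num):
            with self._lock:
                self._registered.pop(num, None)

        def list_lock_names(self):
            # Sort by (name, registration number): stable even with duplicates.
            with self._lock:
                return [name for num, name in
                        sorted(self._registered.items(),
                               key=lambda item: (item[1], item[0]))]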
- Mar 15, 2011
Michael Hanselmann authored
The ability to split a string into a list of strings and integers can be handy elsewhere and is necessary for sorting query results by names.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
(cherry picked from commit f47941f8)
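A minimal sketch of such a split used as a natural-sort key (not necessarily Ganeti's exact implementation):

    import re

    _DIGITS = re.compile(r"(\d+)")

    def nice_sort_key(value):
        """Splits "node10" into ["node", 10, ""], so that sorting by this key
        orders "node2" before "node10"."""
        return [int(part) if part.isdigit() else part
                for part in _DIGITS.split(value)]

    # sorted(["node10", "node2", "node1"], key=nice_sort_key)
    # -> ["node1", "node2", "node10"]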
- Mar 04, 2011
Iustin Pop authored
This LU was introduced before the RPC result conversion from .data to .payload, and it has managed to keep the old-style usage (how? it's the only LU that does so). Fix by changing to payload, and add some extra logging for easier diagnosis.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Stephen Shirley <diamond@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
(cherry picked from commit 043beb38)
Iustin Pop authored
Commit 92fd2250 added consistency checks in the RPC layer, which broke the call_blockdev_getsizes RPC call (declared with 's' at the end in rpc.py, without 's' in the node daemon). The immediate fix is to correct the RPC function name; the long-term one will be to remove this duplication.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Stephen Shirley <diamond@google.com>
(cherry picked from commit ccfbbd2d)
Iustin Pop authored
PollJob returns the whole op_results, hence a list of opcode results.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
- Feb 28, 2011
Michael Hanselmann authored
The exception was never actually raised.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Adeodato Simo <dato@google.com>
Iustin Pop authored
For the 2.4 release, we only add the missing RPC calls. However, this needs to be fixed properly, by preventing usage of mis-configured disks. Also add a bit more logging so that it's directly clear on which node the wipe is being done.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
- Feb 25, 2011
Stephen Shirley authored
Signed-off-by: Stephen Shirley <diamond@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
- Feb 24, 2011
Stephen Shirley authored
Signed-off-by: Stephen Shirley <diamond@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
- Feb 22, 2011
Michael Hanselmann authored
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
- Feb 18, 2011
Iustin Pop authored
And also enable verbose display via the, well, verbose option. Man page and tests are updated, and the formatting is moved from 4 if statements to a data structure.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
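The "data structure instead of if statements" idea, sketched with hypothetical field names (not the actual gnt-* listing code):

    # Each field maps to a (header, formatter) pair, replacing a chain of
    # "if field == ...:" formatting branches.
    _FIELDS = {
        "name": ("Node", str),
        "mtotal": ("MTotal", lambda mib: "%.1fG" % (mib / 1024.0)),
        "mfree": ("MFree", lambda mib: "%.1fG" % (mib / 1024.0)),
        "pinst_cnt": ("Pinst", str),
    }

    def format_row(data, field_names):
        headers = [_FIELDS[f][0] for f in field_names]
        values = [_FIELDS[f][1](data[f]) for f in field_names]
        return headers, values

    # format_row({"name": "node1", "mtotal": 32768, "mfree": 30720,
    #             "pinst_cnt": 8}, ["name", "mtotal", "mfree", "pinst_cnt"])
    # -> (["Node", "MTotal", "MFree", "Pinst"], ["node1", "32.0G", "30.0G", "8"])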
Stephen Shirley authored
Signed-off-by: Stephen Shirley <diamond@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
Guido Trotter authored
Before c744425f, instance reinstall accepted the "os" and "nostartup" optional query parameters. With that commit it was changed to allow "os", "start" and "osparams" via the body rather than encoded in the URL. Unfortunately that commit introduced a bug which required the "os" parameter to be passed for body requests, and at least one of "os" or "nostartup" for query requests. This fix makes sure all parameters are optional again.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
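A hedged sketch of what "all parameters optional" means for the body request; the helper, the defaults shown and the example values are assumptions, not the actual RAPI code:

    def parse_reinstall_body(body):
        body = body or {}
        return {
            "os": body.get("os"),              # keep the current OS if omitted
            "start": body.get("start", True),  # assumed default: start again
            "osparams": body.get("osparams"),  # no OS parameter overrides
        }

    # All of these should be accepted:
    # parse_reinstall_body(None)
    # parse_reinstall_body({"os": "debootstrap+default"})
    # parse_reinstall_body({"start": False, "osparams": {"dhcp": "yes"}})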
- Feb 17, 2011
Iustin Pop authored
Currently, there is at least one LU that does wrong validation of HV parameters (against all nodes, LUClusterSetParams). It's possible to fix this case, but I went and modified the base functions to filter out non-vm_capable nodes, so all callers are protected. Note: the _CheckOSParams function is never called with an all-nodes list, so modifying it shouldn't be needed. However, I think it's safe to do so (and it shouldn't hurt, as an instance's node shouldn't ever lack the vm_capable bit).
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
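The filtering idea in miniature (hypothetical helper and data layout; the real cmdlib code differs):

    def check_hv_params(nodes, validate_fn):
        """Runs hypervisor-parameter validation only on vm_capable nodes.
        'nodes' as dicts with "name"/"vm_capable" keys is an assumption."""
        vm_nodes = [n["name"] for n in nodes if n["vm_capable"]]
        return validate_fn(vm_nodes)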
Iustin Pop authored
Since we don't have the data per design, UNAVAIL is appropriate here, while NODATA is not. The patch also adds a comment: if we extend the live fields list to contain other data in the future, we need to re-evaluate this solution. This should fix issue 143. The listing now shows (node2==offline, node3==not vm_capable):
Node  DTotal    DFree     MTotal    MNode     MFree     Pinst Sinst
node1 698.6G    630.5G    32.0G     1.0G      30.0G     8     7
node2 (offline) (offline) (offline) (offline) (offline) 9     4
node3 (unavail) (unavail) (unavail) (unavail) (unavail) 0     0
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
Iustin Pop authored
Because non-vm_capable nodes most likely don't have a hypervisor configured and/or storage, so the call will fail anyway.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
Iustin Pop authored
This LU was introduced before the RPC result conversion from .data to .payload, and it has managed to keep the old-style usage (how? it's the only LU that does so). Fix by changing to payload, and add some extra logging for easier diagnosis.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Stephen Shirley <diamond@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
Iustin Pop authored
Commit 92fd2250 added consistency checks in the RPC layer, which broke the call_blockdev_getsizes RPC call (declared with 's' at the end in rpc.py, without 's' in the node daemon). The immediate fix is to correct the RPC function name; the long-term one will be to remove this duplication.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Stephen Shirley <diamond@google.com>
- Feb 10, 2011
Iustin Pop authored
Commit a1cef11c fixed non-vm_capable nodes export, but inadvertently broke offline nodes. The update of the dict only needs to happen for online nodes, in the 'if' block. Without this patch, offline nodes keep the data from the last node that was not offline; the end result is that all nodes are considered online (unless the first node is offline, in which case an error will be raised).
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
- Feb 09, 2011
Iustin Pop authored
Currently, for both primary and secondary offline nodes, we give the same message:
- ERROR: instance instance14: instance lives on offline node(s) node3
- ERROR: instance instance15: instance lives on offline node(s) node3
- ERROR: instance instance16: instance lives on offline node(s) node3
- ERROR: instance instance17: instance lives on offline node(s) node3
This is confusing, as an offline primary is in a different category than a secondary. The patch changes the warnings to have different error messages:
- ERROR: instance instance14: instance has offline secondary node(s) node3
- ERROR: instance instance15: instance has offline secondary node(s) node3
- ERROR: instance instance16: instance lives on offline node node3
- ERROR: instance instance17: instance lives on offline node node3
Thanks to Alexander Schreiber <als@google.com> for reporting this issue.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Alexander Schreiber <als@google.com>
Iustin Pop authored
Currently, cluster-verify says:
- ERROR: instance instance14: couldn't retrieve status for disk/0 on node3: node offline
- ERROR: instance instance14: instance lives on offline node(s) node3
- ERROR: instance instance15: couldn't retrieve status for disk/0 on node3: node offline
- ERROR: instance instance15: instance lives on offline node(s) node3
This is redundant, as the “lives on offline node” message should be all we need to understand the cluster situation. The patch fixes this and also corrects a very old idiom.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Stephen Shirley <diamond@google.com>
Iustin Pop authored
Currently, cluster verify shows N+1 warnings for offline nodes having any redundant instances, since the memory data that we have for those nodes is zero, so any instance will trigger the warning. As the comment says, we already list secondary instances on offline nodes, so that warning is enough, and we skip the N+1 one.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Stephen Shirley <diamond@google.com>
- Feb 08, 2011
Stephen Shirley authored
The current code gives:
Failure: prerequisites not met for this operation: error type: wrong_input, error details: Selection filter does not match any instances
Signed-off-by: Stephen Shirley <diamond@google.com>
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
- Feb 04, 2011
Stephen Shirley authored
This is needed so cluster-merge can add nodes from other clusters.
Signed-off-by: Stephen Shirley <diamond@google.com>
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
- Feb 03, 2011
Iustin Pop authored
Currently, the export timeout is 10 times 20 seconds, but the import one is only 30 seconds. I'm raising this to 60 seconds, with two goals in mind:
- when debugging manually, this allows for easier synchronisation of the processes
- 60 equals three full 20-second intervals, which I think is better than just one and a half
This change shouldn't make a big difference either way (at most, it will possibly delay the job in case of failures by half a minute).
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
Iustin Pop authored
In case of failures, the recent daemon output is logged as %r on a list of unicode strings, which results in the (ugly):
Thu Feb 3 05:13:34 2011 snapshot/0 failed to send data: Exited with status 1 (recent output: [u' DUMP: Date of this level 0 dump: Thu Feb 3 05:13:18 2011', u' DUMP: Dumping /dev/mapper/6369a5f7-1e67-4d0d-a4f0-956b3649c6d7.disk0_data.snap-1 (an unlisted file system) to standard output', u' DUMP: Label: none', u' DUMP: Writing 10 Kilobyte records', u' DUMP: mapping (Pass I) [regular files]', u' DUMP: mapping (Pass II) [directories]', u' DUMP: estimated 54301 blocks.', u' DUMP: Volume 1 started with block 1 at: Thu Feb 3 05:13:19 2011', u' DUMP: dumping (Pass III) [directories]', u' DUMP: dumping (Pass IV) [regular files]', u'socat: E SSL_write(): Connection reset by peer', u"dd: dd: writing `standard output': Broken pipe", u' DUMP: Broken pipe', u' DUMP: The ENTIRE dump is aborted.'])
This patch joins this list and makes it a non-unicode string, thus resulting in the more readable (and ~10% shorter):
Thu Feb 3 05:16:04 2011 snapshot/0 failed to send data: Exited with status 1 (recent output: DUMP: Date of this level 0 dump: Thu Feb 3 05:15:58 2011\n DUMP: Dumping /dev/mapper/6369a5f7-1e67-4d0d-a4f0-956b3649c6d7.disk0_data.snap-1 (an unlisted file system) to standard output\n DUMP: Label: none\n DUMP: Writing 10 Kilobyte records\n DUMP: mapping (Pass I) [regular files]\n DUMP: mapping (Pass II) [directories]\n DUMP: estimated 54350 blocks.\n DUMP: Volume 1 started with block 1 at: Thu Feb 3 05:15:59 2011\n DUMP: dumping (Pass III) [directories]\nsocat: E SSL_write(): Connection reset by peer\ndd: dd: writing `standard output': Broken pipe\n DUMP: Broken pipe\n DUMP: The ENTIRE dump is aborted.)
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
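The gist of the fix, as a small hedged sketch (the real backup/export code and message differ):

    def format_failure(status, recent_output):
        # Join the captured output lines instead of interpolating the list
        # with %r, which would print u'...' reprs and escaped newlines.
        return ("Exited with status %s (recent output: %s)"
                % (status, "\n".join(recent_output)))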
Iustin Pop authored
This adds a message and nice handling of ^C, especially useful for ``gnt-job watch``.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
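What "nice handling of ^C" could look like in miniature (illustrative; not the actual cli code):

    import sys

    def watch_job(poll_fn):
        try:
            return poll_fn()
        except KeyboardInterrupt:
            # Don't dump a traceback; the job itself keeps running server-side.
            sys.stderr.write("Cancelled output watching; the job continues to"
                             " run in the background\n")
            return None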
Michael Hanselmann authored
The new import/export infrastructure in Ganeti 2.2 and up handles compression differently. It no longer writes compressed files to the destination. Unfortunately changing this behaviour would be non-trivial, so in the meantime setting “compression = none” will hopefully avoid some confusion.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
- Feb 02, 2011
Michael Hanselmann authored
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
Michael Hanselmann authored
This function can be used from a SIGHUP handler to reopen log files. Initial, simple unittests are included.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
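A minimal sketch of reopening log files from a SIGHUP handler (e.g. after logrotate); names and structure are illustrative, not the actual Ganeti code:

    import logging
    import signal

    def setup_logging(logfile):
        root = logging.getLogger()
        state = {"handler": None}

        def _reopen(signum=None, frame=None):
            # Swap in a freshly opened handler; real code would usually defer
            # the heavy work out of the signal handler itself.
            new = logging.FileHandler(logfile)
            new.setFormatter(logging.Formatter("%(asctime)s: %(message)s"))
            root.addHandler(new)
            old, state["handler"] = state["handler"], new
            if old is not None:
                root.removeHandler(old)
                old.close()

        _reopen()
        signal.signal(signal.SIGHUP, _reopen)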
Michael Hanselmann authored
It's passed in by most users (daemons, CLI scripts), and for the others (burnin, watcher) it certainly doesn't hurt, especially when using syslog.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
Michael Hanselmann authored
The I/O error will occur while opening the file, not while adding and configuring the handler.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
Michael Hanselmann authored
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
- Feb 01, 2011
Stephen Shirley authored
Signed-off-by: Stephen Shirley <diamond@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>