- 24 May, 2011 1 commit
-
-
Iustin Pop authored
Currently we generate an empty list only for the '-n node' invocation, but for iallocator we still call the iallocator (which needs an RPC call, etc.). By moving the computation of instances outside of the if block, we can return early from the LU.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
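A minimal sketch of the control-flow change described above, using plain functions rather than Ganeti's actual LU API (all names here are illustrative):

    # Compute the affected instances once, up front; if there are none we can
    # return immediately instead of going through the iallocator/RPC path.
    def select_instances(node_instances, use_iallocator, run_iallocator):
        if not node_instances:
            return []                      # early return, no iallocator call
        if use_iallocator:
            return run_iallocator(node_instances)
        return node_instances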
-
- 16 May, 2011 1 commit
-
-
Iustin Pop authored
This will allow stopping or starting an instance without changing the remembered state. While this seems counter-intuitive at first (it will create cluster verify errors), it can help in a few corner cases:
- shutting down an entire cluster for maintenance but without having to remember state
- doing testing of Ganeti itself
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
-
- 11 May, 2011 1 commit
-
-
Iustin Pop authored
There are multiple bugs in the code checking for N+1 failures during instance memory changes, and it needs significant work; in the meantime we can at least:
- change the warning message into an error (--force will skip checks)
- only make the checks when we increase the memory
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
-
- 09 May, 2011 2 commits
-
-
Iustin Pop authored
Currently, when converting an instance from plain to DRBD, the instance is blocked during the entire resync period. This patch adds the --no-wait-for-sync option so that the operation finishes as soon as the DRBD sync has started, without waiting for the entire sync. This makes the instance available much faster.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
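As a usage note (syntax assumed from the standard plain-to-DRBD conversion command, not verified against the 2.4 manpage), the new flag would presumably be appended to the usual invocation, along the lines of: gnt-instance modify -t drbd -n node2 --no-wait-for-sync instance1.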
-
Iustin Pop authored
This patch introduces the option of changing an instance's nodes when doing the disk recreation. The rationale is that currently if an instance lives on a node that has gone down and is marked offline, it's not possible to re-create the disks and reinstall the instance on a different node without hacking the config file. Additionally, the LU now locks the instance's nodes (which was not done before), as we most likely allocate new resources on them.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
-
- 06 May, 2011 2 commits
-
-
Iustin Pop authored
It makes no sense to show messages like:
Fri May 6 02:04:01 2011 - INFO: Resolved given name 'instance18' to 'instance18'
so we'll skip the message if the resolved name is identical to the requested one.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
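A minimal sketch of the suppression logic, not the actual Ganeti code (feedback_fn stands in for whatever logging callback the LU uses):

    def log_name_resolution(feedback_fn, requested, resolved):
        # Only mention the resolution when it actually changed the name.
        if resolved != requested:
            feedback_fn("INFO: Resolved given name '%s' to '%s'"
                        % (requested, resolved))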
-
Michael Hanselmann authored
The original code would get all node information and their groups before acquiring the necessary locks. With this patch the node information is only retrieved once all locks have been acquired. Groups are locked optimistically and verified after acquiring the node locks.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
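A hedged sketch of the "lock optimistically, verify afterwards" pattern described above, using a plain dict instead of Ganeti's config and locking objects:

    def verify_group_locks(node_to_group, locked_nodes, locked_groups):
        # Groups the nodes belong to *after* the node locks were acquired.
        wanted = set(node_to_group[name] for name in locked_nodes)
        missing = wanted - set(locked_groups)
        if missing:
            # The node/group mapping changed while we were waiting for locks.
            raise RuntimeError("Node groups changed since locks were acquired: %s"
                               % ", ".join(sorted(missing)))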
-
- 03 May, 2011 2 commits
-
-
Michael Hanselmann authored
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Iustin Pop authored
This removes (count of instances + count of nodes) lock acquires/releases.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
-
- 02 May, 2011 2 commits
-
-
Iustin Pop authored
At least one generates an epydoc error :)
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
-
Iustin Pop authored
Currently cluster verify doesn't check for bridge information; the only checks are done at instance create and failover/migrate time. This means a cluster that seems healthy will fail creation jobs. This patch implements a simple verification that all nodes (in the entire cluster, so it doesn't work well for multi-group setups) have all the required bridges: the default one plus any instance bridge.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
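An illustrative sketch of the bridge check (the data shapes are assumptions, not Ganeti's internals): every node must have the default bridge plus any bridge referenced by an instance NIC.

    def required_bridges(default_bridge, instance_nic_bridges):
        # instance_nic_bridges: one list of bridge names per instance
        bridges = set([default_bridge])
        for nic_bridges in instance_nic_bridges:
            bridges.update(b for b in nic_bridges if b)
        return sorted(bridges)

    def missing_bridges(node_bridges, default_bridge, instance_nic_bridges):
        wanted = required_bridges(default_bridge, instance_nic_bridges)
        return [b for b in wanted if b not in node_bridges]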
-
- 29 Apr, 2011 2 commits
-
-
Michael Hanselmann authored
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
If an iallocator is used, “gnt-instance replace-disks” would acquire the locks of all nodes (only the allocator will decide which node to use). Unfortunately the unneeded locks were not released during the operation, causing unnecessary delays for other jobs. This patch changes the LU to release unneeded locks and adds assertions.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
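A hedged sketch of the release step: once the iallocator has chosen the target node, keep only the node locks that are still needed (plain sets here, not Ganeti's locking layer):

    def locks_to_release(owned_node_locks, still_needed):
        assert set(still_needed) <= set(owned_node_locks), "needed lock not owned"
        # Everything we own but no longer need can be released early.
        return sorted(set(owned_node_locks) - set(still_needed))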
-
- 28 Apr, 2011 2 commits
-
-
Iustin Pop authored
This is a simple change to allow specifying a different VG for the meta device during the creation of instances and addition of disks via gnt-instance modify.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
-
Iustin Pop authored
This is a small change to make this function take a list of VG names, instead of a single one.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
-
- 27 Apr, 2011 5 commits
-
-
Iustin Pop authored
This patch enhances the multi-VG support in replace-disks by keeping the meta device in the same VG, as opposed to moving it to the data device's VG (note that we don't have a way to create the meta device in a different VG in the first place, but at least we correctly handle a custom config).
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
-
Doug Dumitru authored
When converting an instance from 'plain' to 'drbd', the old code would create the DRBD volumes in the default VG and the subsequent renames would then fail. This fix pulls the VG names from the existing plain volumes and places them into the new disk template. Running 'replace-disks' has a similar issue, with the new disks going into the wrong VG and the rename then failing. There might be a similar issue with 'recreate-disks', but I actually have no idea what recreate-disks does, so I did not look into it.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
-
Iustin Pop authored
A few issues in the clarity of the error messages are fixed:
- "ERROR: node node3: OS API version lenny-image": no preposition between the parameter type and the OS name; changed to "for lenny-image"
- "API version lenny-image differs from reference node node1: 10, 5 vs. 10, 20, 5, 15": parameters not sorted in display
- "OS variants list lenny-image differs from reference node node1: vs. default, i386": empty sets are not clearly delimited; changed to add [] around the sets: "node node1: [] vs. [default, i386]"
- "OS parameters lenny-image differs from reference node node1: vs. (u'dhcp', u'Whether to enable (yes) or disable (dhcp)')": ugly formatting in the OS parameters list, as we used to just "%s" the tuple; now it is "reference node node1: [] vs. [dhcp: Whether to enable (yes) or disable (dhcp)]"
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
-
Iustin Pop authored
This breaks Ganeti in multiple ways. If we don't make the check in gnt-node itself, then bootstrap.SetupNodeDaemon will restart the master daemon, making the operation fail:
node1# gnt-node add --readd node1
Cannot communicate with the master daemon.
Is it running and listening for connections?
The check in cmdlib is more of a safety check, as we shouldn't reach it. If we do (via a bad client), then it will prevent breakage in the job queue/config handling.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
-
Iustin Pop authored
IIRC we don't use punctuation at the end of error messages.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
-
- 20 Apr, 2011 1 commit
-
-
Michael Hanselmann authored
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
- 19 Apr, 2011 1 commit
-
-
Iustin Pop authored
The current wipe_chunk_size computation is doing min(int_value, float_value). For small disks (below 10GiB), the actual formula will result in the float value being chosen. This results in very interesting behaviour:
Wiping disk 0, offset 102.4, chunk 102.4
Wiping disk 0, offset 204.8, chunk 102.4
…
Wiping disk 0, offset 921.6, chunk 102.4
Wiping disk 0, offset 1024.0, chunk 1.13686837722e-13
Since these are passed to dd via %d, this will result in a call to dd specifying offset 1024 and count 0, which will fail. We just need to enforce conversion to int in order not to get bitten by floating point rounding errors. The patch also reorders some logging messages in order to log the chunk size.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
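A sketch of the rounding issue and the fix; the constants are illustrative, not necessarily Ganeti's values. min() of an int and a float yields a float, and the later %d truncation can then turn a tiny remainder into a zero-sized chunk.

    MAX_WIPE_CHUNK = 1024          # MiB, illustrative
    MIN_WIPE_CHUNK_PERCENT = 10    # illustrative

    def wipe_chunk_size(disk_size_mib):
        # Forcing int() avoids float chunk sizes such as 1.13686837722e-13.
        return int(min(MAX_WIPE_CHUNK,
                       disk_size_mib / 100.0 * MIN_WIPE_CHUNK_PERCENT))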
-
- 14 Apr, 2011 1 commit
-
-
Michael Hanselmann authored
Ganeti 2.3 introduced an optional feature to overwrite an instance's disks on creation. Unfortunately the code kept all locks while doing the wipe, slowing down the creation of multiple instances in parallel. This patch changes the code to wipe the disks only after releasing the locks.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
- 13 Apr, 2011 1 commit
-
-
Michael Hanselmann authored
Before this patch the message would look like “Some groups do not exist: [u'foo', u'bar']”; now it's “Some groups do not exist: foo, bar”.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
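A minimal illustration of the formatting change: join the names instead of interpolating the raw list, which prints the list repr (e.g. [u'foo', u'bar'] under Python 2):

    missing = [u"foo", u"bar"]
    # Old style: interpolates the repr of the list.
    print("Some groups do not exist: %s" % missing)
    # New style: "Some groups do not exist: foo, bar"
    print("Some groups do not exist: %s" % ", ".join(missing))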
-
- 06 Apr, 2011 1 commit
-
-
Michael Hanselmann authored
Until now LUInstanceQueryData always acquired locks for the instance(s) and nodes involved. In combination with long-running operations this prevented the use of “gnt-instance info”, even with the “--static” option. With this patch, locks are only acquired when explicitly requested in the opcode (like all other query operations).
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
- 04 Apr, 2011 1 commit
-
-
Iustin Pop authored
This changes the display from:
Mon Apr 4 02:29:46 2011 * Verifying N+1 Memory redundancy
Mon Apr 4 02:29:46 2011 - ERROR: node node2: not enough memory to accomodate instance failovers should node node1 fail
to:
Mon Apr 4 02:32:50 2011 * Verifying N+1 Memory redundancy
Mon Apr 4 02:32:50 2011 - ERROR: node node2: not enough memory to accomodate instance failovers should node node1 fail (33536MiB needed, 27910MiB available)
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
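A sketch of the enriched message (the helper is illustrative; the wording, including the spelling, mirrors the verify output quoted above):

    def n_plus_one_error(node, failing_node, needed_mib, free_mib):
        return ("node %s: not enough memory to accomodate instance failovers"
                " should node %s fail (%dMiB needed, %dMiB available)"
                % (node, failing_node, needed_mib, free_mib))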
-
- 10 Mar, 2011 2 commits
-
-
Stephen Shirley authored
There is currently no way to reset oob_program back to its default from the command line, which causes problems for cluster-merge. This patch means that the following now works:
gnt-cluster modify --node-parameters oob_program=
Signed-off-by: Stephen Shirley <diamond@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
-
Iustin Pop authored
Nodes can return unknown instances, so we shouldn't use the name as an index without checking.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
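A hedged sketch of the defensive lookup: a node may report an instance name the master does not know about, so check membership instead of indexing directly.

    def merge_node_report(node_report, known_instances):
        merged = {}
        for name, data in node_report.items():
            if name not in known_instances:
                continue          # unknown instance reported by the node: skip
            merged[name] = data
        return merged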
-
- 04 Mar, 2011 2 commits
-
-
Iustin Pop authored
This LU was introduced before the RPC result conversion from .data to .payload, and it has managed to keep the old-style usage (how? it's the only LU that does so). Fix by changing to payload, and add some extra logging for easier diagnosis.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Stephen Shirley <diamond@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
(cherry picked from commit 043beb38)
-
Iustin Pop authored
Commit 92fd2250 added consistency checks in the RPC layer, which broke the call_blockdev_getsizes RPC call (declared with an 's' at the end in rpc.py, without the 's' in the node daemon). The immediate fix is to correct the rpc function name; the long-term one will be to remove this duplication.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Stephen Shirley <diamond@google.com>
(cherry picked from commit ccfbbd2d)
-
- 28 Feb, 2011 1 commit
-
-
Iustin Pop authored
For the 2.4 release, we only add the missing RPC calls. However, this needs to be fixed properly, by preventing usage of mis-configured disks. Also add a bit more logging so that it's directly clear on which node the wipe is being done.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
-
- 17 Feb, 2011 4 commits
-
-
Iustin Pop authored
Currently, there is at least one LU that does wrong validation of HV parameters (against all nodes): LUClusterSetParams. It's possible to fix this case, but I went and modified the base functions to filter out non-vm_capable nodes, so all callers are protected. Note: the _CheckOSParams function is never called with the all-nodes list, so modifying it shouldn't be needed. However, I think it's safe to do so (and it shouldn't hurt, as an instance's node shouldn't ever lack the vm_capable bit).
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
-
Iustin Pop authored
Non-vm_capable nodes most likely don't have a hypervisor configured and/or storage, so the call will fail anyway.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
-
Iustin Pop authored
This LU was introduced before the RPC result conversion from .data to .payload, and it has managed to keep the old-style usage (how? it's the only LU that does so). Fix by changing to payload, and add some extra logging for easier diagnosis.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Stephen Shirley <diamond@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
-
Iustin Pop authored
Commit 92fd2250 added consistency checks in the RPC layer, which broke the call_blockdev_getsizes RPC call (declared with an 's' at the end in rpc.py, without the 's' in the node daemon). The immediate fix is to correct the rpc function name; the long-term one will be to remove this duplication.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Stephen Shirley <diamond@google.com>
-
- 10 Feb, 2011 1 commit
-
-
Iustin Pop authored
Commit a1cef11c fixed the export of non-vm_capable nodes, but inadvertently broke offline nodes. The update of the dict only needs to happen for online nodes, in the 'if' block. Without this patch, offline nodes keep the data from the last node that was not offline; the end result is that all nodes are considered online (unless the first node is offline, in which case an error will be raised).
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
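An illustrative sketch of the fix (plain data structures, not the actual LU code): per-node data is filled in only for online nodes, so an offline node can no longer inherit the values left over from the previous loop iteration.

    def collect_node_data(nodes, query_fn):
        # nodes: iterable of (name, offline) pairs; query_fn queries one node
        result = {}
        for name, offline in nodes:
            if not offline:
                result[name] = query_fn(name)
            else:
                result[name] = None    # explicit marker instead of stale data
        return result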
-
- 09 Feb, 2011 3 commits
-
-
Iustin Pop authored
Currently, for both primary and secondary offline nodes, we give the same message:
- ERROR: instance instance14: instance lives on offline node(s) node3
- ERROR: instance instance15: instance lives on offline node(s) node3
- ERROR: instance instance16: instance lives on offline node(s) node3
- ERROR: instance instance17: instance lives on offline node(s) node3
This is confusing, as an offline primary is in a different category than a secondary. The patch changes the warnings to have different error messages:
- ERROR: instance instance14: instance has offline secondary node(s) node3
- ERROR: instance instance15: instance has offline secondary node(s) node3
- ERROR: instance instance16: instance lives on offline node node3
- ERROR: instance instance17: instance lives on offline node node3
Thanks to Alexander Schreiber <als@google.com> for reporting this issue.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Alexander Schreiber <als@google.com>
-
Iustin Pop authored
Currently, cluster-verify says:
- ERROR: instance instance14: couldn't retrieve status for disk/0 on node3: node offline
- ERROR: instance instance14: instance lives on offline node(s) node3
- ERROR: instance instance15: couldn't retrieve status for disk/0 on node3: node offline
- ERROR: instance instance15: instance lives on offline node(s) node3
This is redundant, as the “lives on offline node” message should be all we need to understand the cluster situation. The patch fixes this and also corrects a very old idiom.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Stephen Shirley <diamond@google.com>
-
Iustin Pop authored
Currently, cluster verify shows N+1 warnings for offline nodes that have any redundant instances, since the memory data that we have for those nodes is zero, so any instance will trigger the warning. As the comment says, we already list secondary instances on offline nodes, so that warning is enough, and we skip the N+1 one.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Stephen Shirley <diamond@google.com>
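A minimal sketch of the skip (the data shape is an assumption): offline nodes report zero free memory, so the N+1 check is simply not run against them.

    def nodes_for_n_plus_one(node_info):
        # node_info: dict of node name -> {"offline": bool, ...}
        return [name for name, info in node_info.items()
                if not info["offline"]]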
-
- 28 Jan, 2011 1 commit
-
-
Iustin Pop authored
This patch implements recreation of instance disk symlinks when the activate-disks operation is run. Until now, it was not possible to re-create these symlinks without stopping and starting or migrating an instance, as the RPC call where this is done was only made during instance startup and migration. In order to do this, the blockdev_assemble RPC call needs the disk index too, which is added to the protocol. This is a change from 2.3 and makes instance startup incompatible (FYI).
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
-