- Mar 04, 2011
Iustin Pop authored
PollJob returns the whole op_results, hence a list of opcode results.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
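As a rough usage sketch only (based on this description; exact function signatures may differ between Ganeti versions), a caller submitting a single opcode would now index into the returned list:

    from ganeti import cli

    def run_single_opcode(client, op):
      # PollJob returns the whole op_results list, one entry per opcode,
      # so a single-opcode job takes the first element. (Sketch only.)
      job_id = client.SubmitJob([op])
      results = cli.PollJob(job_id, cl=client)
      return results[0]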
-
- Feb 03, 2011
Michael Hanselmann authored
The new import/export infrastructure in Ganeti 2.2 and up handles compression differently. It no longer writes compressed files to the destination. Unfortunately changing this behaviour would be non-trivial, so in the meantime setting “compression = none” will hopefully avoid some confusion.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
- Jan 26, 2011
Michael Hanselmann authored
This is analogous to the existing check for a responsive node daemon.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
At least ganeti-confd was not started; it only got started a few minutes later by ganeti-watcher. Also move one pylint disable to the effective line.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
Also replace the hardcoded “xenvg” with a constant.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Iustin Pop authored
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
-
Iustin Pop authored
This skips non-vm_capable nodes in the OS diagnose search, since OSes will not be used on those nodes anyway.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
-
René Nussbaumer authored
Using auto_promote or auto-promote can lead to confusion across the user-facing interfaces: while auto-promote is fine for the CLI, it's not for RAPI, and vice versa. This patch should eliminate this confusion.
Signed-off-by: René Nussbaumer <rn@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Iustin Pop authored
This is a follow-up patch to the one moving GetAllocatable out to module level.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
-
Michael Hanselmann authored
LVM PV storage units would always show as allocatable, even when they weren't. For some reason I have not been able to determine, the function parsing the attributes (“_GetAllocatable”) was not even called, and the list opcode simply returned the attribute string as the value (e.g. “a-”). Removing “@staticmethod” did the trick, and I then moved it to module level. A QA test is included.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
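A minimal sketch of the attribute parsing described here (simplified, not the actual Ganeti code), assuming the PV attribute string starts with “a” when the volume is allocatable:

    def _GetAllocatable(attr):
      """Parse an LVM PV attribute string (e.g. "a-") into a boolean (sketch)."""
      return bool(attr) and attr[0] == "a"

    # "a-" is allocatable, "--" is not.
    assert _GetAllocatable("a-")
    assert not _GetAllocatable("--")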
-
- Jan 20, 2011
Michael Hanselmann authored
With this patch, the exporting node will retry connecting a few times. The receiving node will make use of the master's increased timeout (see the previous patch).
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
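A minimal retry sketch of this behaviour (retry count and delay are assumptions, not the actual Ganeti values):

    import time

    def connect_with_retries(connect_fn, retries=3, delay=10):
      """Call connect_fn() up to `retries` times before giving up (sketch only)."""
      for attempt in range(retries):
        try:
          return connect_fn()
        except EnvironmentError:
          if attempt == retries - 1:
            raise
          time.sleep(delay)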
-
Michael Hanselmann authored
It's been shown that 60 seconds may not be enough to establish a connection.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
- Jan 07, 2011
Michael Hanselmann authored
The data was already there, but not shown.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Adeodato Simo <dato@google.com>
-
- Jan 06, 2011
Michael Hanselmann authored
If the SSH command fails, this will give a more detailed error message than before.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
The source cluster has to shut down an instance before it can be exported. Doing so can take a while, but the default connection timeout is only 60 seconds. Adding the shutdown timeout on the receiving cluster should help.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
(cherry picked from commit dae91d02)
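Sketched arithmetic for this change (constant name and value are assumptions): the importing side adds the source cluster's shutdown timeout on top of the plain connect timeout.

    DEFAULT_CONNECT_TIMEOUT = 60  # seconds; assumed default

    def import_connect_timeout(shutdown_timeout):
      # The importing side may have to wait for the source instance to shut
      # down before any data arrives, so the two timeouts are added up.
      return DEFAULT_CONNECT_TIMEOUT + shutdown_timeout

    # Example: a 300 s shutdown timeout yields a 360 s connect timeout.
    assert import_connect_timeout(300) == 360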
-
- Dec 29, 2010
Michael Hanselmann authored
Since the recent change to leave jobs in the “waitlock” status (commit 5fd6b694), cancelling a job while it's back in the queue would break. This patch handles these cases and adds a unittest.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
- Dec 20, 2010
Michael Hanselmann authored
Point out that jobs already submitted continue to run.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
If the socket can't be read in time, it raises “socket.timeout”, for which there is special handling code. Unfortunately the except blocks were in the wrong order, so “socket.error” caught it first.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
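A minimal sketch of the ordering issue (hypothetical reader function): since “socket.error” is a base class of “socket.timeout” here, the more specific handler must be listed first or it never runs.

    import socket

    def read_response(sock, size=4096):
      """Read from a socket, treating a timeout separately (sketch only)."""
      try:
        return sock.recv(size)
      except socket.timeout:
        # Must come before socket.error, otherwise the base-class handler
        # swallows the timeout as well.
        return None
      except socket.error:
        raise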
-
- Dec 15, 2010
Adeodato Simo authored
`gnt-cluster verify` was failing with a KeyError if there were any diskless instances in the cluster. This was because _CollectDiskInfo() was not including these instances in the returned dictionary, but they were expected to be present in LUVerifyCluster.Exec(). With this commit, we ensure that the dictionary returned by _CollectDiskInfo includes entries for diskless instances as well.
Signed-off-by: Adeodato Simo <dato@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
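A hedged sketch of the idea (simplified names, not the actual cmdlib code): pre-seed the result dictionary so that diskless instances still get an empty entry and later lookups cannot raise KeyError.

    def collect_disk_info(instance_names, per_instance_disks):
      """Return a dict with one entry per instance, even diskless ones (sketch)."""
      result = dict((name, []) for name in instance_names)
      for name, disks in per_instance_disks.items():
        result[name] = list(disks)
      return result

    # "inst2" has no disks but is still present in the result.
    info = collect_disk_info(["inst1", "inst2"], {"inst1": ["disk0"]})
    assert info["inst2"] == []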
-
Michael Hanselmann authored
Iustin Pop reported that a job's file is updated many times while it waits for locks held by other thread(s). After an investigation it was concluded that the reason was a design decision for job priorities to return jobs to the “queued” status if they couldn't acquire all locks. Changing a job's status or priority requires an update to permanent storage. In a high-level view this is what happens:
1. Mark as waitlock
2. Write to disk as permanent storage (jobs left in this state by a crashing master daemon are resumed on restart)
3. Wait for lock (assume lock is held by another thread)
4. Mark as queued
5. Write to disk again
6. Return to workerpool
Another option originally discussed was to leave the job in the “waitlock” status. Ignoring priority changes, this is what would happen:
1. If not in waitlock
1.1. Assert state == queued
1.2. Mark as waitlock
1.3. Set start_timestamp
1.4. Write to disk as permanent storage
3. Wait for locks (assume lock is held by another thread)
4. Leave in waitlock
5. Return to workerpool
Now let's assume the lock is released by the other thread:
[…]
3. Wait for locks and get them
4. Assert state == waitlock
5. Set state to running
6. Set exec_timestamp
7. Write to disk
As this change reduces the number of writes from two per lock acquire attempt to two per opcode, plus one per priority increase (which happens after 24 acquire attempts, see mcpu._CalculateLockAttemptTimeouts, until the highest priority is reached), here's the patch to implement it. Unittests are updated.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
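A rough sketch of the new flow (heavily simplified; the queue and job attribute names are assumptions, not the real JobQueue API): the job is written to disk when it first enters “waitlock” and again when it actually starts running, instead of on every failed lock-acquire attempt.

    def process_opcode(job, op, queue):
      """One lock-acquire attempt for an opcode (sketch of the new behaviour)."""
      if job.status != "waitlock":
        assert job.status == "queued"
        job.status = "waitlock"
        job.start_timestamp = queue.CurrentTime()
        queue.UpdateJobOnDisk(job)      # write #1, once per opcode

      if not queue.TryAcquireLocks(op):
        # Locks held by another thread: stay in "waitlock", no extra disk
        # write, just return the job to the worker pool for another attempt.
        return False

      assert job.status == "waitlock"
      job.status = "running"
      job.exec_timestamp = queue.CurrentTime()
      queue.UpdateJobOnDisk(job)        # write #2, once per opcode
      return True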
-
- Dec 09, 2010
Iustin Pop authored
Commit b8d26c6e added disk status verification, but it has two (different) bugs for unhealthy nodes. For offline nodes, we don't add the disk status to the instance/node dict at all, with the result that the instance is not present in the instdisk dict if all of its nodes are offline; this creates a KeyError later when we call VerifyInstance with instdisk[instance]. For online nodes which don't return a valid disk status, we simply set the status to None for each disk, but the code in _VerifyInstance presumes and requires that each status is a valid tuple of length two. For both these bugs, we redo the instdisk computations to always include valid data, and we enhance the asserts to check for consistency.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
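A simplified sketch of the repaired instdisk computation (helper and field names are assumptions): every (instance, node) pair gets an entry, and missing or invalid node data is replaced with a well-formed (success, message) tuple.

    def build_instdisk(instance_nodes, node_disk_status):
      """instance_nodes: inst -> [nodes]; node_disk_status: node -> inst -> data."""
      instdisk = {}
      for inst, nodes in instance_nodes.items():
        for node in nodes:
          statuses = node_disk_status.get(node, {}).get(inst)
          if not statuses:
            # Offline node or invalid answer: substitute a valid tuple.
            statuses = [(False, "no valid disk status from node %s" % node)]
          instdisk.setdefault(inst, {})[node] = statuses
      return instdisk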
-
Guido Trotter authored
Currently the code wrongly changes the disk logical/physical id component representing the path from "$storage_dir/$iname/disk$seq" to "$storage_dir/$iname/disk/$seq" (note the additional slash), breaking the rename.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
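A minimal illustration of the path bug (hypothetical helper names and example paths): the last component must stay “disk<N>”, not become a “disk/<N>” subdirectory.

    import os.path

    def new_disk_path(storage_dir, instance_name, seq):
      # Correct: .../<instance>/disk0
      return os.path.join(storage_dir, instance_name, "disk%d" % seq)

    def broken_disk_path(storage_dir, instance_name, seq):
      # Buggy variant: .../<instance>/disk/0 - the extra slash breaks the rename.
      return os.path.join(storage_dir, instance_name, "disk", str(seq))

    assert new_disk_path("/srv/storage", "instance1", 0) == \
        os.path.join("/srv/storage", "instance1", "disk0")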
-
- Dec 01, 2010
Michael Hanselmann authored
Just being told that a lock doesn't exist can be confusing. One case where this happens is when a job (e.g. instance modify) waits for a job removing the instance (e.g. export with remove).
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
This uses an option only available in patched socat versions. More information is available from the INSTALL update included in this patch.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
- Nov 30, 2010
Adeodato Simo authored
Signed-off-by: Adeodato Simo <dato@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
-
- Nov 18, 2010
Iustin Pop authored
Currently, reinstallation of a DRBD instance with the secondary node offline does:

  node1# gnt-instance reinstall -f instance1
  Waiting for job 139053 for instance1...
  Thu Nov 18 01:36:09 2010 - WARNING: Could not prepare block device disk/0 on node node3 (is_primary=False, pass=1): Node is marked offline
  Thu Nov 18 01:36:09 2010 - WARNING: Could not shutdown block device disk/0 on node node3: Node is marked offline
  Job 139053 for instance1 has failed: Failure: command execution error: Disk consistency error

Since this fails anyway, let's check the secondary nodes, thus preventing any modifications to the instance (e.g. OS type change):

  node1# gnt-instance reinstall -f instance1
  Waiting for job 139058 for instance1...
  Job 139058 for instance1 has failed: Failure: prerequisites not met for this operation: error type: wrong_state, error details: Instance secondary node offline, cannot reinstall: node3

The patch needs modifications to the _CheckNodeOnline function, in order to display meaningful messages ("Can't use offline node" would be very confusing for an instance reinstall, since we didn't select a node manually).
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
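A hedged sketch of the added prerequisite check (not the actual cmdlib code; the error text follows the output above):

    class ReinstallError(Exception):
      pass

    def check_secondaries_online(secondary_nodes, offline_nodes):
      """Refuse a reinstall early when a secondary node is offline (sketch)."""
      for node in secondary_nodes:
        if node in offline_nodes:
          raise ReinstallError(
            "Instance secondary node offline, cannot reinstall: %s" % node)

    # Matching the output above: node3 is offline, so this would raise.
    # check_secondaries_online(["node3"], {"node3"})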
-
Iustin Pop authored
I was using the feedback_fn function incorrectly (it doesn't automatically expand the arguments).
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
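A minimal illustration of the point (hypothetical caller and message): feedback_fn expects a single, already-formatted message and does not expand extra arguments itself.

    def report_progress(feedback_fn, step, total):
      # Correct: format first, then pass one string.
      feedback_fn("Step %d/%d done" % (step, total))
      # Wrong (the mistake described above):
      # feedback_fn("Step %d/%d done", step, total)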
-
- Nov 17, 2010
Iustin Pop authored
Since the contents of the dict are validated via ForceDictType, we can simply require that it is a dict here. The previous check was wrong, as it was copied from the HV checks (which also don't verify the leaf dict type).
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
-
- Nov 11, 2010
Iustin Pop authored
And fix an error message.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
-
David Knowles authored
Note: It appears this has been around since the initial checkin of TemporaryReservationManager. I have no idea what this could break, so someone else may want to test this more thoroughly.
Signed-off-by: David Knowles <dknowles@google.com>
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
- Nov 03, 2010
Michael Hanselmann authored
Tests have shown that the changes in commit b8d26c6e don't work as wanted. If any disk wasn't found on the node, all disks located on the same node would show as faulty. The cause was incorrect exception handling on the node. This patch changes the RPC call to return a per-disk success/error status, avoiding the problem.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Luca Bigliardi <shammash@google.com>
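An illustrative sketch of the per-disk result scheme (function names are assumptions): the node-side call reports one (success, payload) pair per disk instead of failing wholesale when a single disk is missing.

    def get_blockdev_statuses(disks, find_device_fn):
      """Return a (success, payload) pair for every disk (sketch only)."""
      result = []
      for disk in disks:
        try:
          result.append((True, find_device_fn(disk)))
        except Exception as err:
          # A missing disk only marks this entry as failed; the other disks
          # on the same node still get their real status.
          result.append((False, str(err)))
      return result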
-
Michael Hanselmann authored
Some of them were forgotten.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
-
- Nov 01, 2010
Guido Trotter authored
We can now change a node's secondary IP.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
-
Guido Trotter authored
This is already disabled for the same type of request a couple of lines above. The new code was introduced in e986f20c but didn't have the disables.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
-
Guido Trotter authored
There is no "private" IP in Ganeti; we only have primary and secondary ones. Whether they are public or private is a per-installation detail.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
-
Guido Trotter authored
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
-
Guido Trotter authored
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
-
Guido Trotter authored
The "I always wanted to do this" commit. Signed-off-by:
Guido Trotter <ultrotter@google.com> Reviewed-by:
Michael Hanselmann <hansmi@google.com>
-
Guido Trotter authored
Changing the volume group is a lot less frequent than acting on a node group. As such we drop the "-g" shortcut and require the long option to be passed. In 2.3 the commands which used to accept the volume group as "-g" won't have any node group option, so no confusion will arise. Later on we may pass "-g" as the initial node group name to gnt-cluster init, although that's not strictly necessary, as modifying it later is always possible.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
-