- Mar 07, 2011
Iustin Pop authored
* devel-2.2:
  Fix LUClusterRepairDiskSizes and rpc result usage
  Fix RPC mismatch in blockdev_getsize[s]

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
-
- Mar 04, 2011
Iustin Pop authored
This LU was introduced before the RPC result conversion from .data to .payload, and it has managed to keep the old-style usage (how? it's the only LU that does so). Fix by changing to payload, and add some extra logging for easier diagnosis.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Stephen Shirley <diamond@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
(cherry picked from commit 043beb38)
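
A minimal sketch of the two conventions (only the .data and .payload attribute names come from the message; the stand-in class and values are invented):

    class FakeRpcResult(object):
        # Simplified stand-in for Ganeti's RPC result wrapper.
        def __init__(self, payload):
            self.data = (True, payload)  # old style: a (status, data) tuple
            self.payload = payload       # new style: the unpacked result

    result = FakeRpcResult(1024)
    size = result.payload  # the fix: use .payload instead of digging into .data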
-
Iustin Pop authored
Commit 92fd2250 added consistency checks in the RPC layer, which broke the call_blockdev_getsizes RPC call (declared with 's' at the end in rpc.py, without 's' in the node daemon). The immediate fix is to correct the RPC function name; the long-term one will be to remove this duplication.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Stephen Shirley <diamond@google.com>
(cherry picked from commit ccfbbd2d)
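
A hypothetical illustration of the kind of consistency check that caught the mismatch (all names and sets below are made up):

    # Every call_* stub declared on the client side must have a matching
    # handler in the node daemon; a name mismatch like the one fixed here
    # trips the check.
    client_calls = set(["blockdev_getsize", "blockdev_getsizes"])  # rpc.py side
    daemon_handlers = set(["blockdev_getsize"])                    # node daemon side

    missing = client_calls - daemon_handlers
    if missing:
        raise AssertionError("RPC calls without handlers: %s" %
                             ", ".join(sorted(missing)))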
-
Iustin Pop authored
PollJob returns the whole op_results, hence a list of opcode results.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
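
A tiny hedged sketch of the caller-side consequence (the function name is made up):

    def FirstOpcodeResult(op_results):
        # PollJob yields one entry per opcode in the job; a single-opcode
        # caller needs element 0 rather than the whole list.
        assert isinstance(op_results, list)
        return op_results[0]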
-
- Feb 03, 2011
Michael Hanselmann authored
The new import/export infrastructure in Ganeti 2.2 and up handles compression differently. It no longer writes compressed files to the destination. Unfortunately changing this behaviour would be non-trivial, so in the meantime setting “compression = none” will hopefully avoid some confusion.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
- Jan 26, 2011
Michael Hanselmann authored
This is analogous to the existing check for a responsive node daemon.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
At least ganeti-confd was not started. It got started a few minutes later by ganeti-watcher. Also move one pylint disable to the effective line.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
The fact that jobs don't necessarily execute in order has been a source of some confusion. Hopefully this update will clarify things.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
Also replace the hardcoded “xenvg” with a constant.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Iustin Pop authored
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
-
Iustin Pop authored
This skips non-vm_capable nodes in the OS diagnose search, since such OSes will not be used on those nodes anyway.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
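
A minimal sketch of the filtering described above (the attribute name follows the message; everything else is assumed):

    def FilterVmCapable(nodes):
        # Drop non-vm_capable nodes before running the OS diagnose.
        return [node for node in nodes if node.vm_capable]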
-
René Nussbaumer authored
Using auto_promote or auto-promote can lead to confusion in the user-facing interfaces. While auto-promote is fine for the CLI, it's not for RAPI, and vice versa. This patch should eliminate this confusion.

Signed-off-by: René Nussbaumer <rn@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Iustin Pop authored
This is a follow-up patch to the one moving GetAllocatable out to module level.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
-
Michael Hanselmann authored
LVM PV storage units would always show as allocatable, even when they weren't. For some reason I have not been able to determine, the function parsing the attributes (“_GetAllocatable”) was not even called and the list opcode simply returned the attribute string as the value (e.g. “a-”). Removing “@staticmethod” did the trick and then I just moved it to module level. A QA test is included.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
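
A minimal sketch of the module-level parser described above (the attribute format follows LVM's pvs output; the exact signature is assumed):

    def GetAllocatable(attr):
        # Parses an LVM PV attribute string such as "a-"; the first
        # character is "a" exactly when the physical volume is allocatable.
        return bool(attr) and attr[0] == "a"

    # "a-" is allocatable, "--" is not.
    assert GetAllocatable("a-") and not GetAllocatable("--")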
-
- Jan 20, 2011
Michael Hanselmann authored
With this patch, the exporting node will retry connecting a few times. The receiving node will make use of the master's increased timeout (see previous patch).

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
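
A rough sketch of such a retry loop (all names, counts, and delays here are assumptions, not values from the patch):

    import socket
    import time

    def ConnectWithRetries(address, port, retries=5, delay=2.0, timeout=60):
        # Try to connect a few times before giving up for good.
        for attempt in range(retries):
            try:
                return socket.create_connection((address, port), timeout)
            except socket.error:
                if attempt == retries - 1:
                    raise
                time.sleep(delay)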
-
Michael Hanselmann authored
It's been shown that 60 seconds may not be enough to establish a connection.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
- Jan 14, 2011
Guido Trotter authored
burnin is a cluster/testing feature, so it makes sense that a hidden OS can be used for it.

Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
- Jan 12, 2011
Stephen Shirley authored
Also change language slightly for preferred groups to look better now that it's repeated.

Signed-off-by: Stephen Shirley <diamond@google.com>
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
- Jan 07, 2011
Michael Hanselmann authored
The data was already there, but not shown.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Adeodato Simo <dato@google.com>
-
- Jan 06, 2011
Michael Hanselmann authored
If the SSH command fails, this will give a more detailed error message than before.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
The source cluster has to shut down an instance before it can be exported. Doing so can take a while, but the default connection timeout is only 60 seconds. Adding the shutdown timeout on the receiving cluster should help.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
(cherry picked from commit dae91d02)
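
The implied arithmetic is simple; a one-line sketch with assumed names:

    def EffectiveConnectTimeout(shutdown_timeout, base_timeout=60):
        # The receiving cluster waits out the source instance's shutdown
        # on top of the base network timeout.
        return base_timeout + shutdown_timeout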
-
Michael Hanselmann authored
When the source cluster took too long to create a snapshot, the destination would time out. Unfortunately no good error message was written unless debug logging was enabled, not even to the log file. This patch improves that. Another patch, to be backported from master, will hopefully avoid this situation completely.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
-
Michael Hanselmann authored
- Check hostname and abort if it doesn't match the contents of “ssconf_master_node”; can be overridden using the “--ignore-hostname” parameter.
- Clarify the confirmation question and don't mention instances anymore.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
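
A rough sketch of such a hostname safety check (the ssconf path is Ganeti's usual location and the option name comes from the message; everything else is assumed):

    import socket

    def CheckMasterHostname(ignore_hostname=False):
        # Abort unless this node's name matches the recorded master node.
        master = open("/var/lib/ganeti/ssconf_master_node").read().strip()
        if not ignore_hostname and socket.getfqdn() != master:
            raise SystemExit("This node (%s) is not the master (%s); use"
                             " --ignore-hostname to override" %
                             (socket.getfqdn(), master))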
-
Michael Hanselmann authored
No need to copy this snippet around; “make” can work harder for us.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
This will allow distributions to install the file as text documentation.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
- Jan 05, 2011
Michael Hanselmann authored
This patch formats the upgrade notes currently in the wiki [1] as reST and adds them to the documentation.

[1] http://code.google.com/p/ganeti/wiki/UpgradeNotes

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
- Dec 31, 2010
Michael Hanselmann authored
s/os-name/os-type/. This was reported in issue 133.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
- Dec 29, 2010
Michael Hanselmann authored
Since the recent change to leave jobs in the “waitlock” status (commit 5fd6b694), cancelling a job while it's back in the queue would break. This patch handles these cases and adds a unittest.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
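
A rough sketch of handling both states (the status names come from these commit messages; the methods and strings are assumed):

    def CancelJob(job):
        if job.status == "queued":
            # Never started; can be finalized as canceled right away.
            job.Finalize("canceled")
        elif job.status == "waitlock":
            # Back in the queue waiting for locks; flag it so the worker
            # notices and stops before executing the opcode.
            job.status = "canceling"
        else:
            raise Exception("Job is already running and cannot be canceled")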
-
- Dec 20, 2010
Michael Hanselmann authored
Point out that jobs already submitted continue to run.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
If the socket can't be read in time, it raises “socket.timeout”, for which there is special handling code. Unfortunately the except blocks were in the wrong order, and “socket.error” caught it first.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
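
Since socket.timeout is a subclass of socket.error, clause order matters; a minimal sketch of the fixed ordering (the function and the error handling are invented):

    import socket

    def ReadResponse(conn):
        try:
            return conn.recv(4096)
        except socket.timeout:        # must come first: socket.timeout is a
            return None               # subclass of socket.error, so a preceding
        except socket.error as err:   # socket.error clause would swallow it
            raise RuntimeError("connection error: %s" % err)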
-
Michael Hanselmann authored
* stable-2.3:
  Prepare 2.3.1 release
  Fix disk status verification in LUClusterVerify

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
- Dec 17, 2010
Michael Hanselmann authored
“gnt-cluster verify” looks at some per-instance information as well, so it should be run in the QA tests for each instance type.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
- Dec 16, 2010
Michael Hanselmann authored
The “ensure-dirs” script as included in Ganeti 2.3 is very slow when working with big queues requiring a change of permissions on many or all files.

    $ find /var/lib/ganeti/queue/ | wc -l
    52354

Before this change:

    $ time /usr/local/lib/ganeti/ensure-dirs -f
    real    16m4.739s

While not addressed in this patch, I'd like to record the overall inefficiency of the “ensure-dirs” script, even after this change:

    $ time /usr/local/lib/ganeti/ensure-dirs -f
    real    5m57.362s
    […]
    $ strace -e clone,execve -f -c /usr/local/lib/ganeti/ensure-dirs -f
    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
     50.08    5.147090          49    104774           clone
     49.92    5.131094          49    104739           execve

More changes will be needed. Just for comparison, a small Python snippet changing permissions on all files (“ensure-dirs” changes the owner too):

    $ time python -c 'import os; from ganeti import utils; [os.chmod(i, 0644) for i in utils.ListVisibleFiles("/var/lib/ganeti/queue/archive/big")]'
    real    0m0.605s
    […]

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
- Dec 15, 2010
Adeodato Simo authored
`gnt-cluster verify` was failing with KeyError if there was any diskless instance in the cluster. This was because _CollectDiskInfo() was not including these instances in the returned dictionary, but they were expected to be present in LUVerifyCluster.Exec(). With this commit, we ensure that the dictionary returned by _CollectDiskInfo includes entries for diskless instances as well.

Signed-off-by: Adeodato Simo <dato@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
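
A minimal sketch of the shape of the fix (the function name comes from the message; the data layout is assumed):

    def _CollectDiskInfo(instance_names, node_disks):
        # Start with an (empty) entry for every instance, so diskless
        # instances are present in the result and later lookups in
        # LUVerifyCluster.Exec() cannot raise KeyError.
        info = dict((name, []) for name in instance_names)
        for name, disks in node_disks.items():
            info[name].extend(disks)
        return info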
-
Michael Hanselmann authored
Iustin Pop reported that a job's file is updated many times while it waits for locks held by other thread(s). After an investigation it was concluded that the reason was a design decision for job priorities to return jobs to the “queued” status if they couldn't acquire all locks. Changing a job's status or priority requires an update to permanent storage. In a high-level view this is what happens:

1. Mark as waitlock
2. Write to disk as permanent storage (jobs left in this state by a crashing master daemon are resumed on restart)
3. Wait for lock (assume lock is held by another thread)
4. Mark as queued
5. Write to disk again
6. Return to workerpool

Another option originally discussed was to leave the job in the “waitlock” status. Ignoring priority changes, this is what would happen:

1. If not in waitlock
   1.1. Assert state == queued
   1.2. Mark as waitlock
   1.3. Set start_timestamp
   1.4. Write to disk as permanent storage
3. Wait for locks (assume lock is held by another thread)
4. Leave in waitlock
5. Return to workerpool

Now let's assume the lock is released by the other thread:

[…]
3. Wait for locks and get them
4. Assert state == waitlock
5. Set state to running
6. Set exec_timestamp
7. Write to disk

As this change reduces the number of writes from two per lock acquire attempt to two per opcode and one per priority increase (as happens after 24 acquire attempts (see mcpu._CalculateLockAttemptTimeouts) until the highest priority is reached), here's the patch to implement it. Unittests are updated.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
- Verify job file updates
- Ensure queue lock is released while executing opcode

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-