- 19 Mar, 2012 1 commit
-
-
Iustin Pop authored
If a specific list of groups has been requested, then the code used that directly, without transforming it to a (frozen)set first, which results in:

  unsupported operand type(s) for &: 'list' and 'frozenset'

The trivial fix is to do that conversion in the 'then' branch.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
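A minimal, self-contained reproduction of the failure and the fix (plain Python; the variable names are made up):

    # Intersecting a plain list with a frozenset raises TypeError.
    existing_groups = frozenset(["default", "rack1"])
    requested = ["rack1", "rack2"]               # user-supplied list

    try:
        common = requested & existing_groups     # TypeError: list & frozenset
    except TypeError as err:
        print(err)

    # Fix: convert the requested list to a (frozen)set before intersecting.
    common = frozenset(requested) & existing_groups
    print(sorted(common))                        # ['rack1']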
-
- 31 Jan, 2012 1 commit
-
-
Michael Hanselmann authored
Just using ht.TListOf as a type check doesn't work correctly: the function must be called with the expected item type. In this specific case TListOf was always called with the filter as a value, and the result of that call evaluated as true. Since filters can be quite complex there's no check for them yet, and therefore just “TList” is used.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
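An illustrative sketch of the combinator pattern involved, with hypothetical helpers rather than Ganeti's actual ht module:

    def TList(value):
        """Simple check: is the value a list at all?"""
        return isinstance(value, list)

    def TListOf(item_check):
        """Factory: must be *called* with an item check to get a validator."""
        def check(value):
            return isinstance(value, list) and all(item_check(v) for v in value)
        return check

    def TInt(value):
        return isinstance(value, int)

    # Wrong: using the factory itself as the check.  Calling it with the
    # value just returns a new function, which is always truthy.
    assert TListOf("not even a list")

    # Right: parametrise it with the item type ...
    assert TListOf(TInt)([1, 2, 3])
    # ... or, when items are too complex to describe, fall back to TList.
    assert TList([{"op": "=", "field": "name"}])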
-
- 26 Jan, 2012 1 commit
-
-
Iustin Pop authored
Furthermore, correct the --help display on evacuate.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
-
- 25 Jan, 2012 1 commit
-
-
Michael Hanselmann authored
This patch attempts to fix a number of issues with “gnt-cluster verify” in the presence of multiple node groups and DRBD8 instances split over nodes in more than one group:

- Look up instances in a group only by their primary node (otherwise split instances would be considered when verifying any of their nodes' groups)
- When gathering additional nodes for LV checks, compare the groups of the instance's nodes with the currently verified group instead of comparing against the primary node's group
- Exclude nodes in other groups when calculating N+1 errors and checking logical volumes

Not directly related, but a small error text is also clarified.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
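A toy illustration of the "look up instances by primary node only" rule, using made-up data structures rather than Ganeti's config objects:

    # node name -> group, instance name -> (primary node, secondary nodes)
    node_group = {"n1": "g1", "n2": "g1", "n3": "g2"}
    instances = {"split": ("n1", ["n3"]), "local": ("n2", ["n1"])}

    def instances_in_group(group):
        # An instance belongs to the group being verified only via its
        # primary node; secondaries in other groups must not pull it in.
        return [name for name, (pri, _sec) in instances.items()
                if node_group[pri] == group]

    assert sorted(instances_in_group("g1")) == ["local", "split"]
    assert instances_in_group("g2") == []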
-
- 20 Jan, 2012 1 commit
-
-
Guido Trotter authored
Cleanup just updates the config with the correct location of the instance, or informs of its down status, but never starts it. As such there's no point in checking for enough free memory. Actually this check could prevent a perfectly safe cleanup operation if a node is busy.

Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
-
- 06 Jan, 2012 1 commit
-
-
Guido Trotter authored
This of course was working for all the RCs, but broke with 1.0 itself. In addition:

- split running "kvm --version" from parsing its output
- unittest the parsing against various known --help outputs
- updated NEWS file
- happy 2012 wishes
- the hope to finish this patch before it's time to say happy Easter :)

Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
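A self-contained sketch of why the run/parse split makes the code unit-testable; the regex is a guess at common KVM/QEMU version banners, not Ganeti's actual parser:

    import re

    # Matches banners like "QEMU emulator version 1.0" as well as
    # "QEMU PC emulator version 0.14.1 (qemu-kvm-0.14.1), ...".
    _VERSION_RE = re.compile(r"version\s+(\d+)\.(\d+)(?:\.(\d+))?")

    def parse_kvm_version(banner):
        """Pure function: easy to test against captured --version outputs."""
        match = _VERSION_RE.search(banner)
        if not match:
            raise ValueError("Unable to parse KVM version: %r" % banner)
        major, minor, micro = match.groups()
        # "1.0" has no third component, which is exactly the case that broke.
        return int(major), int(minor), int(micro or 0)

    assert parse_kvm_version("QEMU emulator version 1.0, Copyright ...") == (1, 0, 0)
    assert parse_kvm_version("QEMU PC emulator version 0.14.1 (qemu-kvm)") == (0, 14, 1)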
-
- 21 Dec, 2011 2 commits
-
-
Michael Hanselmann authored
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
When an opcode is about to be processed, its dependencies are evaluated using “_JobDependencyManager.CheckAndRegister”. Due to its nature that function requires a lock on the manager's internal structures. All of this happens while the job queue lock is held in shared mode (required for the job processor). When a job has been processed, any pending dependencies are re-added to the job workerpool. Before this patch that would require the manager's lock and then, for adding the jobs, the job queue lock. Since this is in reverse order, it will lead to deadlocks.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
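A generic illustration of the lock-order inversion described above, plus one common way out of it (plain Python threading, not the actual job-queue code):

    import threading

    queue_lock = threading.Lock()      # stands in for the job queue lock
    manager_lock = threading.Lock()    # stands in for the dependency manager lock

    def check_and_register(job):
        # Path 1: queue lock (outer) -> manager lock (inner).
        with queue_lock:
            with manager_lock:
                return "registered %s" % job

    def notify_waiters_inverted(pending_jobs):
        # Path 2, inverted order: manager lock -> queue lock.  Running
        # concurrently with path 1, this ordering can deadlock.
        with manager_lock:
            with queue_lock:
                return list(pending_jobs)

    def notify_waiters_safe(pending_jobs):
        # One common fix: collect the waiters under the inner lock only and
        # release it before touching the queue, so both paths agree on order.
        with manager_lock:
            snapshot = list(pending_jobs)
        with queue_lock:
            return snapshot

    print(check_and_register("job/1"))
    print(notify_waiters_safe(["job/2", "job/3"]))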
-
- 30 Nov, 2011 1 commit
-
-
Nikos Skalkotos authored
Fix bug affecting command line options of "keyval" type. Although escaping commas with \ is supported, it is not applied to the input recursively.

Signed-off-by: Nikos Skalkotos <skalkoto@grnet.gr>
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
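A stand-alone sketch of splitting a keyval string on unescaped commas (illustrative only; Ganeti's real option parsing may differ):

    def split_unescaped(text, sep=",", esc="\\"):
        """Split on `sep` while honouring `esc`, so "a\,b,c" -> ["a,b", "c"]."""
        parts, current, escaped = [], [], False
        for char in text:
            if escaped:
                current.append(char)
                escaped = False
            elif char == esc:
                escaped = True
            elif char == sep:
                parts.append("".join(current))
                current = []
            else:
                current.append(char)
        parts.append("".join(current))
        return parts

    assert split_unescaped(r"size=10G,vg=x\,y,mode=rw") == ["size=10G", "vg=x,y", "mode=rw"]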
-
- 24 Nov, 2011 5 commits
-
-
Michael Hanselmann authored
The parameter is called “mods”, not “modes”.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Andrea Spadaccini <spadaccio@google.com>
(cherry picked from commit 1730d4a1)
-
Michael Hanselmann authored
The parameter is called “mods”, not “modes”.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Andrea Spadaccini <spadaccio@google.com>
-
Michael Hanselmann authored
Note: This bug only manifests itself in Ganeti 2.5, but since the problematic code also exists in 2.4, I decided to fix it there.

If a node was assigned to a new group using “gnt-group assign-nodes” the node object's group would be changed, but not the duplicate member list in the group object. The latter is an optimization to require fewer locks for other operations. The per-group member list is only kept in memory and not written to disk.

Ganeti 2.5 starts to make use of the data kept in the per-group member list and consequently fails when it is out of date. The following commands can be used to reproduce the issue in 2.5 (in 2.4 the issue was confirmed using additional logging):

  $ gnt-group add foo
  $ gnt-group assign-nodes foo $(gnt-node list --no-header -o name)
  $ gnt-cluster verify  # Fails with KeyError

This patch moves the code modifying node and group objects into “config.ConfigWriter” to do the complete operation under the config lock, and also to avoid making use of side-effects of modifying objects without calling “ConfigWriter.Update”. A unittest is included.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
(cherry picked from commit 218f4c3d)
-
Michael Hanselmann authored
Note: This bug only manifests itself in Ganeti 2.5, but since the problematic code also exists in 2.4, I decided to fix it there.

If a node was assigned to a new group using “gnt-group assign-nodes” the node object's group would be changed, but not the duplicate member list in the group object. The latter is an optimization to require fewer locks for other operations. The per-group member list is only kept in memory and not written to disk.

Ganeti 2.5 starts to make use of the data kept in the per-group member list and consequently fails when it is out of date. The following commands can be used to reproduce the issue in 2.5 (in 2.4 the issue was confirmed using additional logging):

  $ gnt-group add foo
  $ gnt-group assign-nodes foo $(gnt-node list --no-header -o name)
  $ gnt-cluster verify  # Fails with KeyError

This patch moves the code modifying node and group objects into “config.ConfigWriter” to do the complete operation under the config lock, and also to avoid making use of side-effects of modifying objects without calling “ConfigWriter.Update”. A unittest is included.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
Commit c50452c3 added an exception when all instances should be evacuated off a node, but did so in a way which made pylint complain about unreachable code.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
- 23 Nov, 2011 3 commits
-
-
Michael Hanselmann authored
There is a design issue in the iallocator interface which prevents us from doing this.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
-
Michael Hanselmann authored
When evacuating a node, only an assertion without informative text was used to check if the necessary node locks had been acquired. This was on top of evaluating the list of nodes without having a node group lock, so this was changed as well. Also update some exception messages to include “retry the operation”.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
ConfigWriter.GetAllInstancesInfo returns a dictionary, not a list. Removing a node would fail with “too many values to unpack”.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
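A minimal reproduction of the underlying Python mistake, independent of Ganeti:

    # GetAllInstancesInfo-style return value: a dict of name -> object.
    all_instances = {"inst1": object(), "inst2": object(), "inst3": object()}

    # Wrong: treating the dict as a list of pairs.  Iterating a dict yields
    # only the keys, and unpacking a key string into two names fails:
    try:
        for name, inst in all_instances:
            pass
    except ValueError as err:
        print(err)                       # "too many values to unpack"

    # Right: iterate over the items explicitly.
    for name, inst in all_instances.items():
        pass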
-
- 15 Nov, 2011 1 commit
-
-
Michael Hanselmann authored
- Commit b7a1c816 changed the LU to generate jobs
- Mention documented results in NEWS

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
- 14 Nov, 2011 1 commit
-
-
Vangelis Koukis authored
Ensure ports previously allocated by calling ConfigWriter's AllocatePort() are returned to the pool of free ports when no longer needed:

* Return the network_port of an instance when it is removed
* Return the port used by a DRBD-based disk when it is removed

Signed-off-by: Vangelis Koukis <vkoukis@grnet.gr>
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
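A toy version of the allocate/release pattern described above (a hypothetical pool class, not ConfigWriter's real implementation):

    class PortPool(object):
        def __init__(self, first, last):
            self._free = set(range(first, last + 1))

        def allocate(self):
            if not self._free:
                raise RuntimeError("out of ports")
            return self._free.pop()

        def release(self, port):
            # The fix boils down to making sure this is called when the
            # instance or DRBD disk owning the port goes away.
            self._free.add(port)

    pool = PortPool(11000, 11002)
    port = pool.allocate()
    # ... instance or disk removed ...
    pool.release(port)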
-
- 08 Nov, 2011 1 commit
-
-
Michael Hanselmann authored
If an instance can't be evacuated, only a message would be printed. With this change the operation always aborts. Newly added unittests check for this behaviour.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
- 04 Nov, 2011 2 commits
-
-
Michael Hanselmann authored
… instead of object with name.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
-
Michael Hanselmann authored
Instances are modified if their disk size doesn't match.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
- 27 Oct, 2011 1 commit
-
-
Michael Hanselmann authored
If cmdlib.LUNodeMigrate was called for a node without primary instances it would try to submit an empty list of jobs. This was never visible via CLI as there we check the list of primary instances first.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
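The general shape of a guard against submitting an empty job list (hypothetical helper names; the actual fix in the LU may differ):

    def migrate_node_jobs(primary_instances):
        # One job per primary instance; with no instances there is nothing
        # to submit, and submitting [] is what used to break.
        jobs = [[("migrate", name)] for name in primary_instances]
        if not jobs:
            return None        # nothing to do, do not submit an empty list
        return jobs

    assert migrate_node_jobs([]) is None
    assert migrate_node_jobs(["inst1"]) == [[("migrate", "inst1")]]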
-
- 20 Oct, 2011 1 commit
-
-
René Nussbaumer authored
On a master failover some of the archive dirs might have wrong permissions in the non-root model. This is due to noded still running as root and the job queue being synced that way. This patch fixes this behaviour by setting the permissions accordingly.

Signed-off-by: René Nussbaumer <rn@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
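A generic sketch of re-setting ownership and mode on an archive directory tree; the path, uid/gid and mode below are placeholders, not Ganeti's actual values:

    import os
    import stat

    ARCHIVE_DIR = "/var/lib/example/queue/archive"           # placeholder path
    WANTED_UID, WANTED_GID = 102, 103                         # placeholder ids
    DIR_MODE = stat.S_IRWXU | stat.S_IRGRP | stat.S_IXGRP     # i.e. 0750

    def fix_archive_perms():
        # Typically run as root during master failover, before the non-root
        # daemons take over the queue.
        for root, _dirs, files in os.walk(ARCHIVE_DIR):
            os.chown(root, WANTED_UID, WANTED_GID)
            os.chmod(root, DIR_MODE)
            for name in files:
                os.chown(os.path.join(root, name), WANTED_UID, WANTED_GID)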
-
- 19 Oct, 2011 2 commits
-
-
René Nussbaumer authored
Signed-off-by: René Nussbaumer <rn@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
If an instance actually had a missing disk, the type check would fail.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
- 18 Oct, 2011 1 commit
-
-
Michael Hanselmann authored
Commit e1f23243 changed the LU and opcode for node evacuation to receive a “mode” parameter (among other things). Commit de40437a changed the RAPI code accordingly, but did so for an earlier version of the first patch. Obviously this couldn't work, so here's the fix.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
-
- 17 Oct, 2011 1 commit
-
-
Michael Hanselmann authored
Commit d1c172de inadvertently changed the “/2/instances/[instance_name]/replace-disks” resource to use body parameters. There were no QA tests and the issue wasn't noticed. This patch re-introduces support for query parameters and adds a QA test.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Andrea Spadaccini <spadaccio@google.com>
-
- 12 Oct, 2011 1 commit
-
-
Michael Hanselmann authored
We noticed that “ganeti-masterd” can use large amounts of memory, especially on large clusters. Measurements showed a single PycURL client using about 500 kB of heap memory (the actual usage depends on versions, build options and settings).

The RPC client uses a per-thread HTTP client pool with one client per node. At this time there are 41 non-main threads (25 for the job queue and 16 for client requests). This means the HTTP client pools use a lot of memory (ca. 200 MB for 10 nodes, ca. 1 GB for 50 nodes).

This patch disables the per-thread HTTP client pool. No cleanup of unused code is done. That will be done in the master branch only.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
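As a quick sanity check, the quoted figures follow from the numbers in the message (41 threads, one client per node, roughly 500 kB per client):

    threads = 25 + 16             # job queue workers + client request workers
    per_client_kb = 500           # approximate heap per PycURL client

    for nodes in (10, 50):
        total_mb = threads * nodes * per_client_kb / 1024.0
        print("%2d nodes: ~%d MB" % (nodes, total_mb))
    # 10 nodes: ~200 MB, 50 nodes: ~1000 MB, matching the figures above.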
-
- 04 Oct, 2011 1 commit
-
-
Michael Hanselmann authored
If a cluster has any non-master-candidate nodes, those don't contain all files (e.g. config.data). With commit aef59ae7 (March 31st, 2011) the logic was changed and subsequently verifying a cluster with non-mc nodes would complain. This patch fixes this issue by changing the algorithm. It also adds an additional check for files which shouldn't exist on a machine. A newly added unittest is included.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
- 03 Oct, 2011 2 commits
-
-
Michael Hanselmann authored
This reverts commit 34aa8b7c. Writing error messages to stderr would also include backtraces, something we tried to avoid in the past.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
Michael Hanselmann authored
Commit 64c7b383 changed the RPC call for verifying SSH connections. Unfortunately this case in adding nodes was missed.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
- 30 Sep, 2011 4 commits
-
-
Michael Hanselmann authored
When verifying a group the code would always check SSH to all nodes in the same group, as well as the first node for every other group. On big clusters this can cause issues since many nodes will try to connect to the first node of another group at the same time. This patch changes the algorithm to choose a different node every time. A unittest for the selection algorithm is included.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
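One possible way to spread the SSH checks over a neighbouring group's nodes, shown only as an illustration; the commit's actual selection algorithm may differ:

    import hashlib

    def pick_peer(own_group, other_group, other_group_nodes):
        """Pick one node of another group to SSH-check, spreading the load.

        Different verifying groups hash to different offsets, so they do not
        all hammer the same "first" node of their neighbours.
        """
        nodes = sorted(other_group_nodes)
        key = ("%s-%s" % (own_group, other_group)).encode("utf-8")
        index = int(hashlib.md5(key).hexdigest(), 16) % len(nodes)
        return nodes[index]

    print(pick_peer("group1", "group2", ["n4", "n5", "n6"]))
    print(pick_peer("group3", "group2", ["n4", "n5", "n6"]))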
-
Iustin Pop authored
In the case we submit many pending jobs (> 100) to the masterd, the JobExecutor 'spams' the master daemon with status requests for the status of all the jobs, even though in the end it will only choose a single job for polling. This is very sub-optimal, because when the master is busy processing small/fast jobs, this query forces reading all the jobs from disk.

Restricting the 'window' of jobs that we query from the entire set to a smaller subset makes a huge difference (masterd only, 0s delay jobs, all jobs to tmpfs thus no I/O involved):

- submitting/waiting for 500 jobs: before ~21 s, after ~5 s
- submitting/waiting for 1K jobs: before ~76 s, after ~8 s

This is with a batch of 25 jobs. With a batch of 50 jobs, it goes from 8 s to 12 s. I think that choosing the 'best' job for nice output only matters with a small number of jobs, and that for more than that people will not actually watch the jobs. So changing from 'perfect job' to 'best job in the first 25' should be OK.

Note that most jobs won't execute as fast as 0 delay, but this is still a good improvement.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
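A schematic of the "query only a window of jobs" idea (a generic polling loop with placeholder callables, not the actual JobExecutor code):

    BATCH_SIZE = 25

    def wait_for_jobs(job_ids, query_status, wait_for):
        """Poll jobs while only ever querying BATCH_SIZE of them at once.

        query_status and wait_for stand in for the LUXI calls and are
        supplied by the caller in this sketch.
        """
        pending = list(job_ids)
        results = {}
        while pending:
            window = pending[:BATCH_SIZE]        # query only this slice ...
            statuses = query_status(window)      # ... not all pending jobs
            # choose the "best" job within the window only (here: any running
            # one, otherwise simply the first) and block on it
            best = next((j for j in window if statuses.get(j) == "running"),
                        window[0])
            results[best] = wait_for(best)
            pending.remove(best)
        return results

    # Example with trivial stand-ins for the two calls:
    print(wait_for_jobs(range(1, 4),
                        lambda ids: {i: "running" for i in ids},
                        lambda i: "success"))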
-
Michael Hanselmann authored
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
-
Michael Hanselmann authored
When “gnt-cluster copyfile” failed it would only print “Copy of file … to node … failed”. A detailed message is written using logging.error. Writing error messages to stderr can be helpful in figuring out what went wrong (the messages also go to the log file, but not everyone might know about it).

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
-
- 28 Sep, 2011 2 commits
-
-
Iustin Pop authored
The change to enforce boolean results for the cluster verify group opcode missed the HooksCallBack, which uses a very ugly 1/0 logic. Furthermore, the logic is wrong, since it unconditionally resets the verify result to true. The patch is changed to simply treat hook failures as failures, and do nothing for offline/nodes.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
-
Iustin Pop authored
This reverts to the old behaviour in Ganeti 2.4 and before.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
-
- 22 Sep, 2011 2 commits
-
-
Michael Hanselmann authored
Commit 7fa310f6 (April 1st, 2011) converted the RAPI resource for shutting down an instance to FillOpCode. Unfortunately it missed the fact that the shutdown resource gets its parameters as query arguments.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
-
Michael Hanselmann authored
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
(cherry picked from commit c6e1a3ee)
Signed-off-by: Michael Hanselmann <hansmi@google.com>
-