- 24 Apr, 2014 18 commits
-
-
Petr Pudlak authored
.. so that they are displayed properly in logs.
Signed-off-by: Petr Pudlak <pudlak@google.com>
Reviewed-by: Klaus Aehlig <aehlig@google.com>
-
Petr Pudlak authored
Otherwise a job that is being started is falsely reported as dead.
Signed-off-by: Petr Pudlak <pudlak@google.com>
Reviewed-by: Klaus Aehlig <aehlig@google.com>
-
Petr Pudlak authored
.. and add a reason trail message. Otherwise failed jobs hang, never finishing.
Signed-off-by: Petr Pudlak <pudlak@google.com>
Reviewed-by: Klaus Aehlig <aehlig@google.com>
-
Klaus Aehlig authored
In this example, the cluster has two nodes and four instances, two with their primary on each node. The scarcest resource on this cluster is (virtual) CPUs, and the second node has three times the CPU speed of the first. So distributing the instances 1:3 (one on the slow node, three on the fast one) gives a more balanced cluster.
Signed-off-by: Klaus Aehlig <aehlig@google.com>
Reviewed-by: Petr Pudlak <pudlak@google.com>
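A worked version of this example as a minimal Haskell sketch, with hypothetical concrete numbers (two 4-pCPU nodes and four 4-vCPU instances; these figures are assumptions, not from the commit):

    main :: IO ()
    main = do
      -- 2/2 split: vCPUs per speed-weighted pCPU on node 1 and node 2
      print (8 / (4 * 1), 8 / (4 * 3))    -- (2.0, ~0.67): imbalanced
      -- 1/3 split: one instance on the slow node, three on the fast one
      print (4 / (4 * 1), 12 / (4 * 3))   -- (1.0, 1.0): balanced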
-
Klaus Aehlig authored
...as described in doc/design-cpu-speed.rst.
Signed-off-by: Klaus Aehlig <aehlig@google.com>
Reviewed-by: Petr Pudlak <pudlak@google.com>
-
Klaus Aehlig authored
Add a derived parameter for nodes, providing the ratio of virtual CPUs per CPU-speed-weighted physical CPU.
Signed-off-by: Klaus Aehlig <aehlig@google.com>
Reviewed-by: Petr Pudlak <pudlak@google.com>
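A minimal sketch of such a derived parameter; the record and field names are hypothetical stand-ins for the real node representation:

    -- Hypothetical node representation (illustrative field names only).
    data Node = Node { vCpus :: Int, pCpus :: Int, cpuSpeed :: Double }

    -- Virtual CPUs per CPU-speed-weighted physical CPU.
    pCpuRatio :: Node -> Double
    pCpuRatio n = fromIntegral (vCpus n) / (fromIntegral (pCpus n) * cpuSpeed n)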
-
Klaus Aehlig authored
Make the htools luxi backend also query for cpu_speed and take the result into account.
Signed-off-by: Klaus Aehlig <aehlig@google.com>
Reviewed-by: Petr Pudlak <pudlak@google.com>
-
Klaus Aehlig authored
Extend the text format by an optional column for each node containing the relative CPU speed, if provided.
Signed-off-by: Klaus Aehlig <aehlig@google.com>
Reviewed-by: Petr Pudlak <pudlak@google.com>
-
Klaus Aehlig authored
Add a function on nodes modifying the CPU speed parameter.
Signed-off-by: Klaus Aehlig <aehlig@google.com>
Reviewed-by: Petr Pudlak <pudlak@google.com>
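On the hypothetical Node record sketched above, such a modifier is a one-line record update:

    setCpuSpeed :: Double -> Node -> Node
    setCpuSpeed s n = n { cpuSpeed = s }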
-
Klaus Aehlig authored
Add an additional parameter to the representation of a node for the relative CPU speed, initially set to 1.
Signed-off-by: Klaus Aehlig <aehlig@google.com>
Reviewed-by: Petr Pudlak <pudlak@google.com>
-
Klaus Aehlig authored
In other words, remove "cpu_speed" from all "nodeparams" where it is present, be it cluster, group, or node. Note that upgrading is no problem, as the default value will be used implicitly.
Signed-off-by: Klaus Aehlig <aehlig@google.com>
Reviewed-by: Petr Pudlak <pudlak@google.com>
-
Klaus Aehlig authored
This parameter will describe the speed of the CPU relative to the speed of a "normal" node in this node group.
Signed-off-by: Klaus Aehlig <aehlig@google.com>
Reviewed-by: Petr Pudlak <pudlak@google.com>
-
Klaus Aehlig authored
...so that floating-point values no longer have to be declared as VTypeInt, relying on the sloppiness of the JSON specification in not distinguishing between integers and floating-point numbers.
Signed-off-by: Klaus Aehlig <aehlig@google.com>
Reviewed-by: Petr Pudlak <pudlak@google.com>
-
Klaus Aehlig authored
This document really only talks about CPU speed.
Signed-off-by: Klaus Aehlig <aehlig@google.com>
Reviewed-by: Petr Pudlak <pudlak@google.com>
-
Petr Pudlak authored
In this case, the call trying to acquire a shared lock always succeeds, because the daemon itself already holds an exclusive lock; the check therefore falsely reports that the job has died.
Signed-off-by: Petr Pudlak <pudlak@google.com>
Reviewed-by: Klaus Aehlig <aehlig@google.com>
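The underlying pitfall: POSIX fcntl locks never conflict within a single process, so a probe made by the lock holder itself always succeeds. A minimal sketch (made-up lock path, classic four-argument openFd of the unix package):

    import System.IO (SeekMode (AbsoluteSeek))
    import System.Posix.IO

    main :: IO ()
    main = do
      fd <- openFd "/tmp/job.lock" ReadWrite (Just 0o600) defaultFileFlags
      setLock fd (WriteLock, AbsoluteSeek, 0, 0)   -- take an exclusive lock
      -- Probe for a conflicting lock, as a death check would do.
      probe <- getLock fd (ReadLock, AbsoluteSeek, 0, 0)
      putStrLn $ case probe of
        Nothing -> "no conflict seen: the job would be reported dead"
        Just _  -> "conflicting lock detected"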
-
Petr Pudlak authored
In particular, distinguish the case when a job could not be cancelled from the case when the job has already finished.
Signed-off-by: Petr Pudlak <pudlak@google.com>
Reviewed-by: Klaus Aehlig <aehlig@google.com>
-
Petr Pudlak authored
Signed-off-by: Petr Pudlak <pudlak@google.com>
Reviewed-by: Klaus Aehlig <aehlig@google.com>
-
Petr Pudlak authored
.. because modifying the queue inside the handler can have unexpected consequences. Since Python 2 has no nice way to modify a variable of an enclosing scope from an inner function, we have to use a single-element list as a mutable wrapper. (Python 3 has the "nonlocal" keyword for this.)
Signed-off-by: Petr Pudlak <pudlak@google.com>
Reviewed-by: Klaus Aehlig <aehlig@google.com>
-
- 22 Apr, 2014 6 commits
-
-
Klaus Aehlig authored
When failing a job, add an entry to the reason trail, indicating what made the job fail (e.g., failed to fork or detected job death).
Signed-off-by: Klaus Aehlig <aehlig@google.com>
Reviewed-by: Petr Pudlak <pudlak@google.com>
-
Klaus Aehlig authored
...to simplify manipulating them.
Signed-off-by: Klaus Aehlig <aehlig@google.com>
Reviewed-by: Petr Pudlak <pudlak@google.com>
-
Klaus Aehlig authored
...to be able to operate on the MetaOpCode that is behind an InputOpCode (if we're in the right component of the sum).
Signed-off-by: Klaus Aehlig <aehlig@google.com>
Reviewed-by: Petr Pudlak <pudlak@google.com>
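A hedged sketch of such an accessor, with simplified stand-ins for the real types (an InputOpCode being a sum of a validated opcode and raw, unparseable input):

    {-# LANGUAGE RankNTypes #-}
    import Data.Functor.Identity (Identity (..))

    type Traversal' s a = forall f. Applicative f => (a -> f a) -> s -> f s

    data MetaOpCode  = MetaOpCode { metaComment :: String } deriving Show
    data InputOpCode = ValidOpCode MetaOpCode | InvalidInput String
      deriving Show

    -- Focus on the MetaOpCode, if we're in that component of the sum.
    validOpCodeL :: Traversal' InputOpCode MetaOpCode
    validOpCodeL f (ValidOpCode m) = ValidOpCode <$> f m
    validOpCodeL _ other           = pure other

    -- Apply a pure modification through the traversal.
    over :: Traversal' s a -> (a -> a) -> s -> s
    over l g = runIdentity . l (Identity . g)

Composed with further lenses, such accessors let a manipulation reach deep inside a queued job in a single expression, which is what the following commits build on.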
-
Klaus Aehlig authored
...so that manipulations deep within such an object become simpler.
Signed-off-by: Klaus Aehlig <aehlig@google.com>
Reviewed-by: Petr Pudlak <pudlak@google.com>
-
Klaus Aehlig authored
Move all the definitions of objects to a separate file. In this way, the lens module for JQueue can use these objects, while JQueue can use the lenses. For use outside, we re-export the objects.
Signed-off-by: Klaus Aehlig <aehlig@google.com>
Reviewed-by: Petr Pudlak <pudlak@google.com>
-
Klaus Aehlig authored
Signed-off-by: Klaus Aehlig <aehlig@google.com>
Reviewed-by: Petr Pudlak <pudlak@google.com>
-
- 17 Apr, 2014 16 commits
-
-
Petr Pudlak authored
.. and get rid of an unnecessary variable binding.
Signed-off-by: Petr Pudlak <pudlak@google.com>
Reviewed-by: Klaus Aehlig <aehlig@google.com>
-
Petr Pudlak authored
.. because with the new mechanism the process can be slower, and the job sometimes finished successfully before it could be cancelled.
Signed-off-by: Petr Pudlak <pudlak@google.com>
Reviewed-by: Klaus Aehlig <aehlig@google.com>
-
Klaus Aehlig authored
Make the onTimeWatcher of the job queue scheduler also verify that all notionally running jobs are indeed alive. If a job is found dead, remove it from the list of running jobs and update the job file to reflect the unexpected death.
Signed-off-by: Klaus Aehlig <aehlig@google.com>
Signed-off-by: Petr Pudlak <pudlak@google.com>
Reviewed-by: Petr Pudlak <pudlak@google.com>
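A hedged sketch of the reaping step; isAlive and markDead are hypothetical stand-ins for the real liveness test and job-file update:

    import Control.Monad (foldM)

    partitionM :: Monad m => (a -> m Bool) -> [a] -> m ([a], [a])
    partitionM p = foldM step ([], [])
      where step (ts, fs) x = do
              ok <- p x
              return $ if ok then (x : ts, fs) else (ts, x : fs)

    -- Keep only the jobs that are really alive; record the rest as dead.
    reapDeadJobs :: (job -> IO Bool) -> (job -> IO ()) -> [job] -> IO [job]
    reapDeadJobs isAlive markDead jobs = do
      (alive, dead) <- partitionM isAlive jobs
      mapM_ markDead dead
      return alive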
-
Petr Pudlak authored
We can only send the signal if the job is alive and if there is a process ID in the job file (which means that the signal handler has been installed). If it's missing, we need to wait and retry. In addition, after we send the signal, we wait for the job to actually die, to retain the original semantics.
Signed-off-by: Petr Pudlak <pudlak@google.com>
Reviewed-by: Klaus Aehlig <aehlig@google.com>
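A hedged sketch of the retry loop; readJobPid stands for a hypothetical helper returning the PID recorded in the job file, if any:

    import Control.Concurrent (threadDelay)
    import System.Posix.Signals (Signal, signalProcess)
    import System.Posix.Types (ProcessID)

    -- Wait until the job file carries a PID, then deliver the signal.
    signalJob :: Signal -> IO (Maybe ProcessID) -> IO ()
    signalJob sig readJobPid = do
      mpid <- readJobPid
      case mpid of
        Just pid -> signalProcess sig pid
        Nothing  -> threadDelay 100000 >> signalJob sig readJobPid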
-
Petr Pudlak authored
.. so that it can be seen which lock file was tested and with what result.
Signed-off-by: Petr Pudlak <pudlak@google.com>
Reviewed-by: Klaus Aehlig <aehlig@google.com>
-
Petr Pudlak authored
The functionality is kept the same, but instead of comparing for equality, a more general version based on a predicate is added. This allows basing the condition on only a part of the output. In addition, 'bracket' is added so that the inotify data structure is properly cleaned up even if the inner IO action throws an exception.
Signed-off-by: Petr Pudlak <pudlak@google.com>
Reviewed-by: Klaus Aehlig <aehlig@google.com>
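The cleanup pattern in a minimal sketch, using the hinotify API:

    import Control.Exception (bracket)
    import System.INotify (INotify, initINotify, killINotify)

    -- Release the INotify handle even if the inner action throws.
    withCleanINotify :: (INotify -> IO a) -> IO a
    withCleanINotify = bracket initINotify killINotify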
-
Petr Pudlak authored
.. so that it's possible to use logging operations there.
Signed-off-by: Petr Pudlak <pudlak@google.com>
Reviewed-by: Klaus Aehlig <aehlig@google.com>
-
Petr Pudlak authored
This is a bit problematic, as there is no portable way to list all open file descriptors, and we can't track them all ourselves, because they're also opened by third-party libraries such as inotify. Therefore we use /proc/self/fd and /dev/fd, which should work for all Linux flavors and most *BSDs as well. If both are missing, we don't do anything and just log a warning.
Signed-off-by: Petr Pudlak <pudlak@google.com>
Reviewed-by: Klaus Aehlig <aehlig@google.com>
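A hedged sketch of the enumeration (illustrative only, not Ganeti's actual implementation):

    import Control.Monad (filterM, forM_)
    import System.Directory (doesDirectoryExist, getDirectoryContents)
    import System.Posix.IO (closeFd)

    -- Enumerate open descriptors via /proc/self/fd (Linux) or /dev/fd
    -- (*BSD); if neither exists, return nothing so the caller can log
    -- a warning instead.
    listOpenFds :: IO [Int]
    listOpenFds = do
      dirs <- filterM doesDirectoryExist ["/proc/self/fd", "/dev/fd"]
      case dirs of
        []      -> return []
        (d : _) -> do
          entries <- getDirectoryContents d
          return [fd | e <- entries, [(fd, "")] <- [reads e]]

    -- Close everything above stderr; a production version would have to
    -- tolerate EBADF, since the descriptor used for the listing itself
    -- may already be gone.
    closeInheritedFds :: IO ()
    closeInheritedFds = do
      fds <- listOpenFds
      forM_ [fromIntegral fd | fd <- fds, fd > 2] closeFd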
-
Petr Pudlak authored
`orElse` works just like `mplus` of ResultT, but it only requires `MonadError` and doesn't accumulate the errors; it just returns the second one if both actions fail.
Signed-off-by: Petr Pudlak <pudlak@google.com>
Reviewed-by: Klaus Aehlig <aehlig@google.com>
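A plausible implementation sketch (the actual definition in the commit may differ):

    import Control.Monad.Except (MonadError, catchError)

    -- Try the first action; on failure run the second, keeping only the
    -- second action's error if that one fails too.
    orElse :: MonadError e m => m a -> m a -> m a
    orElse x y = x `catchError` const y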
-
Petr Pudlak authored
If the endpoint (such as Luxid or WConfd) isn't running, don't fail immediately. Instead, keep trying to reconnect within the given timeout.
Signed-off-by: Petr Pudlak <pudlak@google.com>
Reviewed-by: Klaus Aehlig <aehlig@google.com>
-
Petr Pudlak authored
On the Python side it was assumed that the blacklisted private parameters were always dictionaries, but since they're optional, they could be 'None' as well.
Signed-off-by: Petr Pudlak <pudlak@google.com>
Reviewed-by: Klaus Aehlig <aehlig@google.com>
-
Petr Pudlak authored
Since each process now creates only a one-job queue, trying to use file locks would only cause jobs to deadlock. Also reduce the number of threads running in a job queue to 1. Later the job queue will be removed completely.
Signed-off-by: Petr Pudlak <pudlak@google.com>
Reviewed-by: Klaus Aehlig <aehlig@google.com>
-
Petr Pudlak authored
If a Haskell program is compiled with -threaded, then inheriting open file descriptors doesn't work, which breaks our job death detection mechanism. (And on older GHC versions even forking doesn't work.) Therefore let the Luxi daemon check at startup and fail to start if it detects that it has been compiled with -threaded.
Signed-off-by: Petr Pudlak <pudlak@google.com>
Reviewed-by: Klaus Aehlig <aehlig@google.com>
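A minimal sketch of such a check: GHC's Control.Concurrent exposes rtsSupportsBoundThreads, which is True exactly when the program was linked with the threaded RTS:

    import Control.Concurrent (rtsSupportsBoundThreads)
    import Control.Monad (when)
    import System.Exit (die)

    -- Refuse to start under the threaded RTS.
    checkNotThreaded :: IO ()
    checkNotThreaded =
      when rtsSupportsBoundThreads $
        die "luxid must not be compiled with -threaded"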
-
Klaus Aehlig authored
As luxid forks off processes now, it may receive SIGCHLD signals. Hence add a handler for this. Since we obtain the success of the child from the job file, ignoring the signal is good enough.
Signed-off-by: Klaus Aehlig <aehlig@google.com>
Signed-off-by: Petr Pudlak <pudlak@google.com>
Reviewed-by: Petr Pudlak <pudlak@google.com>
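A minimal sketch of installing such a handler:

    import Control.Monad (void)
    import System.Posix.Signals (Handler (Ignore), installHandler, sigCHLD)

    -- Ignore SIGCHLD; a child's result is read from its job file rather
    -- than collected via waitpid.
    ignoreChildSignals :: IO ()
    ignoreChildSignals = void $ installHandler sigCHLD Ignore Nothing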
-
Petr Pudlak authored
.. instead of just letting the master daemon handle them. We try to start all given jobs independently and requeue those that failed.
Signed-off-by: Petr Pudlak <pudlak@google.com>
Reviewed-by: Klaus Aehlig <aehlig@google.com>
-
Petr Pudlak authored
.. which will be used if the Luxi daemon attempts to start a job but fails.
Signed-off-by: Petr Pudlak <pudlak@google.com>
Reviewed-by: Klaus Aehlig <aehlig@google.com>
-