• Iustin Pop's avatar
    Parallelise instance allocation/capacity computation · f828f4aa
    Iustin Pop authored
    This patch finally enables parallelisation in instance placement.
    
    My original try for enabling this didn't work well, but it took a
    while (and liberal use of threadscope) to understand why. The attempt
    was to simply `parMap rwhnf` over allocateOnPair, however this is not
    good as for a 100-node cluster, this will create roughly 100*100
    sparks, which is way too much: each individual spark is too small, and
    there are too many sparks. Furthermore, the combining of the
    allocateOnPair results was done single-threaded, losing even more
    parallelism. So we had O(n²) sparks to run in parallel, each spark of
    size O(1), and we combine single-threadedly a list of O(n²) length.
    
    The new algorithm does a two-stage process: we group the list of valid
    pairs per primary node, relying on the fact that usually the secondary
    nodes are somewhat balanced (it's definitely true for 'blank' cluster
    computations). We then run in parallel over all primary nodes, doing
    both the individual allocateOnPair calls *and* the concatAllocs
    summarisation. This leaves only the summing of the primary group
    results together for the main execution thread. The new numbers are:
    O(n) sparks, each of size O(n), and we combine single-threadedly a
    list of O(n) length.
    
    This translates directly into a reasonable speedup (relative numbers
    for allocation of 3 instances on a 120-node cluster):
    
    - original code (non-threaded): 1.00 (baseline)
    - first attempt (2 threads):    0.81 (20% slowdown
    
    )
    - new code (non-threaded):      1.00 (no slowdown)
    - new code (threaded/1 thread): 1.00
    - new code (2 threads):         1.65 (65% faster)
    
    We don't get a 2x speedup, because the GC time increases. Fortunately
    the code should scale well to more cores, so on many-core machines we
    should get a nice overall speedup. On a different machine with 4
    cores, we get 3.29x.
    Signed-off-by: default avatarIustin Pop <iustin@google.com>
    Reviewed-by: default avatarAgata Murawska <agatamurawska@google.com>
    f828f4aa
Cluster.hs 62.7 KB