[Haskell-cafe] Multicore speedup and spark conversion with ghc HEAD

Eric Rannaud eric.rannaud at gmail.com
Thu Sep 2 15:19:42 EDT 2010


Follows three runs of a stupid program that should see a speedup (I believe).

1. Compiled with ghc-6.12.3. Run on 1 thread, 4.0 s (real).
2. Compiled with ghc-6.12.3. Run on 2 threads, spark converted, 5.4 s
(real), 7.7 s (user).
3. Compiled with ghc-6.13.20100831. Run on 2 threads, but spark *not*
converted 4.1 s (real).

All runs with disabled parallel GC, which negatively impacts -N2 runs
(whether the spark is converted or not).

Two questions:

A. Why is run (2) slower than (1)? GC is a bit more expensive, as one
would expect, but not enough to explain the difference (only explains
0.3 s).

B. ghc HEAD doesn't convert the spark. Why is that?

Thanks.

----
module Main where

import Control.Parallel

fac :: Integer -> Integer
fac n = aux n 1
  where aux 0 _ = 0
        aux 1 m = m
        aux n m =
          let p = n * m in
          p `pseq` aux (n - 1) p

main = do
  let x = fac 100000
      y = fac 100001 in
    print (x `par` y `pseq` (x + y))
----


$ ghc-6.12.3 -O2 -threaded -rtsopts Test.hs --make

$ time ./Test +RTS -N1 -qg -sstderr > /dev/null
./Test +RTS -N1 -qg -sstderr
  20,460,660,928 bytes allocated in the heap
       2,492,368 bytes copied during GC
         462,128 bytes maximum residency (3 sample(s))
          76,576 bytes maximum slop
               4 MB total memory in use (1 MB lost due to fragmentation)

  Generation 0: 35093 collections,     0 parallel,  0.14s,  0.17s elapsed
  Generation 1:     3 collections,     0 parallel,  0.00s,  0.00s elapsed

                        MUT time (elapsed)       GC time  (elapsed)
  Task  0 (worker) :    0.00s    (  0.00s)       0.00s    (  0.00s)
  Task  1 (worker) :    0.00s    (  3.84s)       0.00s    (  0.00s)
  Task  2 (bound)  :    3.87s    (  3.84s)       0.14s    (  0.17s)

  SPARKS: 1 (0 converted, 1 pruned)

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    3.83s  (  3.84s elapsed)
  GC    time    0.14s  (  0.17s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    3.97s  (  4.02s elapsed)

  %GC time       3.6%  (4.3% elapsed)

  Alloc rate    5,345,814,403 bytes per MUT second

  Productivity  96.4% of total user, 95.2% of total elapsed

gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].steps[0].sync_large_objects: 0
gen[0].steps[1].sync_large_objects: 0
gen[1].steps[0].sync_large_objects: 0

real    0m4.022s
user    0m3.970s
sys     0m0.051s


$ time ./Test +RTS -N2 -qg -sstderr > /dev/null
        ./Test +RTS -N2 -qg -sstderr
  20,600,461,512 bytes allocated in the heap
       3,585,432 bytes copied during GC
         458,480 bytes maximum residency (14 sample(s))
          76,576 bytes maximum slop
               5 MB total memory in use (1 MB lost due to fragmentation)

  Generation 0: 34803 collections,     0 parallel,  0.43s,  0.48s elapsed
  Generation 1:    14 collections,     0 parallel,  0.00s,  0.00s elapsed

                        MUT time (elapsed)       GC time  (elapsed)
  Task  0 (worker) :    0.00s    (  0.00s)       0.00s    (  0.00s)
  Task  1 (worker) :    3.72s    (  4.91s)       0.24s    (  0.25s)
  Task  2 (bound)  :    3.91s    (  4.91s)       0.19s    (  0.23s)
  Task  3 (worker) :    0.00s    (  4.91s)       0.00s    (  0.00s)

  SPARKS: 1 (1 converted, 0 pruned)

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    7.29s  (  4.91s elapsed)
  GC    time    0.43s  (  0.48s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    7.72s  (  5.39s elapsed)

  %GC time       5.6%  (8.9% elapsed)

  Alloc rate    2,824,733,790 bytes per MUT second

  Productivity  94.4% of total user, 135.2% of total elapsed

gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].steps[0].sync_large_objects: 0
gen[0].steps[1].sync_large_objects: 0
gen[1].steps[0].sync_large_objects: 0

real    0m5.397s
user    0m7.724s
sys     0m0.443s



$ ghc-6.13.20100831 -O2 -threaded -rtsopts Test.hs

$ time ./Test +RTS -N2 -qg -sstderr > /dev/null
./Test +RTS -N2 -qg -sstderr
  20,458,286,584 bytes allocated in the heap
       2,705,816 bytes copied during GC
         463,520 bytes maximum residency (3 sample(s))
          73,032 bytes maximum slop
               5 MB total memory in use (1 MB lost due to fragmentation)

  Generation 0: 35088 collections,     0 parallel,  0.19s,  0.22s elapsed
  Generation 1:     3 collections,     0 parallel,  0.00s,  0.00s elapsed

                        MUT time (elapsed)       GC time  (elapsed)
  Task  0 (worker) :    0.00s    (  0.00s)       0.00s    (  0.00s)
  Task  1 (worker) :    0.00s    (  3.92s)       0.00s    (  0.00s)
  Task  2 (bound)  :    3.90s    (  3.92s)       0.19s    (  0.22s)
  Task  3 (worker) :    0.00s    (  3.92s)       0.00s    (  0.00s)

  SPARKS: 1 (0 converted, 1 pruned)

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    3.84s  (  3.92s elapsed)
  GC    time    0.19s  (  0.22s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    4.03s  (  4.14s elapsed)

  %GC time       4.7%  (5.2% elapsed)

  Alloc rate    5,329,874,595 bytes per MUT second

  Productivity  95.3% of total user, 92.7% of total elapsed

gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync_large_objects: 0
gen[1].sync_large_objects: 0

real    0m4.144s
user    0m4.027s
sys     0m0.059s


More information about the Haskell-Cafe mailing list