[Haskell-cafe] Parallel weirdness [new insights]
Andrew Coppin
andrewcoppin at btinternet.com
Sun Apr 20 15:41:52 EDT 2008
OK, well I now have so much data sitting in from of me I don't even know
*what* I'm seeing any more. I have made several significant discoveries
though...
Consider the following:
msort [] = []
msort [x] = [x]
msort xs =
let
xs0 = msort (split0 xs)
xs1 = msort (split1 xs)
in merge xs0 xs1
This takes roughly 14 seconds to sort a list of one million Word32
values. If I now change the final line to read
in listSeq rwhnf xs0 `seq` listSeq rwhnf xs1 `seq` merge xs0 xs1
it now takes 8 seconds to do the same job.
Notice that this is still completely sequential execution. It's just
executing in a different order. (And, at first glance, doing slightly
more work.) Of all the benchmarks I've performed, I have yet to find
anything that goes faster than this. If I make it so that xs0 is
computed in parallel with xs1 instead of in series, then it goes at
roughly the same speed (but with more variation) if the number of real
threads is 1. If you add more real threads, execution slows down. (Go
figure!) I was expecting running parallel at just the top few levels and
then switching to pure sequential for the lower levels to give the best
performance. But the numbers I have seem to say that more parallel =
slower, with 100% sequential giving me the fastest time of all.
The next insight happens when you look at the GC statistics. Both the
unmarked and the explicitly sequential program are giving me roughly 55%
GC time and 45% user time. (!!) Obviously this is a Very Bad Thing. I
discovered that simply by adding -H200m to the command line, I can make
both programs speed up by about 20% or so. (They then drop down to
roughly 25% GC time. Adding more RAM doesn't seem to make any difference.)
I had assumed that the explicitly sequential program was faster because
it was somehow demanding less GC time, but that doesn't appear to be the
case - both GC time and user time are lower for the explicitly
sequential version. And adding more heap space doesn't make the (large)
speed difference go away. Is the strictness of the seq operator making
GHC come up with different a Core implementation for this function or
something? I have no idea.
With the extra heap space, the speed difference between the sequential
and parallel programs becomes smaller. The sequential version *is* still
faster, however. I have no explanation for why that might be. Adding
more heap also seems to make the runtimes more variable. (I run each
test 8 times. One test, the fastest run was 7 seconds and the slowest
was 11 seconds. That's quite a variation. The sequential algorithm only
varies by a few milliseconds each time.)
In short, it seems my little sorting algorithm test is *actually* just
stressing out the GC engine, and I'm "really" benchmarking different GC
settings. :-(
Questions:
1. Does running the GC force all threads to stop? I know some GC designs
do this, but I have no idea how the one GHC implements works.
2. Is the GC still single-threaded? (GHC 6.8.2 here.)
3. Is there any way for a running Haskell program to find out how much
heap space is currently allocated / used / free? I know you can find out
how much wall time and CPU time you've used, but I couldn't find
anything for RAM usage.
More information about the Haskell-Cafe
mailing list