[Haskell-cafe] Parallel weirdness [new insights]

Sun Apr 20 15:41:52 EDT 2008

OK, well I now have so much data sitting in from of me I don't even know 
*what* I'm seeing any more. I have made several significant discoveries 
though...

Consider the following:

  msort [] = []
  msort [x] = [x]
  msort xs =
    let
      xs0 = msort (split0 xs)
      xs1 = msort (split1 xs)
    in merge xs0 xs1

This takes roughly 14 seconds to sort a list of one million Word32 
values. If I now change the final line to read

  in listSeq rwhnf xs0 `seq` listSeq rwhnf xs1 `seq` merge xs0 xs1

it now takes 8 seconds to do the same job.

Notice that this is still completely sequential execution. It's just 
executing in a different order. (And, at first glance, doing slightly 
more work.) Of all the benchmarks I've performed, I have yet to find 
anything that goes faster than this. If I make it so that xs0 is 
computed in parallel with xs1 instead of in series, then it goes at 
roughly the same speed (but with more variation) if the number of real 
threads is 1. If you add more real threads, execution slows down. (Go 
figure!) I was expecting running parallel at just the top few levels and 
then switching to pure sequential for the lower levels to give the best 
performance. But the numbers I have seem to say that more parallel = 
slower, with 100% sequential giving me the fastest time of all.

The next insight happens when you look at the GC statistics. Both the 
unmarked and the explicitly sequential program are giving me roughly 55% 
GC time and 45% user time. (!!) Obviously this is a Very Bad Thing. I 
discovered that simply by adding -H200m to the command line, I can make 
both programs speed up by about 20% or so. (They then drop down to 
roughly 25% GC time. Adding more RAM doesn't seem to make any difference.)

I had assumed that the explicitly sequential program was faster because 
it was somehow demanding less GC time, but that doesn't appear to be the 
case - both GC time and user time are lower for the explicitly 
sequential version. And adding more heap space doesn't make the (large) 
speed difference go away. Is the strictness of the seq operator making 
GHC come up with different a Core implementation for this function or 
something? I have no idea.

With the extra heap space, the speed difference between the sequential 
and parallel programs becomes smaller. The sequential version *is* still 
faster, however. I have no explanation for why that might be. Adding 
more heap also seems to make the runtimes more variable. (I run each 
test 8 times. One test, the fastest run was 7 seconds and the slowest 
was 11 seconds. That's quite a variation. The sequential algorithm only 
varies by a few milliseconds each time.)

In short, it seems my little sorting algorithm test is *actually* just 
stressing out the GC engine, and I'm "really" benchmarking different GC 
settings. :-(

Questions:

1. Does running the GC force all threads to stop? I know some GC designs 
do this, but I have no idea how the one GHC implements works.

2. Is the GC still single-threaded? (GHC 6.8.2 here.)

3. Is there any way for a running Haskell program to find out how much 
heap space is currently allocated / used / free? I know you can find out 
how much wall time and CPU time you've used, but I couldn't find 
anything for RAM usage.