[Haskell-cafe] Re: forkIO on multicore

Mon Dec 22 09:34:25 EST 2008

Duncan Coutts wrote:
> On Fri, 2008-12-19 at 10:42 -0600, Jake McArthur wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> Paul Keir wrote:
>>> fibs = 0 : 1 : zipWith (+) fibs (tail fibs)
>> This is a CAF (Constant Applicative Form). Since it is actually a
>> constant it is never garbage collected, and is always shared, so each
>> thread is only calculating it once. You have essentially created a
>> lookup table.
> 
> Though note that with all our obvious suggestions there is still no
> speedup:
> 
> heavytask m n = putMVar m $! (fibs !! 100000)
>   where
>     fibs = n : (n+1) : zipWith (+) fibs (tail fibs)
> 
> -- so now fibs is not globally shared but is used per-heavytask
> -- it is also evaluated by heavy task rather than just putting a thunk
> -- into the MVar
> 
> main = do ms <- sequence $ replicate 8 newEmptyMVar
>           sequence_
>             [ forkIO (heavytask m n)
>             | (m, n) <- zip ms [0..] ]
>           ms' <- mapM takeMVar ms
>           mapM_ print ms'
> 
> Looking at the GC stats (+RTS -t -RTS) we see that the majority of the
> time in this program is spent doing GC and that when we run with -N4 the
> time spent doing GC is even higher.

This is an interesting example.  It shows up a weakness in the GC that I'm 
working on fixing right now.

The interesting aspect of this example is that the thread stacks get large. 
  You can see this by using +RTS -hT: a large chunk of the heap is taken up 
by TSOs. Each of those (fibs !! 100000) requires linear stack, because 
(fibs 100000) depends on (fibs 99999), and so on.  That could probably be 
fixed by adding some strictness, but that's not the goal here - we should 
still be able to run the program in parallel.

So when there are large stacks around, GC takes a long time because it has 
to traverse stacks.  But we should be able to alleviate the problem by (a) 
using a larger heap, and (b) using parallel GC.  Not doing parallel GC is 
seriously going to hurt peformance, because the data will have to be moved 
from one CPU's cache to another.  But it turns out that parallel GC is 
misbehaving on this example too, because it doesn't force each stack to be 
scanned by the same thread that is executing it - I'm working on fixing that.

Having each CPU be able to GC independently would be a big improvement, of 
course.  We think we understand how this can be done in the context of GHC, 
it's just a matter of doing it, but it's a big job.

Parallel performance is something that we expect to make dramatic 
improvements over the next few months as we investigate more programs and 
improve the tools.  The current HEAD is already a lot better than 6.10.1.

Cheers,
	Simon