[Haskell-cafe] Re: forkIO on multicore
marlowsd at gmail.com
Mon Dec 22 09:34:25 EST 2008
Duncan Coutts wrote:
> On Fri, 2008-12-19 at 10:42 -0600, Jake McArthur wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>> Paul Keir wrote:
>>> fibs = 0 : 1 : zipWith (+) fibs (tail fibs)
>> This is a CAF (Constant Applicative Form). Since it is actually a
>> constant it is never garbage collected, and is always shared, so each
>> thread is only calculating it once. You have essentially created a
>> lookup table.
> Though note that with all our obvious suggestions there is still no
> heavytask m n = putMVar m $! (fibs !! 100000)
> fibs = n : (n+1) : zipWith (+) fibs (tail fibs)
> -- so now fibs is not globally shared but is used per-heavytask
> -- it is also evaluated by heavy task rather than just putting a thunk
> -- into the MVar
> main = do ms <- sequence $ replicate 8 newEmptyMVar
> [ forkIO (heavytask m n)
> | (m, n) <- zip ms [0..] ]
> ms' <- mapM takeMVar ms
> mapM_ print ms'
> Looking at the GC stats (+RTS -t -RTS) we see that the majority of the
> time in this program is spent doing GC and that when we run with -N4 the
> time spent doing GC is even higher.
This is an interesting example. It shows up a weakness in the GC that I'm
working on fixing right now.
The interesting aspect of this example is that the thread stacks get large.
You can see this by using +RTS -hT: a large chunk of the heap is taken up
by TSOs. Each of those (fibs !! 100000) requires linear stack, because
(fibs 100000) depends on (fibs 99999), and so on. That could probably be
fixed by adding some strictness, but that's not the goal here - we should
still be able to run the program in parallel.
So when there are large stacks around, GC takes a long time because it has
to traverse stacks. But we should be able to alleviate the problem by (a)
using a larger heap, and (b) using parallel GC. Not doing parallel GC is
seriously going to hurt peformance, because the data will have to be moved
from one CPU's cache to another. But it turns out that parallel GC is
misbehaving on this example too, because it doesn't force each stack to be
scanned by the same thread that is executing it - I'm working on fixing that.
Having each CPU be able to GC independently would be a big improvement, of
course. We think we understand how this can be done in the context of GHC,
it's just a matter of doing it, but it's a big job.
Parallel performance is something that we expect to make dramatic
improvements over the next few months as we investigate more programs and
improve the tools. The current HEAD is already a lot better than 6.10.1.
More information about the Haskell-Cafe