ghci and ghc -threaded [slowdown]
Simon Marlow
marlowsd at gmail.com
Mon Dec 15 10:02:14 EST 2008
Malcolm Wallace wrote:
>> It seems that the problem you have is that moving to the multithreaded
>> runtime imposes an overhead on the communication between your two
>> threads, when run on a *single CPU*. But performance on a single CPU
>> is not what you're interested in - you said you wanted parallelism,
>> and for that you need multiple CPUs, and hence multiple OS threads.
>
> Well, I'm interested in getting an absolute speedup. If the threaded
> performance on a single core is slightly slower than the non-threaded
> performance on a single core, that would be OK provided that the
> threaded performance using multiple cores was better than the same
> non-threaded baseline.
>
> However, it doesn't seem to work like that at all. In fact, threaded on
> multiple cores was _even_slower_ than threaded on a single core!
Entirely possible - unless there's some actual parallelism, running on
multiple cores will probably slow things down due to thread migration.
> Here are some figures:
>
> ghc-6.8.2 -O2
>             apply    MVar  strict  thr-N2  thr-N1
> silicium     7.30    7.95    7.23   15.25   14.71
> neghip       4.25    4.43    4.18    6.67    6.48
> hydrogen    11.75   10.82   10.99   13.45   12.96
> lobster     55.8    51.5    57.6    76.6    74.5
>
> The first three columns are variations of the program using slightly
> different communications mechanisms, including threads/MVars with the
> non-threaded RTS. The final two columns are for the MVar mechanism
> with threaded RTS and either 1 or 2 cores. -N2 is slowest.
So you're not getting any parallelism at all: for some reason your program
is being sequentialised. There could be any number of reasons for this.
>> I suspect the underlying problem in your program is that the
>> communication is synchronous. To get good parallelism you'll need to
>> use asynchronous communication, otherwise even on multiple CPUs
>> you'll see little parallelism.
>
> I tried using Chans instead of MVars, to provide for different speeds of
> reader/writer, but the timings were even worse. (Add another 15-100%.)
That would seem to indicate that your program is doing a lot of
communication - I'd look at trying to reduce that, for example by
increasing the task size. However, the amount of communication is
obviously not the only issue; there also seems to be some kind of
dependency that sequentialises the program.
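
As an illustration of what increasing the task size might look like, here is
a minimal sketch (the chunk size, the Chan-based pipeline and all names are
hypothetical, not taken from the actual program): batch many work items into
each Chan operation, so the per-communication overhead is amortised over a
larger unit of work.

  import Control.Concurrent (forkIO)
  import Control.Concurrent.Chan (newChan, writeChan, readChan)
  import Control.Monad (replicateM)

  -- Split a list into chunks of n elements.
  chunks :: Int -> [a] -> [[a]]
  chunks _ [] = []
  chunks n xs = let (h, t) = splitAt n xs in h : chunks n t

  main :: IO ()
  main = do
    chan <- newChan
    let input     = [1 .. 100000] :: [Int]
        chunkSize = 1000                   -- hypothetical granularity
        batches   = chunks chunkSize input
    -- Producer: one writeChan per *chunk* rather than per element.
    _ <- forkIO (mapM_ (writeChan chan) batches)
    -- Consumer: read the same number of chunks and combine the results.
    results <- replicateM (length batches) (readChan chan)
    print (sum (map sum results))

Whether this helps depends on where the real work happens; combined with
forcing each chunk before sending it (see the sketch below), it also keeps
the computation in the producer thread.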
Are you sure that you're not accidentally communicating thunks, and hence
doing all the computation in one of the threads? That's a common pitfall
that has caught me more than once.
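
As a minimal sketch of how to avoid that (hypothetical code; it assumes the
deepseq package, which is newer than GHC 6.8 - with plain seq/evaluate you
only get weak head normal form): force the value on the producer side before
it goes into the MVar, so the consumer receives an already-evaluated result
rather than a thunk.

  import Control.Concurrent (MVar, forkIO, newEmptyMVar, putMVar, takeMVar)
  import Control.DeepSeq (NFData, force)
  import Control.Exception (evaluate)

  -- Hypothetical producer: force the result *in this thread* before
  -- handing it over, so the computation is not deferred to the consumer.
  producer :: NFData a => MVar a -> a -> IO ()
  producer box x = do
    x' <- evaluate (force x)
    putMVar box x'

  main :: IO ()
  main = do
    box <- newEmptyMVar
    _ <- forkIO (producer box (sum [1 .. 10000000 :: Int]))
    result <- takeMVar box      -- arrives fully evaluated
    print result

For a flat type like Int, evaluate (or seq) alone would be enough; force only
matters for structured data such as lists or records, where WHNF leaves most
of the work undone.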
Do you know roughly the amount of parallelism you expect - i.e. the amount
of work done by each thread?
> When I have time to look at this again (probably in the New Year), I
> will try some other strategies for communication that vary in their
> synchronous/asynchronous chunk size, to see if I can pin things down
> more closely.
That would be good. At some point we hope to provide some kind of
visualisation to let you see where the parallel performance bottlenecks in
your program are; there are various ongoing efforts, but nothing usable as yet.
Cheers,
Simon