Runtime performance degradation for multi-threaded C FFI callback

Mon Jan 23 13:49:16 CET 2012

Hi Simon,

I'm not certain that your explanation matches what I observed.

All of my tests were done on a 4-core machine, executing with "+RTS
-N", which should be the same as "+RTS -N4" I believe.

With 1 Haskell thread (the main thread) and 4 process threads (via
pthreads), I saw a significant performance degradation compared to 5
Haskell threads (main + 4 via forkIO) and 4 process threads.  As I
understand your explanation, if C callbacks are scheduled according to
available capabilities, there should be no difference between these
situations.

I observed this with GHC-7.2.1, however Daniel Fischer reported that,
with ghc-7.2.2, he observed different behavior (which matches your
explanation AFAICT).  Is it possible that the scheduling of callbacks
into Haskell changed between those versions?

Thanks,
John L.

> From: Simon Marlow <marlowsd at gmail.com>
> Subject: Re: Runtime performance degradation for multi-threaded C FFI
>        callback
> To: Sanket Agrawal <sanket.agrawal at gmail.com>
> Cc: glasgow-haskell-users <glasgow-haskell-users at haskell.org>
> Message-ID: <4F1D2F4D.9050709 at gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> On 21/01/2012 15:35, Sanket Agrawal wrote:
>> Hi Edward,
>>
>> I was just going to get back to you about it. I did find out that the
>> issue was indeed one GHC thread dealing with 5 C threads for callback
>> (1:5 mapping) - so, the C threads were blocking on callback waiting for
>> the only GHC thread to be available. I updated the code to do 1:1
>> mapping - 5 GHC threads for 5 C threads. That proved to be almost
>> linearly scalable.
>
> This is almost right, except that your callbacks are not waiting for a
> GHC *thread*, but what we call a "capability", which is roughly speaking
> "permission to execute Haskell code".  The +RTS -N option chooses the
> number of capabilities.
>
> I expect that with -N1, your program is spending a lot of time just
> switching between the different OS threads.
>
> It's possible that we could make the runtime more flexible here.  I
> recently made it possible to modify the number of capabilities at
> runtime, so it's conceivable that the runtime could automatically add
> capabilities if it is being called from multiple OS threads.
>
>> John Latos suggested the above approach two days back, but I didn't get
>> to test the idea until now.
>>
>> It doesn't seem to matter whether number of GHC threads are increased,
>> if the mapping between GHC threads and C threads is not 1:1. I got 1:1
>> mapping by doing forkIO for each C thread. Is it really possible to do
>> 7:5 mapping (that is 7 GHC threads to choose from, for 5 C threads
>> during callback)? I can't think of a way to do it. Not that I need it. I
>> am just curious if that is possible.
>
> Just think of +RTS -N7 as being 7 *locks*, not 7 threads.  Then it makes
> perfect sense to have 7 locks available for 5 threads.
>
> Cheers,
>        Simon