Runtime performance degradation for multi-threaded C FFI callback

Mon Jan 23 14:26:13 CET 2012

I'll need to analyse the program to see what's going on.  There was a 
small change to the scheduler between 7.2.1 and 7.2.2 that could 
conceivably have made a difference in this scenario, but it was aimed at 
fixing a bug rather than improving performance.

Another possibility is a difference in OS scheduling behaviour between 
yours and Daniel Fischer's setup.  In microbenchmarks like this, it's 
easy for a difference in OS scheduling behaviour to make a large 
difference in performance if it happens consistently.

Cheers,
	Simon

On 23/01/2012 12:49, John Lato wrote:
> Hi Simon,
>
> I'm not certain that your explanation matches what I observed.
>
> All of my tests were done on a 4-core machine, executing with "+RTS
> -N", which should be the same as "+RTS -N4" I believe.
>
> With 1 Haskell thread (the main thread) and 4 process threads (via
> pthreads), I saw a significant performance degradation compared to 5
> Haskell threads (main + 4 via forkIO) and 4 process threads.  As I
> understand your explanation, if C callbacks are scheduled according to
> available capabilities, there should be no difference between these
> situations.
>
> I observed this with GHC-7.2.1, however Daniel Fischer reported that,
> with ghc-7.2.2, he observed different behavior (which matches your
> explanation AFAICT).  Is it possible that the scheduling of callbacks
> into Haskell changed between those versions?
>
> Thanks,
> John L.
>
>> From: Simon Marlow<marlowsd at gmail.com>
>> Subject: Re: Runtime performance degradation for multi-threaded C FFI
>>         callback
>> To: Sanket Agrawal<sanket.agrawal at gmail.com>
>> Cc: glasgow-haskell-users<glasgow-haskell-users at haskell.org>
>> Message-ID:<4F1D2F4D.9050709 at gmail.com>
>> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>>
>> On 21/01/2012 15:35, Sanket Agrawal wrote:
>>> Hi Edward,
>>>
>>> I was just going to get back to you about it. I did find out that the
>>> issue was indeed one GHC thread dealing with 5 C threads for callback
>>> (1:5 mapping) - so, the C threads were blocking on callback waiting for
>>> the only GHC thread to be available. I updated the code to do 1:1
>>> mapping - 5 GHC threads for 5 C threads. That proved to be almost
>>> linearly scalable.
>>
>> This is almost right, except that your callbacks are not waiting for a
>> GHC *thread*, but what we call a "capability", which is roughly speaking
>> "permission to execute Haskell code".  The +RTS -N option chooses the
>> number of capabilities.
>>
>> I expect that with -N1, your program is spending a lot of time just
>> switching between the different OS threads.
>>
>> It's possible that we could make the runtime more flexible here.  I
>> recently made it possible to modify the number of capabilities at
>> runtime, so it's conceivable that the runtime could automatically add
>> capabilities if it is being called from multiple OS threads.
>>
>>> John Lato suggested the above approach two days back, but I didn't get
>>> to test the idea until now.
>>>
>>> It doesn't seem to matter whether the number of GHC threads is increased,
>>> if the mapping between GHC threads and C threads is not 1:1. I got 1:1
>>> mapping by doing forkIO for each C thread. Is it really possible to do
>>> 7:5 mapping (that is 7 GHC threads to choose from, for 5 C threads
>>> during callback)? I can't think of a way to do it. Not that I need it. I
>>> am just curious if that is possible.
>>
>> Just think of +RTS -N7 as being 7 *locks*, not 7 threads.  Then it makes
>> perfect sense to have 7 locks available for 5 threads.
>>
>> Cheers,
>>         Simon
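
The "7 locks, not 7 threads" picture above can be sketched in plain C with
pthreads. This is only a toy model of the analogy, not how the GHC runtime
is implemented: the names cap_pool, cap_acquire, cap_release and run_model
are hypothetical, and the "callback" body is empty. Each OS thread takes one
permit before entering the callback and returns it afterwards, so the number
of permits (the -N value in the analogy) bounds how many callbacks can run
at once.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical model: "capabilities" as a pool of counted permits. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  freed;
    int  free_caps;    /* permits currently available            */
    int  in_use;       /* permits currently held                 */
    int  peak_in_use;  /* highest value in_use has ever reached  */
    long calls;        /* callbacks completed so far             */
} cap_pool;

typedef struct { cap_pool *pool; int iters; } worker_arg;

/* Block until a permit is free, then take it. */
static void cap_acquire(cap_pool *p) {
    pthread_mutex_lock(&p->lock);
    while (p->free_caps == 0)
        pthread_cond_wait(&p->freed, &p->lock);
    p->free_caps--;
    p->in_use++;
    if (p->in_use > p->peak_in_use)
        p->peak_in_use = p->in_use;
    pthread_mutex_unlock(&p->lock);
}

/* Return a permit and wake one waiting thread, if any. */
static void cap_release(cap_pool *p) {
    pthread_mutex_lock(&p->lock);
    assert(p->in_use > 0);
    p->in_use--;
    p->free_caps++;
    p->calls++;
    pthread_cond_signal(&p->freed);
    pthread_mutex_unlock(&p->lock);
}

/* One OS thread making `iters` callbacks, each guarded by a permit. */
static void *worker(void *v) {
    worker_arg *a = v;
    for (int i = 0; i < a->iters; i++) {
        cap_acquire(a->pool);
        /* the callback into Haskell would run here */
        cap_release(a->pool);
    }
    return NULL;
}

/* Run n_threads workers (at most 64) against n_caps permits.
   Returns the number of completed callbacks, or -1 on bad arguments;
   *peak receives the largest number of permits held at one time. */
long run_model(int n_caps, int n_threads, int iters, int *peak) {
    pthread_t tids[64];
    cap_pool pool;
    worker_arg arg = { &pool, iters };

    if (n_caps < 1 || n_threads < 1 || n_threads > 64)
        return -1;

    pthread_mutex_init(&pool.lock, NULL);
    pthread_cond_init(&pool.freed, NULL);
    pool.free_caps = n_caps;
    pool.in_use = 0;
    pool.peak_in_use = 0;
    pool.calls = 0;

    for (int i = 0; i < n_threads; i++) {
        if (pthread_create(&tids[i], NULL, worker, &arg) != 0)
            return -1;
    }
    for (int i = 0; i < n_threads; i++)
        pthread_join(tids[i], NULL);

    *peak = pool.peak_in_use;
    pthread_cond_destroy(&pool.freed);
    pthread_mutex_destroy(&pool.lock);
    return pool.calls;
}
```

Calling run_model(7, 5, 1000, &peak) models the 7:5 case asked about: no
thread ever has to wait, because there are more permits than threads.
Calling run_model(1, 5, 1000, &peak) models -N1, where the five threads take
turns on a single permit. Build with -pthread.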