Runtime performance degradation for multi-threaded C FFI callback
jwlato at gmail.com
Mon Jan 23 13:49:16 CET 2012
I'm not certain that your explanation matches what I observed.
All of my tests were done on a 4-core machine, executing with "+RTS
-N", which should be the same as "+RTS -N4" I believe.
With 1 Haskell thread (the main thread) and 4 process threads (via
pthreads), I saw a significant performance degradation compared to 5
Haskell threads (main + 4 via forkIO) and 4 process threads. As I
understand your explanation, if C callbacks are scheduled according to
available capabilities, there should be no difference between these.
I observed this with GHC-7.2.1, however Daniel Fischer reported that,
with ghc-7.2.2, he observed different behavior (which matches your
explanation AFAICT). Is it possible that the scheduling of callbacks
into Haskell changed between those versions?
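For anyone who wants to check what the runtime is doing on their own GHC version, here is a minimal sketch of the Haskell side only. It is not the test program discussed above: `forkReporters` is a hypothetical helper, the thread body is a placeholder for whatever would service the C callbacks, and the pthreads side is omitted entirely. It assumes the program is built with -threaded and run with "+RTS -N".

```haskell
import Control.Concurrent (forkIO, getNumCapabilities, myThreadId, threadCapability)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forM)

-- Fork `n` Haskell threads; each reports the capability it is running on.
-- The body is a hypothetical placeholder for work that services C callbacks.
forkReporters :: Int -> IO [(Int, Int)]
forkReporters n = do
  dones <- forM [1 .. n] $ \i -> do
    done <- newEmptyMVar
    _ <- forkIO $ do
      (cap, _pinned) <- threadCapability =<< myThreadId
      putMVar done (i, cap)
    return done
  -- Wait for every forked thread to report before returning.
  mapM takeMVar dones

main :: IO ()
main = do
  -- With "+RTS -N" this should match the number of cores the RTS detected.
  caps <- getNumCapabilities
  putStrLn ("capabilities: " ++ show caps)
  results <- forkReporters 4
  mapM_ (\(i, cap) -> putStrLn ("thread " ++ show i ++ " on capability " ++ show cap)) results
```

Which capability each thread lands on is up to the scheduler, so the output will vary from run to run.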
> From: Simon Marlow <marlowsd at gmail.com>
> Subject: Re: Runtime performance degradation for multi-threaded C FFI
> To: Sanket Agrawal <sanket.agrawal at gmail.com>
> Cc: glasgow-haskell-users <glasgow-haskell-users at haskell.org>
> Message-ID: <4F1D2F4D.9050709 at gmail.com>
> On 21/01/2012 15:35, Sanket Agrawal wrote:
>> Hi Edward,
>> I was just going to get back to you about it. I did find out that the
>> issue was indeed one GHC thread dealing with 5 C threads for callback
>> (1:5 mapping) - so, the C threads were blocking on callback waiting for
>> the only GHC thread to be available. I updated the code to do 1:1
>> mapping - 5 GHC threads for 5 C threads. That proved to be almost
>> linearly scalable.
> This is almost right, except that your callbacks are not waiting for a
> GHC *thread*, but what we call a "capability", which is roughly speaking
> "permission to execute Haskell code". The +RTS -N option chooses the
> number of capabilities.
> I expect that with -N1, your program is spending a lot of time just
> switching between the different OS threads.
> It's possible that we could make the runtime more flexible here. I
> recently made it possible to modify the number of capabilities at
> runtime, so it's conceivable that the runtime could automatically add
> capabilities if it is being called from multiple OS threads.
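As a rough illustration of what changing the capability count from within a program might look like: the sketch below assumes the feature mentioned here is the one exposed as `setNumCapabilities` in Control.Concurrent (present in later versions of base, not in 7.2.x as far as I know). `ensureCapabilities` is a hypothetical helper, not something the runtime provides, and the program has to be linked with -threaded for the call to have any effect.

```haskell
import Control.Concurrent (getNumCapabilities, setNumCapabilities)

-- Hypothetical helper: raise the capability count to at least `wanted`.
-- It never lowers the count, and returns whatever the RTS reports afterwards.
ensureCapabilities :: Int -> IO Int
ensureCapabilities wanted = do
  current <- getNumCapabilities
  if wanted > current
    then setNumCapabilities wanted >> getNumCapabilities
    else return current

main :: IO ()
main = do
  before <- getNumCapabilities
  -- 5 is an arbitrary example value, e.g. one capability per calling OS thread.
  after <- ensureCapabilities 5
  putStrLn ("capabilities before: " ++ show before ++ ", after: " ++ show after)
```

This is done by the program itself, by hand. It is not the automatic behaviour speculated about above.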
>> John Latos suggested the above approach two days back, but I didn't get
>> to test the idea until now.
>> It doesn't seem to matter whether the number of GHC threads is increased,
>> if the mapping between GHC threads and C threads is not 1:1. I got 1:1
>> mapping by doing forkIO for each C thread. Is it really possible to do
>> 7:5 mapping (that is 7 GHC threads to choose from, for 5 C threads
>> during callback)? I can't think of a way to do it. Not that I need it. I
>> am just curious if that is possible.
> Just think of +RTS -N7 as being 7 *locks*, not 7 threads. Then it makes
> perfect sense to have 7 locks available for 5 threads.
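The "locks" picture can be played with in a toy model. The sketch below is only an analogy and says nothing about how the RTS is implemented: a `QSem` with 7 permits stands in for the 7 capabilities, and 5 forked threads stand in for the 5 callers. `runCallers` is a hypothetical name.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Concurrent.QSem (newQSem, signalQSem, waitQSem)
import Control.Exception (bracket_)
import Control.Monad (forM)

-- Toy model: `permits` locks shared by `callers` threads.
-- Each caller takes a permit, does its "callback", and gives the permit back.
runCallers :: Int -> Int -> IO [Int]
runCallers permits callers = do
  sem <- newQSem permits
  dones <- forM [1 .. callers] $ \i -> do
    done <- newEmptyMVar
    _ <- forkIO $
      bracket_ (waitQSem sem) (signalQSem sem) $
        putMVar done i  -- stand-in for the callback body
    return done
  -- Collect results in fork order, waiting for each caller to finish.
  mapM takeMVar dones

main :: IO ()
main = do
  finished <- runCallers 7 5
  putStrLn ("callers that completed: " ++ show finished)
```

With 7 permits and 5 callers nobody ever has to wait for a permit. With 1 permit and 5 callers everything still completes, but the callers take turns, which is the -N1 situation described earlier.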