Runtime performance degradation for multi-threaded C FFI callback
marlowsd at gmail.com
Mon Jan 23 14:26:13 CET 2012
I'll need to analyse the program to see what's going on. There was a
small change to the scheduler between 7.2.1 and 7.2.2 that could
conceivably have made a difference in this scenario, but it was aimed at
fixing a bug rather than improving performance.

Another possibility is a difference in OS scheduling behaviour between
your setup and Daniel Fischer's. In microbenchmarks like this, it's
easy for a difference in OS scheduling behaviour to make a large
difference in performance if it happens consistently.
On 23/01/2012 12:49, John Lato wrote:
> Hi Simon,
> I'm not certain that your explanation matches what I observed.
> All of my tests were done on a 4-core machine, executing with "+RTS
> -N", which should be the same as "+RTS -N4" I believe.
> With 1 Haskell thread (the main thread) and 4 process threads (via
> pthreads), I saw a significant performance degradation compared to 5
> Haskell threads (main + 4 via forkIO) and 4 process threads. As I
> understand your explanation, if C callbacks are scheduled according to
> available capabilities, there should be no difference between these
> two cases.
> I observed this with GHC-7.2.1, however Daniel Fischer reported that,
> with ghc-7.2.2, he observed different behavior (which matches your
> explanation AFAICT). Is it possible that the scheduling of callbacks
> into Haskell changed between those versions?
> John L.
>> From: Simon Marlow<marlowsd at gmail.com>
>> Subject: Re: Runtime performance degradation for multi-threaded C FFI
>> To: Sanket Agrawal<sanket.agrawal at gmail.com>
>> Cc: glasgow-haskell-users<glasgow-haskell-users at haskell.org>
>> Message-ID:<4F1D2F4D.9050709 at gmail.com>
>> On 21/01/2012 15:35, Sanket Agrawal wrote:
>>> Hi Edward,
>>> I was just going to get back to you about it. I did find out that the
>>> issue was indeed one GHC thread dealing with 5 C threads for callback
>>> (1:5 mapping) - so, the C threads were blocking on callback waiting for
>>> the only GHC thread to be available. I updated the code to do 1:1
>>> mapping - 5 GHC threads for 5 C threads. That proved to be almost
>>> linearly scalable.
>> This is almost right, except that your callbacks are not waiting for a
>> GHC *thread*, but what we call a "capability", which is roughly speaking
>> "permission to execute Haskell code". The +RTS -N option chooses the
>> number of capabilities.
>> I expect that with -N1, your program is spending a lot of time just
>> switching between the different OS threads.
>> It's possible that we could make the runtime more flexible here. I
>> recently made it possible to modify the number of capabilities at
>> runtime, so it's conceivable that the runtime could automatically add
>> capabilities if it is being called from multiple OS threads.
>>> John Lato suggested the above approach two days back, but I didn't get
>>> to test the idea until now.
>>> It doesn't seem to matter whether the number of GHC threads is increased,
>>> if the mapping between GHC threads and C threads is not 1:1. I got 1:1
>>> mapping by doing forkIO for each C thread. Is it really possible to do
>>> 7:5 mapping (that is 7 GHC threads to choose from, for 5 C threads
>>> during callback)? I can't think of a way to do it. Not that I need it. I
>>> am just curious if that is possible.
>> Just think of +RTS -N7 as being 7 *locks*, not 7 threads. Then it makes
>> perfect sense to have 7 locks available for 5 threads.
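
The "N locks, not N threads" picture can be sketched as a toy model in C. This is not GHC's runtime code, only an illustration under stated assumptions: the pool of capabilities is modelled as a counting lock built from a pthread mutex and condition variable, each C thread "calls back into Haskell" by taking one unit from the pool and then returning it, and all names (cap_pool, acquire_cap, release_cap, run_model) are hypothetical.

```c
#include <pthread.h>

/* Hypothetical model of "+RTS -N<n>" as n locks. Not GHC's implementation. */
typedef struct {
    pthread_mutex_t mu;
    pthread_cond_t  cv;
    int free_caps;         /* capabilities currently available */
    int total_caps;        /* the modelled -N value */
    int peak_in_use;       /* most capabilities held at the same time */
    int calls_per_thread;  /* callbacks each C thread makes */
} cap_pool;

/* A callback entering "Haskell": block until a capability is free. */
static void acquire_cap(cap_pool *p) {
    pthread_mutex_lock(&p->mu);
    while (p->free_caps == 0)
        pthread_cond_wait(&p->cv, &p->mu);
    p->free_caps--;
    int in_use = p->total_caps - p->free_caps;
    if (in_use > p->peak_in_use)
        p->peak_in_use = in_use;
    pthread_mutex_unlock(&p->mu);
}

/* The callback returning to C: hand the capability back. */
static void release_cap(cap_pool *p) {
    pthread_mutex_lock(&p->mu);
    p->free_caps++;
    pthread_cond_signal(&p->cv);
    pthread_mutex_unlock(&p->mu);
}

static void *c_thread(void *arg) {
    cap_pool *p = arg;
    for (int i = 0; i < p->calls_per_thread; i++) {
        acquire_cap(p);
        /* the Haskell callback body would run here */
        release_cap(p);
    }
    return NULL;
}

/* Runs n_threads C threads against n_caps capabilities.
 * Returns the peak number of capabilities held at once, or -1 on error. */
int run_model(int n_caps, int n_threads, int calls_per_thread) {
    enum { MAX_THREADS = 64 };
    pthread_t tids[MAX_THREADS];
    cap_pool p;
    int started = 0;

    if (n_caps < 1 || n_threads < 1 || n_threads > MAX_THREADS)
        return -1;

    pthread_mutex_init(&p.mu, NULL);
    pthread_cond_init(&p.cv, NULL);
    p.free_caps = p.total_caps = n_caps;
    p.peak_in_use = 0;
    p.calls_per_thread = calls_per_thread;

    for (; started < n_threads; started++)
        if (pthread_create(&tids[started], NULL, c_thread, &p) != 0)
            break;
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);

    pthread_mutex_destroy(&p.mu);
    pthread_cond_destroy(&p.cv);
    return started == n_threads ? p.peak_in_use : -1;
}
```

In this model, run_model(1, 5, 1000) always reports a peak of 1, because five threads take turns on a single lock, which corresponds to the -N1 case. run_model(7, 5, 1000) never reports more than 5: seven locks for five threads is perfectly legal, and the two spare locks simply go unused. Build with -pthread.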