Increasing number of worker tasks in RTS (GHC 7.4.1) - how to debug?

Sun Feb 26 03:23:50 CET 2012

I have to take back what I said about the increase in worker tasks being
related to some Mac OS pthread bug. I can now reproduce the issue on Linux
(Red Hat x86_64) too (and cause a segmentation fault once in a while). So
it now seems the issue is due either to some interaction between the GHC
RTS and C pthread mutexes, or to a bug in my code.

I have created a simple test case that reproduces the increase in the
number of worker threads with each run of the Haskell timer thread (which
syncs with C pthreads). I have put the code on GitHub, with documentation
on how to reproduce the issue:
https://github.com/sanketr/cffitest

I would appreciate feedback on whether it is a bug in my code or a GHC bug
that needs to be reported.
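
For readers who do not want to open the repository, the shape of one timer
iteration (fork one Haskell thread per C producer, then wait for all of
them) can be sketched roughly as below. This is not the code from the
repository: `collectFrom` and `collectAll` are hypothetical names, and
`collectFrom` only sleeps as a stand-in for the blocking FFI call that would
take a producer's mutex, copy its data, and release the mutex.

```haskell
import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forM)

-- Hypothetical stand-in for the blocking FFI call into one C producer.
-- In the real program this would be a foreign import; here it only sleeps.
collectFrom :: Int -> IO Int
collectFrom producer = do
  threadDelay 1000        -- pretend to block inside C for about 1 ms
  return (producer * 10)  -- dummy data in place of the collected sample

-- One timer iteration: fork one Haskell thread per producer, then wait
-- for every result through a per-thread MVar, in producer order.
collectAll :: Int -> IO [Int]
collectAll n = do
  boxes <- forM [1 .. n] $ \i -> do
    box <- newEmptyMVar
    _ <- forkIO (collectFrom i >>= putMVar box)
    return box
  mapM takeMVar boxes

main :: IO ()
main = collectAll 8 >>= print
```

As far as I understand the threaded RTS, when a Haskell thread blocks in a
`safe` foreign call its capability is handed to another worker OS thread, so
the real FFI version of `collectFrom` is one plausible place to look when the
task count grows.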

On Sat, Feb 25, 2012 at 3:41 PM, Sanket Agrawal <sanket.agrawal at gmail.com> wrote:

> On further investigation, it seems to be very specific to Mac OS Lion (I
> am running 10.7.3). All tests were run with the -N3 option:
>
> - I can reliably crash the code with a seg fault or bus error if I create
> more than 8 threads in C FFI (each thread creates its own mutex, for 1-1
> coordination with the Haskell timer thread). My iMac has 4 processors. In
> gdb, I can see that the crash happened in __psynch_cvsignal (), which
> seems to be related to pthread mutexes.
>
> - If I increase the number of C FFI threads (and hence, pthread mutexes)
> to >=7, the number of tasks starts increasing. 8 is the maximum number of
> FFI threads in my testing at which the code runs without crashing. But it
> seems that there is some kind of pthread-mutex-related leak. The timer
> thread forks 8 parallel Haskell threads, each of which acquires the mutex
> of one C FFI thread. Though the function returns after acquiring the
> mutex, collecting data, and releasing the mutex, some of the threads seem
> to be marked as active by the GC because of a mutex memory leak. Exactly
> how, I don't know.
>
> - If I keep the number of C FFI threads to <=6, there is no memory leak.
> The number of tasks stays steady.
>
> So, it seems to be a pthread library issue (and not a GHC issue).
> Something to keep in mind when developing code on the Mac that involves
> mutex coordination with C FFI.
>
>
> On Sat, Feb 25, 2012 at 2:59 PM, Sanket Agrawal <sanket.agrawal at gmail.com> wrote:
>
>> I wrote a program that uses a timed thread to collect data from a C
>> producer (using FFI). The number of threads in the C producer is fixed
>> (they are created at init). One Haskell timer thread uses threadDelay to
>> run itself at a timed interval. When I look at the RTS output after
>> killing the program after a couple of timer iterations, I see the number
>> of worker tasks increasing with time.
>>
>>  For example, below is an output after 20 iterations of timer event:
>>
>>                       MUT time (elapsed)       GC time  (elapsed)
>>   Task  0 (worker) :    0.00s    (  0.00s)       0.00s    (  0.00s)
>>   Task  1 (worker) :    0.00s    (  0.00s)       0.00s    (  0.00s)
>>   .......output until task 37 snipped as it is same as task 1.......
>>   Task 38 (worker) :    0.07s    (  0.09s)       0.00s    (  0.00s)
>>   Task 39 (worker) :    0.07s    (  0.09s)       0.00s    (  0.00s)
>>   Task 40 (worker) :    0.18s    ( 10.20s)       0.00s    (  0.00s)
>>   Task 41 (worker) :    0.18s    ( 10.20s)       0.00s    (  0.00s)
>>   Task 42 (worker) :    0.18s    ( 10.20s)       0.00s    (  0.00s)
>>   Task 43 (worker) :    0.18s    ( 10.20s)       0.00s    (  0.00s)
>>   Task 44 (worker) :    0.52s    ( 10.74s)       0.00s    (  0.00s)
>>   Task 45 (worker) :    0.52s    ( 10.75s)       0.00s    (  0.00s)
>>   Task 46 (worker) :    0.52s    ( 10.75s)       0.00s    (  0.00s)
>>   Task 47 (bound)  :    0.00s    (  0.00s)       0.00s    (  0.00s)
>>
>>
>> After two iterations of timer event:
>>
>>                        MUT time (elapsed)       GC time  (elapsed)
>>   Task  0 (worker) :    0.00s    (  0.00s)       0.00s    (  0.00s)
>>   Task  1 (worker) :    0.00s    (  0.00s)       0.00s    (  0.00s)
>>   Task  2 (worker) :    0.07s    (  0.09s)       0.00s    (  0.00s)
>>   Task  3 (worker) :    0.07s    (  0.09s)       0.00s    (  0.00s)
>>   Task  4 (worker) :    0.16s    (  1.21s)       0.00s    (  0.00s)
>>   Task  5 (worker) :    0.16s    (  1.21s)       0.00s    (  0.00s)
>>   Task  6 (worker) :    0.16s    (  1.21s)       0.00s    (  0.00s)
>>   Task  7 (worker) :    0.16s    (  1.21s)       0.00s    (  0.00s)
>>   Task  8 (worker) :    0.48s    (  1.80s)       0.00s    (  0.00s)
>>   Task  9 (worker) :    0.48s    (  1.81s)       0.00s    (  0.00s)
>>   Task 10 (worker) :    0.48s    (  1.81s)       0.00s    (  0.00s)
>>   Task 11 (bound)  :    0.00s    (  0.00s)       0.00s    (  0.00s)
>>
>>
>> The Haskell code has one forkIO call to kick off the C FFI side, which
>> creates 8 threads. Runtime options are "+RTS -N3 -s". The timer event is
>> kicked off after the forkIO. It is of the form (pseudo-code):
>>
>>   timerevent <other arguments> time = run
>>     where
>>       run = threadDelay time >> <do some work> >> run
>>       <other variables defined for run function>
>>
>> I also wrote a simpler program using just the timer event (fork one timer
>> event, and run another timer event after that), but didn't see any tasks
>> in the RTS output.
>>
>> I tried searching the GHC pages for documentation on the RTS output, but
>> didn't find anything that could help me debug the above issue. I suspect
>> that the timer event is the root cause of the increasing number of tasks
>> (with all but the last 9 tasks idle; I guess 8 tasks belong to the C FFI
>> and one task to the timerevent thread), and hence of a memory leak.
>>
>> I would appreciate pointers on how to debug it. The timerevent does
>> forkIO a call to send the data collected from the C FFI to a db server,
>> but disabling that fork still results in the increasing number of tasks.
>> So the issue seems strongly correlated with the timer event, though I am
>> unable to reproduce it with a simpler version of the timer event (one
>> that removes the MVar sync/callback from the C FFI).
>>
>
>
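
The timer loop described in the quoted pseudo-code could be written along
these lines. This is only a sketch under assumptions: the argument order and
the iteration bound are made up (the original loops forever; the bound is
there so the example terminates), and the work is passed in as an IO action
in place of "do some work".

```haskell
import Control.Concurrent (threadDelay)
import Control.Monad (replicateM)
import Data.IORef (modifyIORef', newIORef, readIORef)

-- Sleep for `delayMicros` microseconds, run `work`, and repeat
-- `iterations` times, collecting the result of every run.
timerEvent :: Int -> Int -> IO a -> IO [a]
timerEvent iterations delayMicros work =
  replicateM iterations (threadDelay delayMicros >> work)

main :: IO ()
main = do
  counter <- newIORef (0 :: Int)
  -- The "work" here just counts timer firings.
  ticks <- timerEvent 3 10000 (modifyIORef' counter (+ 1) >> readIORef counter)
  print ticks
```

Running a bounded loop like this for different iteration counts and comparing
the task list printed by "+RTS -s" is one way to check whether the number of
tasks really grows with the number of timer firings.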