Increasing number of worker tasks in RTS (GHC 7.4.1) - how to debug?

Sanket Agrawal sanket.agrawal at gmail.com
Tue Feb 28 13:17:12 CET 2012


> What version of GHC is this?  I vaguely remember fixing something like
> this.
>
> The rule of thumb is: if you think it is a bug then report it, and we'll
> investigate further.

Simon, it is in GHC 7.4.1. Yes, you fixed bug #4262 ("GHC's runtime never
terminates worker threads"). I have filed bug report #5897, with code
to reproduce it.

This bug seems to be due to the MVar callback from C FFI. If I remove the
MVar callback, the number of workers stays constant. However, it happens
only if the C FFI thread count exceeds a threshold, 6 in my case. Also, I
can consistently crash the code with a segmentation fault/bus error on Mac
if I increase the number of C FFI threads. On Linux the crash happens too,
but not as often.
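
For readers unfamiliar with the pattern, here is a minimal sketch of what an
"MVar callback from C FFI" can look like. It is not the code attached to bug
report #5897. The names mkCallback, callCallback and newSignal are
hypothetical, and the "dynamic" import only stands in for the C thread that
would invoke the function pointer in the real program (which would also need
to be linked with -threaded).

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}

import Control.Concurrent.MVar (MVar, newEmptyMVar, takeMVar, tryPutMVar)
import Control.Monad (void)
import Foreign.Ptr (FunPtr, freeHaskellFunPtr)

-- The callback type a C thread would invoke: no arguments, no result.
type Callback = IO ()

-- Turn a Haskell action into a C function pointer.
foreign import ccall "wrapper"
  mkCallback :: Callback -> IO (FunPtr Callback)

-- Call a C function pointer from Haskell. Assumption: this replaces the
-- C worker thread, which would normally be the caller.
foreign import ccall "dynamic"
  callCallback :: FunPtr Callback -> IO ()

-- Build a callback that signals the given MVar. tryPutMVar never blocks,
-- so the calling thread is not held up if a signal is already pending.
newSignal :: MVar () -> IO (FunPtr Callback)
newSignal done = mkCallback (void (tryPutMVar done ()))

main :: IO ()
main = do
  done <- newEmptyMVar
  cb   <- newSignal done
  callCallback cb       -- in the real program a C FFI thread does this
  takeMVar done         -- the Haskell side wakes up on the signal
  freeHaskellFunPtr cb  -- release the pointer once C no longer uses it
  putStrLn "signalled"
```

Each call from a foreign thread into such a callback needs an RTS task to run
the Haskell code, which is why the callback is the first place to look when
the task count keeps growing.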

This seems to be a big bug in my opinion, because MVar callbacks are
important for coordination between GHC threads and C FFI threads. I can
work around it for now by keeping the number of C FFI threads below the
threshold that triggers the bug. I suspect this bug has been in GHC all
along, but wasn't discovered until now because it happens only if the C FFI
thread count crosses a threshold and an MVar callback is involved.



>
> Cheers,
>        Simon
>
>
>
>> On Sat, Feb 25, 2012 at 3:41 PM, Sanket Agrawal
>> <sanket.agrawal at gmail.com> wrote:
>>
>>    On further investigation, it seems to be very specific to Mac OS
>>    Lion (I am running 10.7.3) - all tests were with -N3 option:
>>
>>    - I can reliably crash the code with seg fault or bus error if I
>>    create more than 8 threads in C FFI (each thread creates its own
>>    mutex, for 1-1 coordination with Haskell timer thread). My iMac has
>>    4 processors. In gdb, I can see that the crash happened
>>    in __psynch_cvsignal () which seems to be related to pthread mutex.
>>
>>    - If I increase the number of C FFI threads (and hence, pthread
>>    mutexes) to >=7, the number of tasks starts increasing. 8 is the max
>>    number of FFI threads in my testing where the code runs without
>>    crashing. But it seems that there is some kind of pthread-mutex-
>>    related leak. What the timer thread does is fork 8 parallel
>>    Haskell threads to acquire the mutexes from each of the C FFI
>>    threads. Though the function returns after acquiring the mutex,
>>    collecting data, and releasing the mutex, some of the threads seem
>>    to be marked as active by GC because of a mutex memory leak.
>>    Exactly how, I don't know.
>>
>>    - If I keep the number of C FFI threads to <=6, there is no memory
>>    leak. The number of tasks stays steady.
>>
>>    So, it seems to be a pthread library issue (and not a GHC issue).
>>    Something to keep in mind when developing code on Mac that involves
>>    mutex coordination with C FFI.
>>
>>
>>    On Sat, Feb 25, 2012 at 2:59 PM, Sanket Agrawal
>>    <sanket.agrawal at gmail.com> wrote:
>>
>>        I wrote a program that uses a timed thread to collect data from
>>        a C producer (using FFI). The number of threads in the C producer
>>        is fixed (they are created at init). One Haskell timer thread uses
>>        threadDelay to run itself at a timed interval. When I look at the
>>        RTS output after killing the program after a couple of timer
>>        iterations, I see the number of worker tasks increasing with time.
>>
>>        For example, below is the output after 20 iterations of the timer
>>        event:
>>
>>                               MUT time (elapsed)       GC time  (elapsed)
>>           Task  0 (worker) :    0.00s    (  0.00s)       0.00s    (  0.00s)
>>           Task  1 (worker) :    0.00s    (  0.00s)       0.00s    (  0.00s)
>>           .......output until task 37 snipped as it is same as task 1.......
>>           Task 38 (worker) :    0.07s    (  0.09s)       0.00s    (  0.00s)
>>           Task 39 (worker) :    0.07s    (  0.09s)       0.00s    (  0.00s)
>>           Task 40 (worker) :    0.18s    ( 10.20s)       0.00s    (  0.00s)
>>           Task 41 (worker) :    0.18s    ( 10.20s)       0.00s    (  0.00s)
>>           Task 42 (worker) :    0.18s    ( 10.20s)       0.00s    (  0.00s)
>>           Task 43 (worker) :    0.18s    ( 10.20s)       0.00s    (  0.00s)
>>           Task 44 (worker) :    0.52s    ( 10.74s)       0.00s    (  0.00s)
>>           Task 45 (worker) :    0.52s    ( 10.75s)       0.00s    (  0.00s)
>>           Task 46 (worker) :    0.52s    ( 10.75s)       0.00s    (  0.00s)
>>           Task 47 (bound)  :    0.00s    (  0.00s)       0.00s    (  0.00s)
>>
>>
>>        After two iterations of timer event:
>>
>>                               MUT time (elapsed)       GC time  (elapsed)
>>           Task  0 (worker) :    0.00s    (  0.00s)       0.00s    (  0.00s)
>>           Task  1 (worker) :    0.00s    (  0.00s)       0.00s    (  0.00s)
>>           Task  2 (worker) :    0.07s    (  0.09s)       0.00s    (  0.00s)
>>           Task  3 (worker) :    0.07s    (  0.09s)       0.00s    (  0.00s)
>>           Task  4 (worker) :    0.16s    (  1.21s)       0.00s    (  0.00s)
>>           Task  5 (worker) :    0.16s    (  1.21s)       0.00s    (  0.00s)
>>           Task  6 (worker) :    0.16s    (  1.21s)       0.00s    (  0.00s)
>>           Task  7 (worker) :    0.16s    (  1.21s)       0.00s    (  0.00s)
>>           Task  8 (worker) :    0.48s    (  1.80s)       0.00s    (  0.00s)
>>           Task  9 (worker) :    0.48s    (  1.81s)       0.00s    (  0.00s)
>>           Task 10 (worker) :    0.48s    (  1.81s)       0.00s    (  0.00s)
>>           Task 11 (bound)  :    0.00s    (  0.00s)       0.00s    (  0.00s)
>>
>>
>>        Haskell code has one forkIO call to kick off C FFI - C FFI
>>        creates 8 threads. Runtime options are "+RTS -N3 -s". The timer
>>        event is kicked off after forkIO. It is of the form (pseudo-code):
>>
>>        timerevent <other arguments> time = run
>>          where run = threadDelay time >> <do some work> >> run
>>                <other variables defined for run function>
>>
>>        I also wrote a simpler code using just timer event (fork one
>>        timer event, and run another timer event after that), but didn't
>>        see any tasks in RTS output.
>>
>>        I tried searching GHC page for documentation on RTS output, but
>>        didn't find anything that could help me debug above issue. I
>>        suspect that timer event is the root cause of increasing number
>>        of tasks (with all but last 9 tasks idle -  I guess 8 tasks
>>        belong to C FFI, and one task to timerevent thread), and hence,
>>        memory leak.
>>
>>        I will appreciate pointers on how to debug it. The timerevent
>>        does forkIO a call to send collected data from C FFI to a db
>>        server, but disabling that fork still results in the issue of
>>        increasing number of tasks. So, it seems strongly correlated
>>        with timer event though I am unable to reproduce it with a
>>        simpler version of timer event (which removes mvar sync/callback
>>        from C FFI).
>
>
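
The timer loop given as pseudo-code in the first message, together with the
"fork one Haskell thread per C FFI thread and wait for all of them" step
from the second, could be sketched roughly as follows. This is an
illustration only, not the program from the thread: collectAll and the
dummy sources are hypothetical stand-ins for the mutex-protected reads from
the C producer, and the iteration count exists only so the example
terminates (the original loops forever).

```haskell
import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forM)

-- Run one collector per source in its own Haskell thread and wait for
-- all results. Each source stands in for "acquire the mutex of one C FFI
-- thread, copy its data, release the mutex". No exception handling: a
-- source that throws would leave its MVar empty and block the caller.
collectAll :: [IO a] -> IO [a]
collectAll sources = do
  boxes <- forM sources $ \src -> do
    box <- newEmptyMVar
    _   <- forkIO (src >>= putMVar box)
    return box
  mapM takeMVar boxes

-- Sleep, do the work, repeat, keeping the result of every iteration.
timerEvent :: Int -> Int -> IO a -> IO [a]
timerEvent iterations delayMicros work = go iterations
  where
    go n
      | n <= 0    = return []
      | otherwise = do
          threadDelay delayMicros
          r  <- work
          rs <- go (n - 1)
          return (r : rs)

main :: IO ()
main = do
  -- Eight dummy sources in place of the eight C FFI threads.
  let sources = [return i | i <- [1 .. 8 :: Int]]
  totals <- timerEvent 3 1000 (fmap sum (collectAll sources))
  print totals  -- prints [36,36,36]
```

Running a loop like this with "+RTS -N3 -s" and comparing the task list
after a few iterations and after many is the same check used in the output
quoted above.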