Random crashes with memory corruption symptoms

Harendra Kumar harendra.kumar at gmail.com
Sun Feb 2 22:25:41 UTC 2020


Hi,

While running a test-suite for the streaming library streamly I am
encountering a crash which seems to happen at random places at different
times. The common messages are:

* Segmentation fault: 11
* internal error: scavenge_mark_stack: unimplemented/strange closure type 24792696 @ 0x4200a623e0
* internal error: update_fwd: unknown/strange object  223743520

and several other such messages. Prima facie this looks like the memory is
getting corrupted/scribbled somehow. My first suspicion was that this could
be a problem in the streamly library code. But I have stripped the code
down to the bare minimum, and it contains no C FFI code and no poking at
memory through pointers.

My next suspicion was the hspec/QuickCheck testing code that is being used
in this test. I checked the hspec code to ensure that there is no C
code/pointer poking in any of the code involved. I found nothing there
either, and I am still working on stripping that code down further.

My suspicion is now moving more towards the GHC RTS. This issue only shows
up when the following conditions are met:

* hspec "parallel" combinator is used to run tests in parallel
* streamly concurrent code is being tested which can create many threads
* The GHC heap size is restricted to a small size (~32MB) using the
"-M32M" RTS option.
* It is consistently seen with GHC 8.6.5 as well as GHC 8.8.1

It never occurs when the heap size is not restricted. I have seen random
crashes before as well, with an "IO manager die" message, when using
concurrent networking IO with streamly. Since that was not easily
reproducible at the time, I stopped chasing it. But now it looks like that
issue might also be a manifestation of the same underlying problem.

My guess is that it could be something in the RTS concurrency/threading
code. Let me know if the symptoms ring a bell or if you can point to
something specific based on them. Also, what are the usual
tools/methods/debugging aids/flags to debug such issues in GHC? If this is
not a GHC issue, what are the possible ways in which such a problem can be
induced by application code?

Meanwhile, I am also trying to simplify the reproducing code further to
remove other factors as much as possible. The current code is at
https://github.com/composewell/streamly on the ghc-segfault branch. Run

    $ while true; do cabal run properties || break; done

in the shell and, if you are lucky, it may crash soon. The test code is in
"test/Prop.hs":
https://github.com/composewell/streamly/blob/ghc-segfault/test/Prop.hs
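
The retry loop above can also be written as a small shell function that
reports which run failed and with what exit status. This is only a sketch:
run_until_failure is a made-up helper name, not something from the streamly
repository or from cabal.

```shell
# Re-run a command until it exits with a non-zero status, then report
# the run number and the status. run_until_failure is a hypothetical
# helper name used only for illustration.
run_until_failure() {
    i=0
    while :; do
        i=$((i + 1))
        "$@"                      # run the command exactly as given
        status=$?
        if [ "$status" -ne 0 ]; then
            echo "run $i failed with exit status $status"
            return "$status"
        fi
    done
}
```

It would be invoked as "run_until_failure cabal run properties", which keeps
the same command as the loop above and only adds the failure report.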

-harendra


More information about the ghc-devs mailing list