Nondeterministic Failure on aarch64 with -jn, n > 1

Ben Gamari ben at smart-cactus.org
Fri Jul 27 13:51:59 UTC 2018


Travis Whitaker <pi.boy.travis at gmail.com> writes:

> Hello GHC Devs,
>
> It seems to me that GHC is rather broken on aarch64, at least since 8.2.1
> (and at least on the machines I have access to). I first noticed this issue
> with Nixpkgs (https://github.com/NixOS/nixpkgs/issues/40301), so to check
> that this isn't some Nixpkgs idiosyncrasy I went ahead and built my own GHC
> 8.4.3 for aarch64 (there's no binary release at
> https://www.haskell.org/ghc/download_ghc_8_4_3.html to try, but perhaps
> I've missed something).
>
> It seems the only Nix idiosyncrasy was passing "--ghc-option=-j${cores}" to
> "./Setup.hs configure". The issue is triggered by using '-jn' for any n
> greater than one when building any non-trivial package, but I've found
> hscolour 1.24.4 reproduces it very reliably (perhaps because there are
> opportunities for parallelism early in its module dependency graph?). GHC
> very often (although not always) will fail with one of:
>
> - Segmentation fault.
> - Bus fault
> - <no location info>: error:
>     ghc: panic! (the 'impossible' happened)
>   (GHC version 8.4.3 for aarch64-unknown-linux):
>         Binary.UserData: no put_binding_name
>
> - ghc: internal error: MUT_VAR_CLEAN object entered!
>     (GHC version 8.4.3 for aarch64_unknown_linux)
>     Please report this as a GHC bug:  http://www.haskell.org/ghc/reportabug
> Aborted (core dumped)
>
Ugh, that is awful.

> The fix, excruciating as it may be on already slow arm machines, is to use
> '-j1'. This issue seems present on each GHC release since 8.2.1 (although I
> haven't tried HEAD yet). I haven't noticed any issues with any other
> concurrent Haskell programs on aarch64.
>
> There are some umbrella bugs for aarch64 in Trac, so I wanted to ask here
> before filing a ticket. Has anyone else noticed this behavior on aarch64?
> What's more, are there any tips for using GDB to hunt down synchronization
> issues in GHC?
>
Definitely open a new ticket.

The methodology for tracking down issues like this is quite
case-specific but I do have some general recommendations: On x86-64 I
use rr [1], which is an invaluable tool. Sadly this isn't an option on
AArch64 AFAIK. I also have some gdb extensions to take much of the
monotony away from inspecting GHC's heap and internal data structures
[2]. I've not used them on AArch64 so there may be a few compatibility
issues but I suspect they wouldn't be hard to fix.

I know it may be hard in this case but I would at least try to reduce
the size of the failing program to something that fits in less than a
few hundred lines. Low-level debugging is hard enough when you can keep
the program in your head; debugging all of GHC this way is possible but
much harder. Given that this appears to be threading-specific, I would
also pay particular attention to GHC's and base's use of barriers and
atomics. It's possible that we are just missing a barrier somewhere.
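
While shrinking the reproducer it helps to know how often each candidate
still fails, and in which way, since the failure is not deterministic. The
following is a minimal Python sketch of such a harness. The names classify
and tally are made up for illustration, and the labels are taken from the
failure messages quoted above. It assumes the command invokes ghc directly,
so that a fatal signal shows up as a negative return code; a wrapper such
as a shell or Setup.hs would hide that. It also assumes the command starts
from a clean build each time.

```python
import signal
import subprocess
from collections import Counter


def classify(returncode, stderr):
    """Map one run's exit status and stderr text to a coarse label."""
    if returncode == 0:
        return "ok"
    # subprocess reports death by signal N as return code -N.
    if returncode == -signal.SIGSEGV:
        return "segfault"
    if returncode == -signal.SIGBUS:
        return "bus error"
    # The remaining cases are recognised by the text GHC prints.
    if "panic!" in stderr:
        return "ghc panic"
    if "internal error" in stderr:
        return "rts internal error"
    return "other failure"


def tally(cmd, runs):
    """Run cmd (an argument list) `runs` times and count each outcome."""
    counts = Counter()
    for _ in range(runs):
        proc = subprocess.run(cmd, capture_output=True, text=True)
        counts[classify(proc.returncode, proc.stderr)] += 1
    return counts
```

For example, tally(["ghc", "-j4", "-fforce-recomp", "Main.hs"], 20) would
report how many of 20 parallel builds of a hypothetical Main.hs succeeded
and how the others died. A reduction step is only worth keeping if the
failure count stays well above zero.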

Finally, you might quickly try building 8.0 to see whether bisection is
a possibility. It would be a slow process, given the speed of the
hardware involved, but ultimately it can be much more time-efficient
once you have it set up, since you can replace human debugging time (a
very finite commodity) with computation.
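
As a rough illustration of how such a bisection could be automated, here is
a Python sketch of a predicate for `git bisect run`. The exit codes are the
ones git documents (0 for good, 125 for "cannot test, skip", any other code
from 1 to 127 for bad). Everything else is an assumption: the names verdict
and attempts_needed, the build and test commands you pass in, and the
per-run failure probability, which would have to be estimated from
experience. Because the failure is nondeterministic, a commit is only
called good after several clean runs.

```python
import math
import subprocess

GOOD, BAD, SKIP = 0, 1, 125  # exit codes understood by `git bisect run`


def attempts_needed(fail_prob, miss_prob):
    """Runs needed so that a bad commit passes every run with
    probability at most miss_prob, given it fails each run
    independently with probability fail_prob."""
    return math.ceil(math.log(miss_prob) / math.log(1.0 - fail_prob))


def verdict(build_cmd, test_cmd, attempts):
    """Return the exit code to hand back to `git bisect run`."""
    if subprocess.run(build_cmd).returncode != 0:
        return SKIP  # the commit does not build, so it cannot be judged
    for _ in range(attempts):
        if subprocess.run(test_cmd).returncode != 0:
            return BAD  # a single failure is enough to mark it bad
    return GOOD  # only "probably good", since the bug is intermittent
```

A small wrapper script would end with sys.exit(verdict(build, test, n)) and
be started with `git bisect run python3 check.py`, where check.py is a
hypothetical file name. If a bad compiler fails about half the time,
attempts_needed(0.5, 0.001) gives 10 runs per commit.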

Good luck and let us know if you get stuck,

- Ben


[1] http://rr-project.org/
[2] https://github.com/bgamari/ghc-utils/tree/master/gdb