Nondeterministic Failure on aarch64 with -jn, n > 1

Travis Whitaker pi.boy.travis at gmail.com
Fri Jul 27 20:18:27 UTC 2018


Thanks so much for the pointers, Ben.

I opened a ticket here: https://ghc.haskell.org/trac/ghc/ticket/15449

On Fri, Jul 27, 2018 at 6:51 AM, Ben Gamari <ben at smart-cactus.org> wrote:

> Travis Whitaker <pi.boy.travis at gmail.com> writes:
>
> > Hello GHC Devs,
> >
> > It seems to me that GHC is rather broken on aarch64, at least since 8.2.1
> > (and at least on the machines I have access to). I first noticed this
> > issue with Nixpkgs (https://github.com/NixOS/nixpkgs/issues/40301), so to
> > check that this isn't some Nixpkgs idiosyncrasy I went ahead and built my
> > own GHC 8.4.3 for aarch64 (there's no binary release at
> > https://www.haskell.org/ghc/download_ghc_8_4_3.html to try, but perhaps
> > I've missed something).
> >
> > It seems the only Nix idiosyncrasy was passing "--ghc-option=-j${cores}"
> > to "./Setup.hs configure". The issue is triggered by using '-jn' for any
> > n greater than one when building any non-trivial package, but I've found
> > that hscolour 1.24.4 reproduces it very reliably (perhaps because there
> > are opportunities for parallelism early in its module dependency graph?).
> > GHC very often (although not always) will fail with one of:
> >
> > - Segmentation fault.
> > - Bus fault
> > - <no location info>: error:
> >     ghc: panic! (the 'impossible' happened)
> >   (GHC version 8.4.3 for aarch64-unknown-linux):
> >         Binary.UserData: no put_binding_name
> >
> > - ghc: internal error: MUT_VAR_CLEAN object entered!
> >     (GHC version 8.4.3 for aarch64_unknown_linux)
> >     Please report this as a GHC bug:  http://www.haskell.org/ghc/reportabug
> > Aborted (core dumped)
> >
> Ugh, that is awful.
>
> > The workaround, excruciating as it may be on already slow ARM machines,
> > is to use '-j1'. This issue seems to be present in each GHC release since
> > 8.2.1 (although I haven't tried HEAD yet). I haven't noticed any issues
> > with any other concurrent Haskell programs on aarch64.
> >
> > There are some umbrella bugs for aarch64 in Trac, so I wanted to ask here
> > before filing a ticket. Has anyone else noticed this behavior on aarch64?
> > What's more, are there any tips for using GDB to hunt down
> > synchronization issues in GHC?
> >
> Definitely open a new ticket.
>
> The methodology for tracking down issues like this is quite
> case-specific but I do have some general recommendations: On x86-64 I
> use rr [1], which is an invaluable tool. Sadly this isn't an option on
> AArch64 AFAIK. I also have some gdb extensions to take much of the
> monotony away from inspecting GHC's heap and internal data structures
> [2]. I've not used them on AArch64 so there may be a few compatibility
> issues but I suspect they wouldn't be hard to fix.
>
> I know it may be hard in this case but I would at least try to reduce
> the size of the failing program to something that fits in less than a
> few hundred lines. Low-level debugging is hard enough when you can keep
> the program in your head; debugging all of GHC this way is possible but
> much harder. Given that this appears to be threading-specific, I would
> also pay particular attention to GHC's and base's use of barriers and
> atomics. It's possible that we are just missing a barrier somewhere.
>
> Finally, you might quickly try building 8.0 to see whether bisection is
> a possibility. It would be a slow process, given the speed of the
> hardware involved, but ultimately it can be much more time efficient
> once you have it set up, since you can replace human debugging time (a
> very finite commodity) with computation.
>
> Good luck and let us know if you get stuck,
>
> - Ben
>
>
> [1] http://rr-project.org/
> [2] https://github.com/bgamari/ghc-utils/tree/master/gdb
>
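For anyone trying to reproduce this or to automate the bisection Ben suggests, a small driver along the following lines might help. This is only a sketch under some assumptions: the build is wrapped in a single command that exits non-zero on failure, and the names `count_failures` and `bisect_exit_code` are made up for illustration. Since the failure is nondeterministic, each commit is built several times and is treated as bad if any attempt fails.

```python
import subprocess
import sys


def count_failures(cmd, attempts):
    """Run cmd `attempts` times and return how many runs exited non-zero.

    subprocess reports a process killed by a signal (e.g. SIGSEGV or
    SIGBUS) with a negative returncode, so those count as failures too.
    """
    failures = 0
    for _ in range(attempts):
        result = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
        if result.returncode != 0:
            failures += 1
    return failures


def bisect_exit_code(failures):
    """Map a failure count to an exit status for `git bisect run`:
    0 marks the commit good, 1 marks it bad."""
    return 1 if failures > 0 else 0


# Usage: python3 flaky.py ATTEMPTS CMD [ARGS...]
if __name__ == "__main__" and len(sys.argv) > 2:
    attempts = int(sys.argv[1])
    failures = count_failures(sys.argv[2:], attempts)
    print(f"{failures}/{attempts} runs failed")
    sys.exit(bisect_exit_code(failures))
```

It could be driven with something like `git bisect run python3 flaky.py 10 ./build.sh`, where `flaky.py` and `build.sh` are hypothetical names for this script and for a wrapper around the '-jn' build. A commit that passes every attempt is only probably good, so the attempt count should be chosen with the observed failure rate in mind.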