setNumCapabilities 257 crashes a program

Artem Leshchev matshch at avride.ai
Fri Dec 6 14:51:47 UTC 2024


Hi,

I was going to report this bug on GitLab as the wiki suggests, but that
requires my account to be manually approved, so I am writing it up
here instead. (If you can approve "matshch" on GitLab, I would
appreciate it.)

Consider the following program:

    import GHC.Conc
    main = setNumCapabilities 257

Compiling it with a recent GHC and running it sometimes causes a
segmentation fault, depending on the compiler version and the
environment. Here is a setup that reproduces the crash for me every time:

    $ printf "import GHC.Conc\nmain = setNumCapabilities 257" |
        docker run -i --rm haskell:9.10.1-bullseye runghc; echo $?
    139

I first discovered this issue in Nix builds, and the following setup
also crashes every time for me (it uses GHC 9.6.6 under the hood):

    $ nix run nixpkgs/b681065d0919f7eb5309a93cea2cfa84dec9aa88#ghc -- \
        -threaded conc.hs; ./conc
    [1]    311106 segmentation fault (core dumped)  ./conc

For this failure I have a stack trace showing that the crash happens
somewhere in the RTS:

    Program terminated with signal SIGSEGV, Segmentation fault.
    #0  0x000000000049d34f in assignNurseriesToCapabilities ()
    [Current thread is 1 (LWP 311106)]
    (gdb) bt
    #0  0x000000000049d34f in assignNurseriesToCapabilities ()
    #1  0x000000000049d96c in storageAddCapabilities ()
    #2  0x000000000047eaea in setNumCapabilities ()
    #3  0x000000000041042d in base_GHCziConcziSync_setNumCapabilities1_info ()
    #4  0x0000000000000000 in ?? ()

I have also tried reproducing it with the GHCup builds of GHC and found
that they need more capabilities to crash. If you bump the
setNumCapabilities argument to 259, the program crashes under both ghc
and runghc for versions 9.6.6, 9.8.4 and 9.10.1. I have also seen
cases (with 258 capabilities, IIRC) where it does not crash on every
launch, and I noticed that ghci and runghc look more susceptible to
the problem.
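
Since the threshold seems to differ between builds, a variant of the
program that takes the count from the command line may make it easier
to probe. This is only a sketch: the parseCount helper and the default
of 257 are my own choices for illustration, and it assumes the program
is built with -threaded as in the Nix example above.

```haskell
import GHC.Conc (getNumCapabilities, setNumCapabilities)
import System.Environment (getArgs)
import Text.Read (readMaybe)

-- Hypothetical helper: use the first argument if it parses as a
-- positive Int, otherwise fall back to 257 (the value from the
-- original reproducer).
parseCount :: [String] -> Int
parseCount (arg : _)
  | Just n <- readMaybe arg, n > 0 = n
parseCount _ = 257

main :: IO ()
main = do
  n <- parseCount <$> getArgs
  -- On an affected setup the crash is expected inside this call.
  setNumCapabilities n
  caps <- getNumCapabilities
  putStrLn ("capabilities now: " ++ show caps)
```

Running it as ./conc 257, ./conc 258, ./conc 259 and so on should show
where a given build starts to fail.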

I first ran into this while upgrading our platform to NixOS 24.11:
builds were failing on hedgehog tests. As far as I could tell, hedgehog
saw that our build machine has 256 cores and planned to run 256
workers, so it set the number of capabilities to the number of cores
+ 2 (258 in our case) to leave some room for IO threads, and then
crashed as shown above.
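
As a possible stopgap on the caller's side, the "cores + 2" value can
be clamped before it reaches setNumCapabilities. The sketch below is
not hedgehog's actual code. The clampCaps helper and the limit of 256
are assumptions of mine: 256 is simply below every value I have seen
crash, and I have not confirmed that it is a real RTS limit.

```haskell
import GHC.Conc (getNumCapabilities, getNumProcessors, setNumCapabilities)

-- Assumed upper bound, chosen only because every crash above used a
-- larger value. Not a documented RTS constant.
maxCaps :: Int
maxCaps = 256

-- Hypothetical helper: "cores + 2", kept within [1, maxCaps].
clampCaps :: Int -> Int
clampCaps cores = max 1 (min maxCaps (cores + 2))

main :: IO ()
main = do
  cores <- getNumProcessors
  let wanted = clampCaps cores
  setNumCapabilities wanted
  actual <- getNumCapabilities
  putStrLn ("cores=" ++ show cores
            ++ " requested=" ++ show wanted
            ++ " actual=" ++ show actual)
```

On our 256-core machine this would request 256 capabilities instead
of 258.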

-- 
Artem Leshchev


More information about the ghc-devs mailing list