[GHC] #12690: Segmentation fault in GHC runtime system under low memory with USE_LARGE_ADDRESS_SPACE

Tue Oct 11 17:44:19 UTC 2016

#12690: Segmentation fault in GHC runtime system under low memory with
USE_LARGE_ADDRESS_SPACE
--------------------------------------+----------------------------------
           Reporter:  pggiarrusso     |             Owner:
               Type:  bug             |            Status:  new
           Priority:  high            |         Milestone:
          Component:  Runtime System  |           Version:  8.0.1
           Keywords:                  |  Operating System:  Linux
       Architecture:  x86_64 (amd64)  |   Type of failure:  Runtime crash
          Test Case:                  |        Blocked By:
           Blocking:                  |   Related Tickets:
Differential Rev(s):                  |         Wiki Page:
--------------------------------------+----------------------------------
 I have here a ~600MB core dump on GHC 8.0.1 x86_64 linux triggered under
 low memory (1GB RAM total, no swap, trying to build aeson with
 optimizations), with a stacktrace blaming the GHC runtime system. Based on
 code review, it appears that when requesting memory from the OS fails, the
 error is ignored and GHC segfaults as soon as it writes to the newly
 "allocated" memory.

 == Diagnosis ==

 My diagnosis is that `osCommitMemory` calls the potentially failing
 `my_mmap` but returns void, so a failure is never propagated; it should
 probably either use `my_mmap_or_barf` or handle the failure some other way:

 https://github.com/ghc/ghc/blob/a6111b8cc14a5dc019e2613f6f634dec4eb57a8a/rts/posix/OSMem.c#L522-L525
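
 As a rough illustration of the kind of check I mean, here is a minimal
 standalone sketch. It is not the RTS code: `commit_checked` and
 `commit_or_die` are hypothetical names, and committing via `mmap` with
 `MAP_FIXED` over a reserved range is an assumption made for the example.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical helper (not the RTS code): make `size` bytes at `at`
 * readable and writable inside an already reserved address range.
 * Returns 0 on success and -1 on failure, leaving errno set by mmap. */
int commit_checked(void *at, size_t size)
{
    assert(size > 0);
    void *r = mmap(at, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    return r == MAP_FAILED ? -1 : 0;
}

/* Hypothetical wrapper: abort with a message instead of returning, so
 * the caller can never go on to write to memory that was not committed. */
void commit_or_die(void *at, size_t size)
{
    if (commit_checked(at, size) != 0) {
        fprintf(stderr, "cannot commit %zu bytes at %p: %s\n",
                size, at, strerror(errno));
        exit(EXIT_FAILURE);
    }
}
```

 With a check like this, a failed commit would be reported at the point of
 failure instead of surfacing later as a segfault on the first write.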

 However, while I'm convinced that this is a bug, I'm only mostly confident
 that it's the culprit, since the segfault happens later; I've only looked
 for bugs in the handling of allocation failures in the code that runs
 shortly before the crash.

 == Raw data ==
 The core dump is on a VM belonging to the reporter of
 https://github.com/commercialhaskell/stack/issues/2575,
 https://github.com/jiakai0419, who graciously gave me access to the VM
 and took the trouble to report the problem in the first place; I'd like
 to thank her.
 I'd like to avoid copying the dump here unless strictly needed.

 == Reproduction ==
 The bug is deterministic on that machine, but it requires mmap to fail
 without triggering an OOM, so I'm not 100% sure how easy it is to
 reproduce. It should be easy on a VM with 1GB RAM and 1 core.

 Here's the output of `free`:
 {{{
 $ free
                total        used        free      shared  buff/cache   available
 Mem:        1015352       95864       71396       51244      848092      733132
 Swap:             0           0           0
 }}}
 Instructions to produce a dump. First, as root:
 {{{
 echo /tmp/core > /proc/sys/kernel/core_pattern
 }}}
 This is needed because otherwise the core file is written to a temporary
 folder that stack removes. Then, as a non-root user (root has privileged
 access to memory), install stack and run:
 {{{
 ulimit -c unlimited
 git clone https://github.com/jiakai0419/snowflake.git
 cd snowflake
 stack build --verbose
 }}}

 Below is some raw data from the core dump. In short, an initializer for
 newly allocated data segfaults, and gdb reports that the address it's
 trying to write to (bd->start) can't be accessed, even though it's lower
 than mblock_high_watermark and so should have been committed (mmap'ed) by
 getFreshMBlocks. (I don't understand getReusableMBlocks in detail, but it
 also calls `osCommitMemory`, so I guess the same invariant holds there.)
 The existence of mblock_high_watermark (among other reasons) should
 confirm that this build is using USE_LARGE_ADDRESS_SPACE. If this
 diagnosis is correct, the bug has existed since USE_LARGE_ADDRESS_SPACE
 was introduced in
 https://github.com/ghc/ghc/commit/0d1a8d09f452977aadef7897aa12a8d41c7a4af0,
 so it's a regression in 8.0.1.
 {{{
 (gdb) bt
 #0  initMBlock (mblock=0x225c00000) at rts/sm/BlockAlloc.c:676
 #1  alloc_mega_group (mblocks=mblocks@entry=1) at rts/sm/BlockAlloc.c:328
 #2  0x00007fe650438d98 in allocGroup (n=1) at rts/sm/BlockAlloc.c:379
 #3  allocGroup (n=1) at rts/sm/BlockAlloc.c:336
 #4  0x00007fe65043ed55 in allocBlock_sync () at rts/sm/GCUtils.c:38
 #5  0x00007fe65043ef7d in alloc_todo_block (ws=ws@entry=0x13196b0, size=size@entry=3) at rts/sm/GCUtils.c:330
 #6  0x00007fe65043f13a in todo_block_full (size=size@entry=3, ws=0x13196b0) at rts/sm/GCUtils.c:292
 #7  0x00007fe65041baed in alloc_for_copy (gen_no=<optimized out>, size=3) at rts/sm/Evac.c:81
 #8  copy_tag (tag=2, gen_no=<optimized out>, size=3, src=0x21b353868, info=<optimized out>, p=0x225b13f10) at rts/sm/Evac.c:99
 #9  evacuate1 (p=p@entry=0x225b13f10) at rts/sm/Evac.c:596
 #10 0x00007fe65041d65c in scavenge_block1 (bd=0x225b004c0) at rts/sm/Scav.c:570
 #11 0x00007fe6504415dc in scavenge_find_work () at rts/sm/Scav.c:2040
 #12 scavenge_loop1 () at rts/sm/Scav.c:2103
 #13 0x00007fe65043cee2 in scavenge_until_all_done () at rts/sm/GC.c:968
 #14 0x00007fe65043d75a in GarbageCollect (collect_gen=collect_gen@entry=1, do_heap_census=do_heap_census@entry=rtsFalse, gc_type=gc_type@entry=2, cap=cap@entry=0x7fe650669a80 <MainCapability>) at rts/sm/GC.c:403
 #15 0x00007fe65042ff81 in scheduleDoGC (force_major=rtsFalse, task=0x13228a0, pcap=<optimized out>) at rts/Schedule.c:1652
 #16 scheduleDoGC (pcap=<optimized out>, task=0x13228a0, force_major=rtsFalse) at rts/Schedule.c:1433
 #17 0x00007fe650430c1a in schedule (initialCapability=initialCapability@entry=0x7fe650669a80 <MainCapability>, task=task@entry=0x13228a0) at rts/Schedule.c:551
 #18 0x00007fe650431cf8 in scheduleWaitThread (tso=0x20000b388, ret=ret@entry=0x0, pcap=pcap@entry=0x7ffc4ee22c08) at rts/Schedule.c:2361
 #19 0x00007fe65042c528 in rts_evalLazyIO (cap=cap@entry=0x7ffc4ee22c08, p=p@entry=0x7895c0, ret=ret@entry=0x0) at rts/RtsAPI.c:500
 #20 0x00007fe65042e3a7 in hs_main (argc=96, argv=0x7ffc4ee22dc8, main_closure=0x7895c0, rts_config=...) at rts/RtsMain.c:64
 #21 0x0000000000428414 in main ()
 (gdb) info locals
 bd = 0x225c00100
 block = 0x225c04000 <Address 0x225c04000 out of bounds>
 (gdb) print/x mblock_high_watermark
 $11 = 0x225d00000
 (gdb) print bd->start
 Cannot access memory at address 0x225c00100
 (gdb) print *bd
 Cannot access memory at address 0x225c00100
 }}}
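
 Note that the faulting megablock (0x225c00000) lies exactly 0x100000 bytes
 (1 MiB, the megablock size) below the watermark (0x225d00000), so it
 appears to be the most recently handed-out megablock. The invariant I'm
 relying on, that every address below the watermark has been committed,
 can be sketched as follows. This is only a simplified model and not the
 RTS code: `Region`, `region_reserve`, `region_fresh_chunk` and
 `CHUNK_SIZE` are hypothetical names, and the key point is that the
 watermark moves only after the commit is known to have succeeded.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#define CHUNK_SIZE ((size_t)1 << 20)   /* 1 MiB, like a megablock */

/* Hypothetical model of a reserved address range that is committed
 * chunk by chunk. Invariant: [base, high_watermark) is committed. */
typedef struct {
    char *base;
    char *high_watermark;
    char *limit;
} Region;

/* Reserve address space only; nothing is readable or writable yet. */
int region_reserve(Region *r, size_t chunks)
{
    void *p = mmap(NULL, chunks * CHUNK_SIZE, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (p == MAP_FAILED) {
        return -1;
    }
    r->base = p;
    r->high_watermark = r->base;
    r->limit = r->base + chunks * CHUNK_SIZE;
    return 0;
}

/* Commit the next chunk and hand it out, or return NULL on failure. */
void *region_fresh_chunk(Region *r)
{
    assert(r->high_watermark <= r->limit);
    if ((size_t)(r->limit - r->high_watermark) < CHUNK_SIZE) {
        return NULL;                   /* reserved range exhausted */
    }
    char *chunk = r->high_watermark;
    void *p = mmap(chunk, CHUNK_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    if (p == MAP_FAILED) {
        return NULL;                   /* commit failed: watermark unchanged */
    }
    r->high_watermark = chunk + CHUNK_SIZE;
    return chunk;
}
```

 If the commit result were ignored and the watermark advanced anyway, a
 caller would receive an address below the watermark that faults on first
 write, which matches what the dump shows.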

--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/12690>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler