[GHC] #7919: Heap corruption (segfault) from large 'let' expression

Fri May 17 22:31:16 CEST 2013

#7919: Heap corruption (segfault) from large 'let' expression
--------------------------+-------------------------------------------------
Reporter:  duncan         |          Owner:                
    Type:  bug            |         Status:  new           
Priority:  normal         |      Component:  Runtime System
 Version:  7.6.3          |       Keywords:                
      Os:  Linux          |   Architecture:  x86_64 (amd64)
 Failure:  Runtime crash  |      Blockedby:                
Blocking:                 |        Related:                
--------------------------+-------------------------------------------------
 The attached test program reliably triggers an assertion in the storage
 manager with the `-debug` rts.

 {{{
 LargeUse: internal error: ASSERTION FAILED: file rts/sm/GCUtils.c, line
 208

     (GHC version 7.6.3 for x86_64_unknown_linux)
     Please report this as a GHC bug:
 http://www.haskell.org/ghc/reportabug
 }}}

 This behaviour is reproducible with many recent ghc versions (tried 7.6.3,
 7.4.2, 6.12.3) and all fail at the same assertion when using the `-debug`
 rts. (Without `-debug` we get a more random variety of segfaults and GC
 errors.)

 It looks like a pretty clear case of heap corruption. I'll explain why...

 The test program uses TH to generate a program that looks like this:
 {{{
 data Large = Large Int Int ... -- 512 non-strict Int fields

 test =
   let step (Large i1 i2 ... i512) =
         let j1 = i1 + i4
             j2 = i2 + i7
             ...
             j511 = i511 + i510
             j512 = i512 + i1
          in Large j1 j2 ... j512

    in runSteps step 100000 (Large 1 1 ... 1)

 -- basically an unfoldr:
 runSteps :: (state -> (state, Int)) -> Int -> state -> [Int]
 runSteps f n i | n <= 0    = []
                | otherwise = case f i of
                                (i', r) -> r : runSteps f (n - 1) i'
 }}}

 We use TH to generate this program, and we use a "size" parameter that
 determines size of the data constructor (and corresponding letrec). This
 makes it easy to find the size threshold where it fails.

 For small sizes this program works fine, and for larger values it triggers
 the assert. With ghc 7.6.3 on a x86-64 machine, the magic threshold is
 511: that is, the program works fine with size 510 and hits the assertion
 at size 511. This is suspiciously close to 512. And of course on a 64bit
 machine 512 * 8 is 4k, which is the storage manager's block size. And the
 failing assertion is in a bit of the storage manager that is dealing with
 blocks...

 {{{
         // If this block does not have enough space to allocate the
         // current object, but it also doesn't have any work to push, then
         // push it on to the scanned list.  It cannot be empty, because
         // then there would be enough room to copy the current object.
         if (bd->u.scan == bd->free)
         {
             ASSERT(bd->free != bd->start);
             push_scanned_block(bd, ws);
         }
 }}}

 So it looks very much like we have a situation where something is writing
 over the end of a block and messing up the SM's data structures.

 But, it is not nearly as simple as the data constructor being too big. We
 can demonstrate other programs that use much larger data constructors
 without any problem at all. Our suspicion falls on the big letrec.

 Indeed with this program if we change the data constructor to have strict
 fields then it no longer fails, and we can run it with much larger data
 constructor sizes. What would be different between strict and non-strict
 fields here? Well, one observation is that when it is strict then ghc can
 (and does) turn the code into a big cascade of case expressions, while
 when it is non-strict then the STG code is all 'let's.

 {{{
         case tpl_s6jQ of tpl_s6Ak {
           __DEFAULT ->
               case tpl_s6jS of tpl_s6Al {
                 __DEFAULT ->
                     case tpl_s6jU of tpl_s6Am {
     -- etc for all 500+ elements
 }}}
 versus
 {{{
               let {
                 sat_s5UK :: GHC.Types.Int
                 [LclId] =
                     \u [] GHC.Num.$fNumInt_$c+ i511_s5Ly i1_s5E9; } in
               let {
                 sat_s62X :: GHC.Types.Int
                 [LclId] =
                     \u [] GHC.Num.$fNumInt_$c+ i510_s5R2 i509_s5UG; } in
               let {
                 sat_s62W :: GHC.Types.Int
                 [LclId] =
                     \u [] GHC.Num.$fNumInt_$c+ i509_s5UG i506_s5UC; } in
     -- etc for all 500+ elements
 }}}

 Note also, that it is nothing to do with the obvious space leak here. If
 we modify the code to generate an NFData instance and to use deepseq at
 each iteration then we eliminate the space leak, but we keep the big stg
 'let', and it still fails.

-- 
Ticket URL: <http://hackage.haskell.org/trac/ghc/ticket/7919>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler