[Haskell-cafe] GHC, odd concurrency space leak

Jesper Louis Andersen jesper.louis.andersen at gmail.com
Sat Apr 17 06:43:09 EDT 2010


On Sat, Apr 17, 2010 at 12:00 AM, Jason Dagit <dagit at codersbase.com> wrote:
>
> Myself and others posted "simpler" programs that had similar bad behavior,
> including the space leak (depending on optimizations flags).  I realize it's
> tedious to retest all those versions, but do you think you could check with
> one of the other versions that doesn't need mtl?

You got me curious enough that I decided to attack it systematically.
Here is a test-run script:

\begin{code}
#!/bin/bash

GHC68=/usr/local/stow/ghc-6.8.3/bin/ghc
GHC610=/usr/local/stow/ghc-6.10.4/bin/ghc
GHC612=ghc
GHC6HEAD=/usr/local/stow/ghc-6.13-20100416/bin/ghc

run_round () {
    EXE=$(basename ${1} .hs)
    echo ----------------------------------------------
    echo GHC68
    ${GHC68} --make $2 $1
    ./${EXE} +RTS -tstderr $3
    echo ----------------------------------------------
    echo GHC610
    ${GHC610} --make $2 $1
    ./${EXE} +RTS -tstderr $3
    echo ----------------------------------------------
    echo GHC612
    ${GHC612} --make $2 $1
    ./${EXE} +RTS -tstderr $3
    echo ----------------------------------------------
    echo GHC HEAD
    ${GHC6HEAD} --make -rtsopts $2 $1
    ./${EXE} +RTS -tstderr $3
}

run_round "$1" "$2" "$3"
\end{code}

With this script in hand, we can run your "Good" version:

jlouis at illithid:~$ sh runner.sh JD-Good.hs
----------------------------------------------
GHC68
[1 of 1] Compiling Main             ( JD-Good.hs, JD-Good.o )
Linking JD-Good ...
./JD-Good +RTS -tstderr
Main thread starting
Delaying
<<ghc: 1574462176 bytes, 3005 GCs, 40960/40960 avg/max bytes residency
(1 samples), 1M in use, 0.00 INIT (0.00 elapsed), 1.00 MUT (1.00
elapsed), 0.01 GC (0.01 elapsed) :ghc>>
----------------------------------------------
GHC610
[1 of 1] Compiling Main             ( JD-Good.hs, JD-Good.o )
Linking JD-Good ...
./JD-Good +RTS -tstderr
Main thread starting
Delaying
<<ghc: 1475896128 bytes, 2816 GCs, 21280/21280 avg/max bytes residency
(1 samples), 1M in use, 0.00 INIT (0.00 elapsed), 0.97 MUT (0.99
elapsed), 0.02 GC (0.01 elapsed) :ghc>>
----------------------------------------------
GHC612
[1 of 1] Compiling Main             ( JD-Good.hs, JD-Good.o )
Linking JD-Good ...
./JD-Good +RTS -tstderr
Main thread starting
Delaying
<<ghc: 667470136 bytes, 1274 GCs, 31384/31384 avg/max bytes residency
(1 samples), 1M in use, 0.00 INIT (0.00 elapsed), 1.00 MUT (1.00
elapsed), 0.00 GC (0.01 elapsed) :ghc>>
----------------------------------------------
GHC HEAD
[1 of 1] Compiling Main             ( JD-Good.hs, JD-Good.o )
Linking JD-Good ...
./JD-Good +RTS -tstderr
Main thread starting
Delaying
<<ghc: 1013993664 bytes, 1935 GCs, 30600/30600 avg/max bytes residency
(1 samples), 1M in use, 0.00 INIT (0.00 elapsed), 1.00 MUT (1.00
elapsed), 0.00 GC (0.01 elapsed) :ghc>>

Your "Bad" one consistently fails to terminate on all versions of GHC,
and adding -threaded does not help. Neil Brown's version is likewise
consistent across all versions of GHC: it doesn't terminate either,
with or without -threaded.

> Well, I think Bulat correctly characterized the non-termination aspect.  I
> didn't think the cooperative aspect of threading applied with the threaded
> RTS, so I'm not 100% sure I believe his characterization, but otherwise it
> seems like a reasonable explanation.

It is certainly a valid explanation, and I think the most plausible
one at the moment.
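To make the cooperative-scheduling explanation concrete, here is a minimal sketch (my own construction, not any of the programs from this thread). GHC's scheduler preempts threads only at heap-allocation points, so a tight loop that never allocates can starve every other thread; an explicit `yield` restores cooperation. With `yield` the program below runs to completion; replace `yield` with `return ()` and compile with -O, and the loop may compile to a non-allocating spin that never lets the main thread wake from its delay.

\begin{code}
import Control.Concurrent (forkIO, threadDelay, yield)
import Control.Monad (forever)

main :: IO ()
main = do
  putStrLn "Main thread starting"
  -- A forked thread that explicitly cedes the scheduler on every
  -- iteration, so the main thread still gets to run.
  _ <- forkIO (forever yield)
  putStrLn "Delaying"
  threadDelay 100000  -- 0.1 s; main exiting kills the forked thread
  putStrLn "Done"
\end{code}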

> The space leakiness is a different
> issue and likely worth a bug report in its own right.  Do you think you
> could try checking for the space leak using the compacting garbage
> collector?  I think that one is enabled with +RTS -c -RTS.

Oh, that gives some interesting progress:

Here is the run without -c:

jlouis at illithid:~$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 6.12.1
jlouis at illithid:~$ ghc --make -threaded Post.hs
jlouis at illithid:~$ ./Post +RTS -s
./Post +RTS -s
Main thread starting
Delaying
     840,429,800 bytes allocated in the heap
     336,183,280 bytes copied during GC
      86,294,808 bytes maximum residency (8 sample(s))
       2,648,600 bytes maximum slop
             171 MB total memory in use (3 MB lost due to fragmentation)

  Generation 0:  1596 collections,     0 parallel,  0.35s,  0.33s elapsed
  Generation 1:     8 collections,     0 parallel,  0.27s,  0.35s elapsed

  Parallel GC work balance: nan (0 / 0, ideal 1)

                        MUT time (elapsed)       GC time  (elapsed)
  Task  0 (worker) :    0.00s    (  0.32s)       0.00s    (  0.00s)
  Task  1 (worker) :    0.37s    (  0.32s)       0.62s    (  0.68s)
  Task  2 (worker) :    0.00s    (  0.32s)       0.00s    (  0.00s)
  Task  3 (worker) :    0.00s    (  0.32s)       0.00s    (  0.00s)

  SPARKS: 0 (0 converted, 0 pruned)

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    0.28s  (  0.32s elapsed)
  GC    time    0.62s  (  0.68s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    0.90s  (  1.00s elapsed)

  %GC time      68.9%  (67.6% elapsed)

  Alloc rate    3,001,331,338 bytes per MUT second

  Productivity  31.1% of total user, 27.9% of total elapsed

gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].steps[0].sync_large_objects: 0
gen[0].steps[1].sync_large_objects: 0
gen[1].steps[0].sync_large_objects: 0

And here with the compacting GC:

jlouis at illithid:~$ ./Post +RTS -s -c
./Post +RTS -s -c
Main thread starting
Delaying
      12,642,360 bytes allocated in the heap
       2,522,160 bytes copied during GC
       2,522,584 bytes maximum residency (3 sample(s))
          59,232 bytes maximum slop
               4 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:    22 collections,     0 parallel,  0.02s,  0.01s elapsed
  Generation 1:     3 collections,     0 parallel,  5.08s,  5.09s elapsed

  Parallel GC work balance: nan (0 / 0, ideal 1)

                        MUT time (elapsed)       GC time  (elapsed)
  Task  0 (worker) :    0.00s    (  0.01s)       0.00s    (  0.00s)
  Task  1 (worker) :    0.00s    (  0.01s)       5.10s    (  5.10s)
  Task  2 (worker) :    0.00s    (  0.01s)       0.00s    (  0.00s)
  Task  3 (worker) :    0.00s    (  0.01s)       0.00s    (  0.00s)

  SPARKS: 0 (0 converted, 0 pruned)

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    0.01s  (  0.01s elapsed)
  GC    time    5.10s  (  5.10s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    5.10s  (  5.11s elapsed)

  %GC time      99.9%  (99.8% elapsed)

  Alloc rate    3,159,800,049 bytes per MUT second

  Productivity   0.0% of total user, 0.0% of total elapsed

gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].steps[0].sync_large_objects: 0
gen[0].steps[1].sync_large_objects: 0
gen[1].steps[0].sync_large_objects: 0

So the compacting collector does eliminate the space leak in one
sense, but note that the mutator barely gets to do any work: nearly
all the time goes to the GC, and we only manage 22 Gen0 collections
and 3 Gen1 collections. In other words, I don't think it actually
helps with the leak. According to heap profiling, two things take up
memory: PAPs (partial applications in the RTS, i.e. functions that
have not yet received all their arguments) and a closure. That a PAP
shows up makes sense if you look at the Core: a State monad
computation is a function s -> (a, s), and StateT with IO as the
underlying monad effectively becomes s -> ioS -> ((a, s), ioS), so I
am not too surprised the PAP appears. Thinking more about this might
reveal exactly why the PAP accumulates, however.
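To make the PAP observation concrete, here is a sketch of the types involved. These are the standard mtl-style definitions (abbreviated for illustration), not the code from the thread:

\begin{code}
-- Classic mtl-style state monad definitions:
newtype State s a = State { runState :: s -> (a, s) }
newtype StateT s m a = StateT { runStateT :: s -> m (a, s) }

-- Viewing IO itself as a state monad over a RealWorld token,
-- a `StateT s IO a` computation is morally
--     s -> RealWorld -> ((a, s), RealWorld)
-- a curried chain that remains a partial application (PAP) until
-- both arguments have arrived -- consistent with the PAPs seen in
-- the heap profile.

main :: IO ()
main = do
  print (runState (State (\s -> (s + 1, s))) (0 :: Int))
  print =<< runStateT (StateT (\s -> return (s * 2, s))) (3 :: Int)
\end{code}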

Also, if you need me to run any kind of test against other RTS options
or programs, I'll be happy to do it. Just bump me :)


-- 
J.

