[GHC] #9476: Implement late lambda-lifting
GHC
ghc-devs at haskell.org
Tue Nov 27 16:36:47 UTC 2018
#9476: Implement late lambda-lifting
-------------------------------------+-------------------------------------
         Reporter:  simonpj          |                Owner:  sgraf
             Type:  feature request  |               Status:  closed
         Priority:  normal           |            Milestone:  8.8.1
        Component:  Compiler         |              Version:  7.8.2
       Resolution:  fixed            |             Keywords:  LateLamLift
 Operating System:  Unknown/Multiple |         Architecture:  Unknown/Multiple
  Type of failure:  Runtime          |            Test Case:
                    performance bug  |
       Blocked By:                   |             Blocking:
  Related Tickets:  #8763 #13286     |  Differential Rev(s):  Phab:D5224
        Wiki Page:  LateLamLift      |
-------------------------------------+-------------------------------------
Comment (by sgraf):
I'm currently writing the evaluation chapter of the paper and hit quite a
bummer.

If I add an option to ''ignore'' closure growth completely, executed
instructions improve by another 0.3% in the mean, with no regressions (!)
w.r.t. the baseline where we try to avoid closure growth. Although total
allocations may increase, executed instructions ''go down'' in some cases.
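For context, here is a minimal sketch of the trade-off the closure-growth
check guards against. The names are made up and this is not code from the
patch:

{{{
-- Before lifting: 'go' is a local function whose closure captures the
-- free variables 'a' and 'b', so each call of 'f' allocates a closure
-- for 'go'.
f :: Int -> Int -> [Int] -> Int
f a b xs = go xs
  where
    go []     = a
    go (y:ys) = a * y + b + go ys

-- After lifting: 'go2' is a top-level function taking 'a' and 'b' as
-- extra arguments, so no closure is allocated for it at all. The flip
-- side: any closure that used to capture just 'go' (one free variable)
-- must now capture 'a' and 'b' (two free variables) instead -- that is
-- the closure growth the analysis normally tries to avoid.
go2 :: Int -> Int -> [Int] -> Int
go2 a _ []     = a
go2 a b (y:ys) = a * y + b + go2 a b ys

f2 :: Int -> Int -> [Int] -> Int
f2 a b xs = go2 a b xs
}}}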
The most prominent case is `paraffins`: 18.6% more allocations, but 11.7%
''fewer'' executed instructions! Example run:
{{{
$ ./default 19 +RTS -s > /dev/null
     359,457,968 bytes allocated in the heap
     536,016,504 bytes copied during GC
     163,983,272 bytes maximum residency (8 sample(s))
       1,699,968 bytes maximum slop
             156 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       339 colls,     0 par    0.137s   0.137s     0.0004s    0.0025s
  Gen  1         8 colls,     0 par    0.386s   0.386s     0.0482s    0.1934s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.058s  (  0.058s elapsed)
  GC      time    0.523s  (  0.523s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    0.581s  (  0.581s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    6,243,240,498 bytes per MUT second

  Productivity   9.9% of total user, 9.9% of total elapsed

$ ./allow-cg 19 +RTS -s > /dev/null
     426,433,296 bytes allocated in the heap
     488,364,024 bytes copied during GC
     139,063,776 bytes maximum residency (8 sample(s))
       2,223,648 bytes maximum slop
             132 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       403 colls,     0 par    0.136s   0.136s     0.0003s    0.0010s
  Gen  1         8 colls,     0 par    0.317s   0.317s     0.0397s    0.1517s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.080s  (  0.080s elapsed)
  GC      time    0.453s  (  0.453s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    0.533s  (  0.533s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    5,359,023,067 bytes per MUT second

  Productivity  14.9% of total user, 14.9% of total elapsed
}}}
Note how allocations and the number of collections (these are correlated)
went up, but bytes copied and GC time (also correlated) went down. The only
possible conclusion here is that bytes allocated is a flawed metric for
predicting runtime: what matters most for runtime is bytes copied during
GC. Of course there's overhead for heap checks etc., but it seems that GC
pressure far outweighs it.
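To make that concrete, here are the ratios of `allow-cg` over `default`,
computed from the two runs above:

{{{
bytes allocated:  426,433,296 / 359,457,968 = 1.186   (+18.6%)
bytes copied:     488,364,024 / 536,016,504 = 0.911   ( -8.9%)
GC time:               0.453s /      0.523s = 0.866   (-13.4%)
MUT time:              0.080s /      0.058s = 1.379   (+37.9%)
Total time:            0.533s /      0.581s = 0.917   ( -8.3%)
}}}

Mutator time goes up with the extra allocation, but the drop in GC time
more than pays for it.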
Interestingly, if I 'disable' the GC by allocating 600MB of nursery, the
inverse effect manifests:
{{{
$ ./default 19 +RTS -s -A600M > /dev/null
     359,104,696 bytes allocated in the heap
           3,384 bytes copied during GC
          44,480 bytes maximum residency (1 sample(s))
          25,152 bytes maximum slop
               0 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0         0 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.000s   0.000s     0.0002s    0.0002s

  INIT    time    0.009s  (  0.009s elapsed)
  MUT     time    0.127s  (  0.127s elapsed)
  GC      time    0.000s  (  0.000s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    0.136s  (  0.136s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    2,821,937,180 bytes per MUT second

  Productivity  93.4% of total user, 93.5% of total elapsed

$ ./allow-cg 19 +RTS -s -A600M > /dev/null
     426,014,360 bytes allocated in the heap
           3,384 bytes copied during GC
          44,480 bytes maximum residency (1 sample(s))
          25,152 bytes maximum slop
               0 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0         0 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.000s   0.000s     0.0002s    0.0002s

  INIT    time    0.011s  (  0.011s elapsed)
  MUT     time    0.142s  (  0.142s elapsed)
  GC      time    0.000s  (  0.000s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    0.153s  (  0.153s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    2,994,250,656 bytes per MUT second

  Productivity  92.9% of total user, 92.9% of total elapsed
}}}
This is probably due to cache effects, although there still seem to be
strictly ''fewer'' instructions executed in the `allow-cg` case, which is
weird. Also, I don't think that enlarging the nursery to fit all
allocations is a realistic benchmark scenario.
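As an aside, instruction counts like the ones quoted above can be
collected with cachegrind; this is an assumption about the measurement
setup rather than something stated in the ticket:

{{{
$ valgrind --tool=cachegrind ./default 19 > /dev/null
$ valgrind --tool=cachegrind ./allow-cg 19 > /dev/null
# compare the "I refs" totals that cachegrind prints on exit
}}}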
The takeaway here is that lambda lifting somehow has beneficial effects on
GC, even if overall allocations go up and more collections happen as a
consequence, presumably because the lifted program keeps less data live at
each collection (note the lower maximum residency in the first pair of
runs), and copying GC cost scales with live data rather than with
allocation.
--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/9476#comment:55>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler