[GHC] #8763: forM_ [1..N] does not get fused (10 times slower than go function)
GHC
ghc-devs at haskell.org
Thu Mar 29 13:07:00 UTC 2018
#8763: forM_ [1..N] does not get fused (10 times slower than go function)
-------------------------------------+-------------------------------------
Reporter: nh2 | Owner: (none)
Type: bug | Status: new
Priority: normal | Milestone: 8.6.1
Component: Compiler | Version: 7.6.3
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture:
Type of failure: Runtime | Unknown/Multiple
performance bug | Test Case:
Blocked By: | Blocking:
Related Tickets: #7206 | Differential Rev(s):
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by sgraf):
Here is a smaller example that highlights the problem without vectors. The
only difference between the two functions is the use of `[2,3..n]` instead
of `[2..n]`, which desugar to different functions. This results in a
difference in a huge difference in allocation as well as runtime:
{{{
$ ./repro 2 +RTS -s # [2,3..n]
()
960,056,856 bytes allocated in the heap
21,536 bytes copied during GC
44,576 bytes maximum residency (2 sample(s))
29,152 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max
pause
Gen 0 918 colls, 0 par 0.005s 0.003s 0.0000s
0.0000s
Gen 1 2 colls, 0 par 0.000s 0.000s 0.0001s
0.0002s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.123s ( 0.125s elapsed)
GC time 0.005s ( 0.003s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 0.129s ( 0.129s elapsed)
%GC time 4.1% (2.5% elapsed)
Alloc rate 7,778,808,106 bytes per MUT second
Productivity 95.8% of total user, 97.4% of total elapsed
}}}
{{{
$ ./repro 1 +RTS -s # [2..n]
()
56,872 bytes allocated in the heap
3,480 bytes copied during GC
44,576 bytes maximum residency (1 sample(s))
25,056 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max
pause
Gen 0 0 colls, 0 par 0.000s 0.000s 0.0000s
0.0000s
Gen 1 1 colls, 0 par 0.000s 0.000s 0.0001s
0.0001s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.048s ( 0.048s elapsed)
GC time 0.000s ( 0.000s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 0.048s ( 0.048s elapsed)
%GC time 0.2% (0.2% elapsed)
Alloc rate 1,188,432 bytes per MUT second
Productivity 99.6% of total user, 99.6% of total elapsed
}}}
This happens in `ST`, but not in `IO`, so probably related to some hack.
Also the difference vanishes when we allow the functions to inline.
Here's some Core for `g` (the offending function):
{{{
-- RHS size: {terms: 235, types: 242, coercions: 61, joins: 4/13}
$wg
$wg
= \ @ s ww w ->
let { $wc = <huge> } in
case <# ww 3# of {
__DEFAULT ->
let {
y'
y' = -# ww 1# } in
letrec {
go_up
go_up
= \ x eta ->
case ># x y' of {
__DEFAULT -> $wc x ((go_up (+# x 1#)) `cast` <Co:4>)
eta;
1# -> $wc x (lvl `cast` <Co:4>) eta
}; } in
$wc 2# ((go_up 3#) `cast` <Co:4>) w;
1# ->
case <# ww 2# of {
__DEFAULT -> $wc 2# (lvl `cast` <Co:4>) w;
1# -> (# w, () #)
}
}
}}}
From my understanding of join points, `$wc` is only nearly a join point,
because `go_up` with its transitive tail call to `$wc` appears in argument
position. It would be great if we could get rid of this! The `IO` variant
(`g 40000000 >>= print`) doesn't have this weakness, it's all join points
there. Hence my suspicion about some `IO` hack that let's GHC eta-expand
stuff more aggresively, but I'm not sure how that's helping.
--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/8763#comment:44>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler
More information about the ghc-tickets
mailing list