[GHC] #13623: join points produce bad code for stream fusion

Thu Apr 27 23:16:12 UTC 2017

#13623: join points produce bad code for stream fusion
-------------------------------------+-------------------------------------
        Reporter:  choenerzs         |                Owner:  (none)
            Type:  bug               |               Status:  new
        Priority:  high              |            Milestone:  8.2.1
       Component:  Compiler          |              Version:  8.2.1-rc1
      Resolution:                    |             Keywords:  JoinPoints
Operating System:  Unknown/Multiple  |         Architecture:
 Type of failure:  Runtime           |  Unknown/Multiple
  performance bug                    |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:                    |  Differential Rev(s):
       Wiki Page:                    |
-------------------------------------+-------------------------------------
Description changed by choenerzs:

@@ -122,0 +122,9 @@
+
+ ===
+
+   ghc-options:
+     -O2
+     -ddump-to-file
+     -ddump-simpl
+     -dsuppress-all
+     -dshow-passes

New description:

 Below, I am generating to stream fusion streams xs and ys. Both
 parameterized on k l. The two streams are then concatenated. Finally I do
 a strict left fold.

 This example needs the 'vector' package but nothing else.

 {{{#!hs
 module Test where

 import Data.Vector.Fusion.Stream.Monadic as S


 foo :: Int -> Int -> IO Int
 foo = \i j -> S.foldl' (+) 0 $ xs i j S.++ ys i j
   where xs k l = S.enumFromStepN k l 2
         ys k l = S.enumFromStepN k l 3
         {-# Inline xs #-}
         {-# Inline ys #-}
 {-# Inline foo #-}
 }}}

 With ghc-8.0.1 I get nice core:

 {{{#!hs
 $wfoo_r1Ai
 $wfoo_r1Ai =
   \ ww_s1q5 ww1_s1q9 w_s1q2 ->
     letrec {
       $s$wfoldlM'_loop_s1xc
       $s$wfoldlM'_loop_s1xc =
         \ sc_s1x7 sc1_s1x5 sc2_s1x6 sc3_s1x4 ->
           case tagToEnum# (># sc2_s1x6 0#) of _ {
             False -> (# sc_s1x7, I# sc3_s1x4 #);
             True ->
               $s$wfoldlM'_loop_s1xc
                 sc_s1x7
                 (+# sc1_s1x5 ww1_s1q9)
                 (-# sc2_s1x6 1#)
                 (+# sc3_s1x4 sc1_s1x5)
           }; } in
     letrec {
       $s$wfoldlM'_loop1_s1x3
       $s$wfoldlM'_loop1_s1x3 =
         \ sc_s1x2 sc1_s1x0 sc2_s1x1 sc3_s1wZ ->
           case tagToEnum# (># sc2_s1x1 0#) of _ {
             False -> $s$wfoldlM'_loop_s1xc sc_s1x2 ww_s1q5 3# sc3_s1wZ;
             True ->
               $s$wfoldlM'_loop1_s1x3
                 sc_s1x2
                 (+# sc1_s1x0 ww1_s1q9)
                 (-# sc2_s1x1 1#)
                 (+# sc3_s1wZ sc1_s1x0)
           }; } in
     $s$wfoldlM'_loop1_s1x3 w_s1q2 ww_s1q5 2# 0#
 }}}

 Now the same with ghc-8.2-rc1. Here,
 [https://github.com/haskell/vector/blob/master/Data/Vector/Fusion/Stream/Monadic.hs
 Stream.++] function is not fully optimized away (Left and Right
 constructors!). Instead we have a join point that executes either of the
 two parts (xs or ys) based on a `case w2_s1U2 of {Left -> ; Right ->}`.

 {{{#!hs
 $wfoo_r23R
 $wfoo_r23R
   = \ ww_s1Ue ww1_s1Ui w_s1Ub ->
       let {
         x1_a1tj
         x1_a1tj = I# ww_s1Ue } in
       let {
         tb_a1wC
         tb_a1wC = (x1_a1tj, lvl1_r23Q) } in
       let {
         lvl2_s1Yh
         lvl2_s1Yh = Right tb_a1wC } in
       joinrec {
         $wfoldlM'_loop_s1U8
         $wfoldlM'_loop_s1U8 w1_s1U0 ww2_s1U6 w2_s1U2 w3_s1U3
           = case w1_s1U0 of { __DEFAULT ->
             case w2_s1U2 of {
               Left sa_a1yP ->
                 case sa_a1yP of { (w4_a1zr, m1_a1zs) ->
                 case m1_a1zs of { I# x2_a1zw ->
                 case tagToEnum# (># x2_a1zw 0#) of {
                   False -> jump $wfoldlM'_loop_s1U8 SPEC ww2_s1U6
 lvl2_s1Yh w3_s1U3;
                   True ->
                     case w4_a1zr of { I# y_a1xT ->
                     jump $wfoldlM'_loop_s1U8
                       SPEC
                       (+# ww2_s1U6 y_a1xT)
                       (Left (I# (+# y_a1xT ww1_s1Ui), I# (-# x2_a1zw 1#)))
                       w3_s1U3
                     }
                 }
                 }
                 };
               Right sb_a1z3 ->
                 case sb_a1z3 of { (w4_a1zr, m1_a1zs) ->
                 case m1_a1zs of { I# x2_a1zw ->
                 case tagToEnum# (># x2_a1zw 0#) of {
                   False -> (# w3_s1U3, I# ww2_s1U6 #);
                   True ->
                     case w4_a1zr of { I# y_a1xT ->
                     jump $wfoldlM'_loop_s1U8
                       SPEC
                       (+# ww2_s1U6 y_a1xT)
                       (Right (I# (+# y_a1xT ww1_s1Ui), I# (-# x2_a1zw
 1#)))
                       w3_s1U3
                     }
                 }
                 }
                 }
             }
             }; } in
       jump $wfoldlM'_loop_s1U8 SPEC 0# (Left (x1_a1tj, lvl_r23P)) w_s1Ub
 }}}

 For my stream-fusion heavy code, this yields a slowdown of approximately
 x4 (10 seconds with ghc-8.2-rc1, 2.5 seconds with ghc-8.0.1).

 ===

   ghc-options:
     -O2
     -ddump-to-file
     -ddump-simpl
     -dsuppress-all
     -dshow-passes

--

--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/13623#comment:3>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler