[GHC] #13623: join points produce bad code for stream fusion

Thu Apr 27 23:10:34 UTC 2017

#13623: join points produce bad code for stream fusion
-------------------------------------+-------------------------------------
           Reporter:  choenerzs      |             Owner:  (none)
               Type:  bug            |            Status:  new
           Priority:  normal         |         Milestone:
          Component:  Compiler       |           Version:  8.2.1-rc1
           Keywords:                 |  Operating System:  Unknown/Multiple
       Architecture:                 |   Type of failure:  Runtime
  Unknown/Multiple                   |  performance bug
          Test Case:                 |        Blocked By:
           Blocking:                 |   Related Tickets:
Differential Rev(s):                 |         Wiki Page:
-------------------------------------+-------------------------------------
 Below, I am generating to stream fusion streams xs and ys. Both
 parameterized on k l. The two streams are then concatenated. Finally I do
 a strict left fold.

 This example needs the 'vector' package but nothing else.

 {{{
 module Test where

 import Data.Vector.Fusion.Stream.Monadic as S

 foo :: Int -> Int -> IO Int
 foo = \i j -> S.foldl' (+) 0 $ xs i j S.++ ys i j
   where xs k l = S.enumFromStepN k l 2
         ys k l = S.enumFromStepN k l 3
         {-# Inline xs #-}
         {-# Inline ys #-}
 {-# Inline foo #-}
 }}}

 With ghc-8.0.1 I get nice core:

 {{{
 $wfoo_r1Ai
 $wfoo_r1Ai =
   \ ww_s1q5 ww1_s1q9 w_s1q2 ->
     letrec {
       $s$wfoldlM'_loop_s1xc
       $s$wfoldlM'_loop_s1xc =
         \ sc_s1x7 sc1_s1x5 sc2_s1x6 sc3_s1x4 ->
           case tagToEnum# (># sc2_s1x6 0#) of _ {
             False -> (# sc_s1x7, I# sc3_s1x4 #);
             True ->
               $s$wfoldlM'_loop_s1xc
                 sc_s1x7
                 (+# sc1_s1x5 ww1_s1q9)
                 (-# sc2_s1x6 1#)
                 (+# sc3_s1x4 sc1_s1x5)
           }; } in
     letrec {
       $s$wfoldlM'_loop1_s1x3
       $s$wfoldlM'_loop1_s1x3 =
         \ sc_s1x2 sc1_s1x0 sc2_s1x1 sc3_s1wZ ->
           case tagToEnum# (># sc2_s1x1 0#) of _ {
             False -> $s$wfoldlM'_loop_s1xc sc_s1x2 ww_s1q5 3# sc3_s1wZ;
             True ->
               $s$wfoldlM'_loop1_s1x3
                 sc_s1x2
                 (+# sc1_s1x0 ww1_s1q9)
                 (-# sc2_s1x1 1#)
                 (+# sc3_s1wZ sc1_s1x0)
           }; } in
     $s$wfoldlM'_loop1_s1x3 w_s1q2 ww_s1q5 2# 0#
 }}}

 Now the same with ghc-8.2-rc1. Here,
 [https://github.com/haskell/vector/blob/master/Data/Vector/Fusion/Stream/Monadic.hs
 Stream.++] function is not fully optimized away (Left and Right
 constructors!). Instead we have a join point that executes either of the
 two parts (xs or ys) based on a case w2_s1U2 of {Left -> ; Right ->}.

 {{{
 $wfoo_r23R
 $wfoo_r23R
   = \ ww_s1Ue ww1_s1Ui w_s1Ub ->
       let {
         x1_a1tj
         x1_a1tj = I# ww_s1Ue } in
       let {
         tb_a1wC
         tb_a1wC = (x1_a1tj, lvl1_r23Q) } in
       let {
         lvl2_s1Yh
         lvl2_s1Yh = Right tb_a1wC } in
       joinrec {
         $wfoldlM'_loop_s1U8
         $wfoldlM'_loop_s1U8 w1_s1U0 ww2_s1U6 w2_s1U2 w3_s1U3
           = case w1_s1U0 of { __DEFAULT ->
             case w2_s1U2 of {
               Left sa_a1yP ->
                 case sa_a1yP of { (w4_a1zr, m1_a1zs) ->
                 case m1_a1zs of { I# x2_a1zw ->
                 case tagToEnum# (># x2_a1zw 0#) of {
                   False -> jump $wfoldlM'_loop_s1U8 SPEC ww2_s1U6
 lvl2_s1Yh w3_s1U3;
                   True ->
                     case w4_a1zr of { I# y_a1xT ->
                     jump $wfoldlM'_loop_s1U8
                       SPEC
                       (+# ww2_s1U6 y_a1xT)
                       (Left (I# (+# y_a1xT ww1_s1Ui), I# (-# x2_a1zw 1#)))
                       w3_s1U3
                     }
                 }
                 }
                 };
               Right sb_a1z3 ->
                 case sb_a1z3 of { (w4_a1zr, m1_a1zs) ->
                 case m1_a1zs of { I# x2_a1zw ->
                 case tagToEnum# (># x2_a1zw 0#) of {
                   False -> (# w3_s1U3, I# ww2_s1U6 #);
                   True ->
                     case w4_a1zr of { I# y_a1xT ->
                     jump $wfoldlM'_loop_s1U8
                       SPEC
                       (+# ww2_s1U6 y_a1xT)
                       (Right (I# (+# y_a1xT ww1_s1Ui), I# (-# x2_a1zw
 1#)))
                       w3_s1U3
                     }
                 }
                 }
                 }
             }
             }; } in
       jump $wfoldlM'_loop_s1U8 SPEC 0# (Left (x1_a1tj, lvl_r23P)) w_s1Ub
 }}}

 For my stream-fusion heavy code, this yields a slowdown of approximately
 x4 (10 seconds with ghc-8.2-rc1, 2.5 seconds with ghc-8.0.1).

--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/13623>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler