[GHC] #11795: Performance issues with replicateM_

Thu Apr 7 14:13:42 UTC 2016

#11795: Performance issues with replicateM_
-------------------------------------+-------------------------------------
        Reporter:  snoyberg          |                Owner:  snoyberg
            Type:  bug               |               Status:  new
        Priority:  normal            |            Milestone:
       Component:  libraries/base    |              Version:  7.10.3
      Resolution:                    |             Keywords:
Operating System:  MacOS X           |         Architecture:  x86_64
 Type of failure:  Runtime           |  (amd64)
  performance bug                    |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:                    |  Differential Rev(s):  Phab:D2086
       Wiki Page:                    |
-------------------------------------+-------------------------------------

Comment (by bgamari):

 On Phab:D2086 Simon posed the very reasonable question of //why// these
 two differ so significantly in performance.

 In my original reading of the patch I had assumed that the reason was that
 the "fast" version would enable unboxing of the `Int` accumulator while
 the "slow" version would not. Looking at the Core, however, this does not
 appear to be the difference: this accumulator is unboxed in both cases.

 The actual difference is that the self-recursive nature of the slow
 version inhibits inlining, which means we never get to specialise to the
 action. Instead we end up with,
 {{{#!hs
 $w$sreplicateM_noWW
   :: forall a. Int# -> IO a -> State# RealWorld -> (# State# RealWorld, ()
 #)
 $w$sreplicateM_noWW =
   \ (@ a) (ww :: Int#) (w :: IO a) (w1 :: State# RealWorld) ->
     case ww of ds {
       __DEFAULT ->
         case (w `cast` ...) w1 of _ { (# ipv, ipv1 #) ->
         $w$sreplicateM_noWW @ a (-# ds 1#) w ipv
         };
       0# -> (# w1, () #)
     }

 lvl2 :: Ptr CChar -> State# RealWorld -> (# State# RealWorld, () #)
 lvl2 =
   \ (x :: Ptr CChar) (eta :: State# RealWorld) ->
     $w$sreplicateM_noWW
       @ CChar
       100000000#
       (($fStorableInt21 (x `cast` ...)) `cast` ...)
       eta
 }}}
 Which gives us O(n) slow calls.

 This stands in stark contrast to the fast variant, which compiles down to
 this beautiful allocation-free loop,
 {{{#!hs
 $wgo12 :: Int# -> State# RealWorld -> (# State# RealWorld, () #)
 $wgo12 =
   \ (ww :: Int#) (w :: State# RealWorld) ->
     case tagToEnum# @ Bool (<=# ww 0#) of _ {
       False ->
         case getForeignEncoding1
         of _ { (getForeignEncoding5, setForeignEncoding1) ->
         case (getForeignEncoding5 `cast` ...) w of _ { (# ipv, ipv1 #) ->
         case charIsRepresentable3 @ () ipv1 lvl1 (lvl2 `cast` ...) ipv
         of _ { (# ipv2, ipv3 #) ->
         case seq# @ () @ RealWorld ipv3 ipv2 of _ { (# ipv4, ipv5 #) ->
         $wgo12 (-# ww 1#) ipv4
         } } } };
       True -> (# w, () #)
     }
 }}}

--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/11795#comment:6>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler