[Haskell-cafe] RE: GHC optimization issue

Thu Dec 29 04:22:06 EST 2005

I did look into this a little.  There are several things going on

* GHC doesn't really expect you to use INLINE and SPECIALISE together.
INLINE says to inline a copy of the function at every call site, which
is the best possible form of specialisation, so SPECIALISE is a bit
redundant if you are happy to do that.  (Incidentally, INLINE is ignored
for recursive functions, whereas SPECIALISE works fine.)

* If you have both INLINE and SPECIALISE, GHC makes a specialised copy
of the function, and the specialised copy takes precedence. So a call at
the specialised type turns into a call to the specialised function.   (A
call at any other type would not use the specialisation, and so the
un-specialised function would be inlined.)  Moreover, the specialised
copies are themselves not marked INLINE.

* GHC's HEAD in CVS (i.e 6.5) has a new feature
	{#- SPECIALISE INLINE f :: <type> #-}
which tells GHC to specialise f and inline the specialised copy (only).
Ordinary, un-specialised calls will not be inlined.

* When you delete the pragmas, GHC uses a size-based heuristic to decide
when to inline a function at a call site.  Oddly, this heuristic decides
(just) to inline the un-specialised function, but when it's specialised
GHC thinks it's bigger and doesn't inline it.    You can increase the
threshold for inlining with -funfolding-use-threshold=20, say.

http://www.haskell.org/ghc/docs/latest/html/users_guide/options-optimise
.html#options-f
This really is a bit odd, but I don't want to fiddle with GHC's inlining
heuristics without doing lots of benchmark runs etc.  Actually, it'd be
quite an easy thing for others to try fiddling with, and I bet there are
performance gains to be had.

Simon

| -----Original Message-----
| From: Joel Reymont [mailto:joelr1 at gmail.com]
| Sent: 22 December 2005 14:09
| To: Haskell-Cafe Cafe
| Cc: Simon Peyton-Jones; Simon Marlow
| Subject: GHC optimization issue
| 
| Folks,
| 
| I have been trying to improve my byte swapping routines as part of my
| effort to speed up serialization. I then tried to look at the core
| output from GHC to see what it was converting my code into. Brandon
| (skew on #haskell) helped me code a TH version but then I went with a
| regular CPP version instead.
| 
| The point of contention is that the logical approach to optimization
| does not produce expected results. The only difference between the
| following two core outputs is the presence or absence of specialize
| and inline pragmas. The code is the same.
| 
| The version without pragmas inlines foo1 into
| 
| Main.foo1 :: GHC.Word.Word16
| [GlobalId]
| [Str: DmdType]
| Main.foo1 = case GHC.Word.Word16 GHC.Word.$wshift3 __word 3855 (-8)
|              of ww1_a2eo { __DEFAULT ->
|              case GHC.Word.Word16 GHC.Word.$wshift3 __word 15 8 of
| ww_a2f4 { __DEFAULT ->
|              GHC.Word.W16# (GHC.Prim.narrow16Word#
| (GHC.Prim.plusWord# ww1_a2eo ww_a2f4))
|              }
|              }
| 
| Whereas the version _with_ pragmas produces a function call:
| 
| Main.foo1 :: GHC.Word.Word16
| [GlobalId]
| [Str: DmdType]
| Main.foo1 = case GHC.Word.Word16 Swap1.$w$sswap16 __word 3855
|              of ww1_s1Ia { __DEFAULT ->
|              GHC.Word.W16# ww1_s1Ia
|              }
| 
| Is there a reasonable explanation?
| 
| Both versions compiled thusly: ghc --make Foo.hs -O -ddump-simpl > foo
| 
| Swap:
| 
| {-# OPTIONS_GHC -fglasgow-exts -cpp #-}
| module Swap1
| (
| swap16
| )
| where
| 
| import Data.Word
| import Data.Int
| import Data.Bits
| 
| #define BIG_ENDIAN 1
| 
| {-# SPECIALIZE swap16 :: Word16 -> Word16 #-}
| {-# SPECIALIZE swap16 :: Int16 -> Int16 #-}
| {-# INLINE swap16 #-}
| 
| swap16 :: Bits a => a -> a
| #ifdef BIG_ENDIAN
| swap16 v = (v `shiftR` 8) + ((v .&. 0xFF) `shiftL` 8)
| #else
| swap16 v = v
| #endif
| 
| Foo:
| 
| module Main where
| 
| import Data.Word
| import Swap1
| 
| foo1 :: Word16
| foo1 = swap16 0x0f0f
| 
| main = putStrLn $ show foo1
| 
| Core output for the version WITH pragmas, see the version w/o pragmas
| way below
| 
| ==================== Tidy Core Rules ====================
| "SPEC Swap1.swap16" __forall {$dBits_X1Co :: {Data.Bits.Bits
| GHC.Int.Int16}}
|                        Swap1.swap16 @ GHC.Int.Int16 $dBits_X1Co
|                        = Swap1.$sswap16 ;
| "SPEC Swap1.swap16" __forall {$dBits_X1Cv :: {Data.Bits.Bits
| GHC.Word.Word16}}
|                        Swap1.swap16 @ GHC.Word.Word16 $dBits_X1Cv
|                        = Swap1.$sswap161 ;
| 
| 
| ==================== Tidy Core ====================
| Main.foo1 :: GHC.Word.Word16
| [GlobalId]
| [Str: DmdType]
| Main.foo1 = case GHC.Word.Word16 Swap1.$w$sswap16 __word 3855
|              of ww1_s1Ia { __DEFAULT ->
|              GHC.Word.W16# ww1_s1Ia
|              }
| 
| ==================== Tidy Core ====================
| Swap1.$w$sswap16 :: GHC.Prim.Word# -> GHC.Prim.Word#
| [GlobalId]
| [Arity 1
| NoCafRefs
| Str: DmdType L]
| Swap1.$w$sswap16 = \ (ww_s1I5 :: GHC.Prim.Word#) ->
|                       case GHC.Prim.Word# GHC.Word.$wshift3 ww_s1I5
| (-8) of ww1_a1Gx { __DEFAULT ->
|                       case GHC.Prim.Word# GHC.Word.$wshift3
| (GHC.Prim.and# ww_s1I5 __word 255) 8
|                       of ww11_a1Hd { __DEFAULT ->
|                       GHC.Prim.narrow16Word# (GHC.Prim.plusWord#
| ww1_a1Gx ww11_a1Hd)
|                       }
|                       }
| 
| Swap1.$w$sswap161 :: GHC.Prim.Int# -> GHC.Prim.Int#
| [GlobalId]
| [Arity 1
| NoCafRefs
| Str: DmdType L]
| Swap1.$w$sswap161 = \ (ww_s1HV :: GHC.Prim.Int#) ->
|                        case GHC.Prim.Int# GHC.Int.$wshift2 ww_s1HV
| (-8) of ww1_a1Fk { __DEFAULT ->
|                        case GHC.Prim.Int# GHC.Int.$wshift2
|                                             (GHC.Prim.word2Int#
| (GHC.Prim.and# (GHC.Prim.int2Word# ww_s1HV) __word 255))
|                                             8
|                        of ww11_a1G5 { __DEFAULT ->
|                        GHC.Prim.narrow16Int# (GHC.Prim.+# ww1_a1Fk
| ww11_a1G5)
|                        }
|                        }
| 
| Swap1.$sswap16 :: GHC.Int.Int16 -> GHC.Int.Int16
| [GlobalId]
| [Arity 1
| Worker Swap1.$w$sswap161
| NoCafRefs
| Str: DmdType U(L)m]
| Swap1.$sswap16 = __inline_me (\ (w_s1HT :: GHC.Int.Int16) ->
|                                  case GHC.Int.Int16 w_s1HT of w1_X1I5
| { GHC.Int.I16# ww_s1HV ->
|                                  case GHC.Int.Int16 Swap1.$w$sswap161
| ww_s1HV of ww1_s1I0 { __DEFAULT ->
|                                  GHC.Int.I16# ww1_s1I0
|                                  }
|                                  })
| 
| Swap1.$sswap161 :: GHC.Word.Word16 -> GHC.Word.Word16
| [GlobalId]
| [Arity 1
| Worker Swap1.$w$sswap16
| NoCafRefs
| Str: DmdType U(L)m]
| Swap1.$sswap161 = __inline_me (\ (w_s1I3 :: GHC.Word.Word16) ->
|                                   case GHC.Word.Word16 w_s1I3 of
| w1_X1Ih { GHC.Word.W16# ww_s1I5 ->
|                                   case GHC.Word.Word16 Swap1.$w
| $sswap16 ww_s1I5 of ww1_s1Ia { __DEFAULT ->
|                                   GHC.Word.W16# ww1_s1Ia
|                                   }
|                                   })
| 
| Swap1.swap16 :: forall a_a1aQ. (Data.Bits.Bits a_a1aQ) => a_a1aQ ->
| a_a1aQ
| [GlobalId]
| [Arity 1
| NoCafRefs
| Str: DmdType L]
| Swap1.swap16 = __inline_me (\ (@ a_a1dB)
|                                ($dBits_a1ip :: {Data.Bits.Bits
| a_a1dB}) ->
|                                let {
|                                  $dNum_s1BO :: {GHC.Num.Num a_a1dB}
|                                  [Str: DmdType {a1ip->U
| (SAAAAAAAAAAAAAAAAA)}]
|                                  $dNum_s1BO = Data.Bits.$p1Bits @
| a_a1dB $dBits_a1ip } in
|                                let {
|                                  lit_s1BN :: a_a1dB
|                                  [Str: DmdType {a1ip->U
| (SAAAAAAAAAAAAAAAAA) s1BO->U(AAAAAAAAS)}]
|                                  lit_s1BN = GHC.Num.fromInteger @
| a_a1dB $dNum_s1BO (GHC.Num.S# 255)
|                                } in
|                                  \ (v_a1aW :: a_a1dB) ->
|                                    GHC.Num.+
|                                      @ a_a1dB
|                                      $dNum_s1BO
|                                      (Data.Bits.shiftR @ a_a1dB
| $dBits_a1ip v_a1aW (GHC.Base.I# 8))
|                                      (Data.Bits.shiftL
|                                         @ a_a1dB
|                                         $dBits_a1ip
|                                         (Data.Bits..&. @ a_a1dB
| $dBits_a1ip v_a1aW lit_s1BN)
|                                         (GHC.Base.I# 8)))
| 
| 
| Core output without pragmas:
| 
| Main.foo1 :: GHC.Word.Word16
| [GlobalId]
| [Str: DmdType]
| Main.foo1 = case GHC.Word.Word16 GHC.Word.$wshift3 __word 3855 (-8)
|              of ww1_a2eo { __DEFAULT ->
|              case GHC.Word.Word16 GHC.Word.$wshift3 __word 15 8 of
| ww_a2f4 { __DEFAULT ->
|              GHC.Word.W16# (GHC.Prim.narrow16Word#
| (GHC.Prim.plusWord# ww1_a2eo ww_a2f4))
|              }
|              }
| 
| ---
| Swap1.swap16 :: forall a_a1aQ. (Data.Bits.Bits a_a1aQ) => a_a1aQ ->
| a_a1aQ
| [GlobalId]
| [Arity 1
| NoCafRefs
| Str: DmdType L]
| Swap1.swap16 = \ (@ a_a1dB) ($dBits_a1ip :: {Data.Bits.Bits a_a1dB})
->
|                   let {
|                     $dNum_s1BK :: {GHC.Num.Num a_a1dB}
|                     [Str: DmdType]
|                     $dNum_s1BK = case {GHC.Num.Num a_a1dB} $dBits_a1ip
|                                  of tpl_B1 { Data.Bits.:DBits tpl1_B2
| tpl2_B3 tpl3_B4 tpl4_B5 tpl5_B6 tpl6_B7 tpl7_B8 tpl8_B9 tpl9_Ba
| tpl10_Bb tpl11_Bc tpl12_Bd tpl13_Be tpl14_Bf tpl15_Bg tpl16_Bh
| tpl17_Bi tpl18_Bj ->
|                                  tpl1_B2
|                                  } } in
|                   let {
|                     lit_s1BM :: a_a1dB
|                     [Str: DmdType]
|                     lit_s1BM = case a_a1dB $dNum_s1BK
|                                of tpl_B1 { GHC.Num.:DNum tpl1_B2
| tpl2_B3 tpl3_B4 tpl4_B5 tpl5_B6 tpl6_B7 tpl7_B8 tpl8_B9 tpl9_Ba ->
|                                tpl9_Ba Swap1.lvl1
|                                }
|                   } in
|                     \ (v_a1aW :: a_a1dB) ->
|                       tpl3_B4
|                         (case a_a1dB $dBits_a1ip
|                          of tpl10_Xy { Data.Bits.:DBits tpl11_XB
| tpl12_XD tpl13_XF tpl14_XH tpl15_XJ tpl16_XL tpl17_XN tpl18_XP
| tpl19_XR tpl20_Bb tpl21_Bc tpl22_Bd tpl23_Be tpl24_Bf tpl25_Bg
| tpl26_Bh tpl27_Bi tpl28_Bj ->
|                          tpl26_Bh v_a1aW Swap1.lvl
|                          })
|                         (case a_a1dB $dBits_a1ip
|                          of tpl10_Xy { Data.Bits.:DBits tpl11_XB
| tpl12_XD tpl13_XF tpl14_XH tpl15_XJ tpl16_XL tpl17_XN tpl18_XP
| tpl19_XR tpl20_Bb tpl21_Bc tpl22_Bd tpl23_Be tpl24_Bf tpl25_Bg
| tpl26_Bh tpl27_Bi tpl28_Bj ->
|                          tpl25_Bg (tpl12_XD v_a1aW lit_s1BM) Swap1.lvl
|                          })
| 
| --
| http://wagerlabs.com/
| 
| 
| 
|