[GHC] #14941: Switching direct type family application to EqPred (~) prevents inlining in code using vector (10x slowdown)
GHC
ghc-devs at haskell.org
Mon Mar 19 20:52:52 UTC 2018
#14941: Switching direct type family application to EqPred (~) prevents inlining in
code using vector (10x slowdown)
-------------------------------------+-------------------------------------
Reporter: nh2 | Owner: (none)
Type: bug | Status: new
Priority: normal | Milestone:
Component: Compiler | Version: 8.2.2
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture:
Type of failure: Runtime | Unknown/Multiple
performance bug | Test Case:
Blocked By: | Blocking:
Related Tickets: | Differential Rev(s):
Wiki Page: |
-------------------------------------+-------------------------------------
Description changed by nh2:
Old description:
> {{{#!hs
> selectVectorDestructive2 :: (PrimMonad m, VG.Vector v e, Ord e, Show e,
> VG.Mutable v ~ vm)
> => Int -> v e -> vm (PrimState m) e -> Int -> Int -> m ()
> -- slow
>
> selectVectorDestructive2 :: (PrimMonad m, VG.Vector v e, Ord e, Show e)
> => Int -> v e -> VG.Mutable v (PrimState m) e -> Int -> Int -> m ()
> -- fast
> }}}
>
> These two functions are identical except one has `VG.Mutable v ~ vm` as a
> constraint, the other one has it in the type signature right of the `=>`.
>
> The second function is 10x faster.
>
> I would expect them to be equally fast.
>
> The code of the functions is identical, I change only the type
> declaration.
>
> The slowness happens because with the first function, inlining of
> primitives like `unsafeRead` does not happen, and thus also it boxes the
> `Int#`s back to `Int`s when calling `unsafeRead`.
>
> In particular, in `-ddump-simpl`, the slow version has
>
> {{{#!hs
> $wpartitionLoop2_rgEy
> :: forall (m :: * -> *) (vm :: * -> * -> *) e.
> (PrimMonad m, MVector vm e, Ord e) =>
> vm (PrimState m) e
> -> GHC.Prim.Int#
> -> e
> -> GHC.Prim.Int#
> -> GHC.Prim.Int#
> -> GHC.Prim.Int#
> -> m Int
> $wpartitionLoop2_rgEy
> = \... ->
> let {
> endIndex_a7Ay :: Int
> ...
> endIndex_a7Ay = GHC.Types.I# ww2_sfZn } in
> ...
> (VGM.basicUnsafeRead
> ...
> endIndex_a7Ay)
> ...
> }}}
>
> while the fast version has
>
> {{{#!hs
> $w$spartitionLoop2_rgUN
> :: GHC.Prim.Int#
> -> GHC.Prim.Int#
> -> GHC.Prim.MutableByteArray# GHC.Prim.RealWorld
> -> GHC.Prim.Int#
> -> GHC.Prim.Int#
> -> GHC.Prim.Int#
> -> GHC.Prim.Int#
> -> GHC.Prim.Int#
> -> GHC.Prim.State# GHC.Prim.RealWorld
> -> (# GHC.Prim.State# GHC.Prim.RealWorld, Int #)
>
> ...
>
> $spartitionLoop2_rgUP
> :: VG.Mutable VU.Vector (PrimState IO) Int
> -> Int
> -> Int
> -> Int
> -> Int
> -> Int
> -> GHC.Prim.State# GHC.Prim.RealWorld
> -> (# GHC.Prim.State# GHC.Prim.RealWorld, Int #)
> }}}
>
> So with `VG.Mutable v (PrimState m) e` in the type signature, GHC managed
> to inline + specialise it all the way down to concrete types
> (`VUM.MVector`, `IO`, and `Int` as the element type), and consequently
> inline `basicUnsafeRead` on unboxed `Int#`.
>
> But with `VG.Mutable v ~ vm`, ghc keeps `vm (PrimState m) e` all the way,
> passes around dictionaries, thus eventually cannot inline
> `basicUnsafeRead` and re-packs already unboxed values, like
> `endIndex_a7Ay = GHC.Types.I# ww2_sfZn`, before passing them into the
> non-inlined call of `basicUnsafeRead`, thus making a tight loop allocate
> that normally wouldn't allocate.
>
> Why might rewriting the type signature in such a trivial way make this
> happen?
>
> ----
>
> I have tested this on GHC 8.0.2, GHC 8.2.2, and GHC 8.5 HEAD commit
> cc4677c36ee.
>
> Reproducer:
>
> * https://github.com/nh2/haskell-quickselect-median-of-
> medians/blob/0efd6293e779fda2d864ec3d75329fb16b8af6d9/Main.hs#L506
> * Running instructions are [https://github.com/nh2/haskell-quickselect-
> median-of-medians/commit/7a49d673990dfaebdb0ba837c3fbbaae0455dba0 in this
> commit message]; for short: `stack exec -- ghc -O --make Main.hs -rtsopts
> -ddump-simpl -dsuppress-coercions -fforce-recomp -ddump-to-file -fno-
> full-laziness && ./Main +RTS -sstderr`
> * For that file, I have pregenerated `-dverbose-core2core` output here:
> https://github.com/nh2/haskell-quickselect-median-of-
> medians/tree/0efd6293e779fda2d864ec3d75329fb16b8af6d9/slowness-analysis
> * I originally just wanted to write a fast median-of-medians
> implementation on ZuriHac 2017, but got totally derailed by this
> performance problem. The version of it that I link here is a total mess
> because of me mauling it to track down this performance issue.
>
> Trying to increase `-funfolding-use-threshold` or `-funfolding-keeness-
> factor` did not change the situation.
New description:
{{{#!hs
selectVectorDestructive2 :: (PrimMonad m, VG.Vector v e, Ord e, Show e,
VG.Mutable v ~ vm)
=> Int -> v e -> vm (PrimState m) e -> Int -> Int -> m ()
-- slow
selectVectorDestructive2 :: (PrimMonad m, VG.Vector v e, Ord e, Show e)
=> Int -> v e -> VG.Mutable v (PrimState m) e -> Int -> Int -> m ()
-- fast
}}}
These two functions are identical except one has `VG.Mutable v ~ vm` as a
constraint, the other one has it in the type signature right of the `=>`.
The second function is 10x faster.
I would expect them to be equally fast.
The code of the functions is identical, I change only the type
declaration.
The slowness happens because with the first function, inlining of
primitives like `unsafeRead` does not happen, and thus also it boxes the
`Int#`s back to `Int`s when calling `unsafeRead`.
In particular, in `-ddump-simpl`, the slow version has
{{{#!hs
$wpartitionLoop2_rgEy
:: forall (m :: * -> *) (vm :: * -> * -> *) e.
(PrimMonad m, MVector vm e, Ord e) =>
vm (PrimState m) e
-> GHC.Prim.Int#
-> e
-> GHC.Prim.Int#
-> GHC.Prim.Int#
-> GHC.Prim.Int#
-> m Int
$wpartitionLoop2_rgEy
= \... ->
let {
endIndex_a7Ay :: Int
...
endIndex_a7Ay = GHC.Types.I# ww2_sfZn } in
...
(VGM.basicUnsafeRead
...
endIndex_a7Ay)
...
}}}
while the fast version has
{{{#!hs
$w$spartitionLoop2_rgUN
:: GHC.Prim.Int#
-> GHC.Prim.Int#
-> GHC.Prim.MutableByteArray# GHC.Prim.RealWorld
-> GHC.Prim.Int#
-> GHC.Prim.Int#
-> GHC.Prim.Int#
-> GHC.Prim.Int#
-> GHC.Prim.Int#
-> GHC.Prim.State# GHC.Prim.RealWorld
-> (# GHC.Prim.State# GHC.Prim.RealWorld, Int #)
...
$spartitionLoop2_rgUP
:: VG.Mutable VU.Vector (PrimState IO) Int
-> Int
-> Int
-> Int
-> Int
-> Int
-> GHC.Prim.State# GHC.Prim.RealWorld
-> (# GHC.Prim.State# GHC.Prim.RealWorld, Int #)
}}}
So with `VG.Mutable v (PrimState m) e` in the type signature, GHC managed
to inline + specialise it all the way down to concrete types
(`VUM.MVector`, `IO`, and `Int` as the element type), and consequently
inline `basicUnsafeRead` on unboxed `Int#`.
But with `VG.Mutable v ~ vm`, ghc keeps `vm (PrimState m) e` all the way,
passes around dictionaries, thus eventually cannot inline
`basicUnsafeRead` and re-packs already unboxed values, like `endIndex_a7Ay
= GHC.Types.I# ww2_sfZn`, before passing them into the non-inlined call of
`basicUnsafeRead`, thus making a tight loop allocate that normally
wouldn't allocate.
Why might rewriting the type signature in such a trivial way make this
happen?
----
I have tested this on GHC 8.0.2, GHC 8.2.2, and GHC 8.5 HEAD commit
cc4677c36ee (edit: in which case I commented out the quickcheck and
criterion related stuff because their deps don't build there yet).
Reproducer:
* https://github.com/nh2/haskell-quickselect-median-of-
medians/blob/0efd6293e779fda2d864ec3d75329fb16b8af6d9/Main.hs#L506
* Running instructions are [https://github.com/nh2/haskell-quickselect-
median-of-medians/commit/7a49d673990dfaebdb0ba837c3fbbaae0455dba0 in this
commit message]; for short: `stack exec -- ghc -O --make Main.hs -rtsopts
-ddump-simpl -dsuppress-coercions -fforce-recomp -ddump-to-file -fno-full-
laziness && ./Main +RTS -sstderr`
* For that file, I have pregenerated `-dverbose-core2core` output here:
https://github.com/nh2/haskell-quickselect-median-of-
medians/tree/0efd6293e779fda2d864ec3d75329fb16b8af6d9/slowness-analysis
* I originally just wanted to write a fast median-of-medians
implementation on ZuriHac 2017, but got totally derailed by this
performance problem. The version of it that I link here is a total mess
because of me mauling it to track down this performance issue.
Trying to increase `-funfolding-use-threshold` or `-funfolding-keeness-
factor` did not change the situation.
--
--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/14941#comment:2>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler
More information about the ghc-tickets
mailing list