[Hs-Generics] Inlining and generic programming

Wed Feb 22 15:32:32 CET 2012

Hello all,

First of all, I'm sorry that this email is so absurdly long. But it's not
easy to explain the problem at hand, so I took a step-by-step approach. The
executive summary is: GHC can do a *great *job with inlining, but often it
doesn't, and I don't understand why. So I have some questions, which are
highlighted in the text below. In general, any insights regarding inlining
or improving the performance of generics are welcome. My final goal is to
be able to state that generic functions (in particular using GHC.Generics)
will have no runtime overhead whatsoever when compared to a handwritten
type-specific version.

The setting

Generic programming is based on representing datatypes in a uniform way
using a small set of representation types. Functions defined on those
representation types can then be applied to all datatypes, because we can
convert between datatypes and their representations.
However, generic functions tend to be slower than their specialised
counterparts, because they have to deal with the conversions. But clever
inlining (together with other compiler optimisations) can completely remove
this overhead. The problem I'm tackling is how to tell GHC exactly what it
should in the particular case of optimisation of generic code.

Simplified example

I'll focus on the problem of optimising a non-trivial function for generic
enumeration of terms. My experience shows that GHC does quite good at
optimising simple functions, especially consumers (like generic equality).
But producers are trickier.

First, we'll need some auxiliary functions:

-- | Interleave elements from two lists. Similar to (++), but swap left and
> -- right arguments on every recursive application.
> --
> -- From Mark Jones' talk at AFP2008
> {-# NOINLINE (|||) #-}
> (|||) :: [a] -> [a] -> [a]
> []     ||| ys = ys
> (x:xs) ||| ys = x : ys ||| xs
>
>
> -- | Diagonalization of nested lists. Ensure that some elements from every
> -- sublist will be included. Handles infinite sublists.
> --
> -- From Mark Jones' talk at AFP2008
> {-# NOINLINE diag #-}
> diag :: [[a]] -> [a]
> diag = concat . foldr skew [] . map (map (\x -> [x]))
>
> skew :: [[a]] -> [[a]] -> [[a]]
> skew []     ys = ys
> skew (x:xs) ys = x : combine (++) xs ys
>
> combine :: (a -> a -> a) -> [a] -> [a] -> [a]
> combine _ xs     []     = xs
> combine _ []     ys     = ys
> combine f (x:xs) (y:ys) = f x y : combine f xs ys
>

The particular implementation of these functions doesn't really matter.
What's important is that we have a way to interleave lists (|||) and a way
to diagonalise a matrix into a list (diag). We mark these functions as
NOINLINE because inlining them will only make the core code more
complicated (and may prevent rules from firing).

Suppose we have a type of Peano natural numbers:

data Nat = Ze | Su Nat deriving Eq
>

Implementing enumeration on this type is simple:

enumNat :: [Nat]
> enumNat = [Ze] ||| map Su enumNat
>

Now, a generic representation of Nat in terms of sums and products could
look something like this:

type RepNat = Either () Nat
>

That is, either a singleton (for the Ze case) or a Nat (for the Su case).
Note that I am building a shallow representation, since at the leaves we
have Nat, and not RepNat. This mimics the situation with current generic
programming libraries (in particular GHC.Generics).

We'll need a way to convert between RepNat and Nat:

toNat :: RepNat -> Nat
> toNat (Left ()) = Ze
> toNat (Right n) = Su n
>
> fromNat :: Nat -> RepNat
> fromNat Ze = Left ()
> fromNat (Su n) = Right n
>

(In fact, since we're only dealing with a generic producer we won't need
the fromNat function.)

To get an enumeration for RepNat, we first need to know how to enumerate
units and sums:

enumU :: [()]
> enumU = [()]
>
> enumEither :: [a] -> [b] -> [Either a b]
> enumEither ea eb = map Left ea ||| map Right eb
>

Now we can define an enumeration for RepNat:

enumRepNat :: [RepNat]
> enumRepNat = enumEither enumU enumNatFromRep
>

With the conversion function toNat, we can use enumRepNat to get an
enumeration for Nat directly:

enumNatFromRep :: [Nat]
> enumNatFromRep = map toNat enumRepNat
>

First, convince yourself that enumNatFromRep and enumNat are equivalent
functions:

take 100 enumNat == take 100 enumNatFromRep
>

Now, what I want is that enumNatFromRep generates the same core code as
enumNat. That should be possible; here are the necessary steps:

  map toNat enumRepNat
>
> == { inline enumRepNat }
>
>   map toNat (enumEither enumU enumNatFromRep)
>
> == { inline enumEither }
>
>   map toNat (map Left enumU ||| map Right enumNatFromRep)
>
> == { inline enumU }
>
>   map toNat (map Left [()] ||| map Right enumNatFromRep)
>
> == { inline map }
>
>   map toNat ([Left ()] ||| map Right enumNatFromRep)
>
> == { free theorem (|||): forall f a b. map f (a ||| b) = map f a ||| map f
> b }
>
>   map toNat [Left ()] ||| map toNat (map Right enumNatFromRep)
>
> == { inline map }
>
>   [toNat (Left ())] ||| map toNat (map Right enumNatFromRep)
>
> == { definition of toNat (or inline toNat + case of constant) }
>
>   [Ze] ||| map toNat (map Right enumNatFromRep)
>
> == { functor composition law: forall f g l. map f (map g l) = map (f . g)
> l }
>
>   [Ze] ||| map (toNat . Right) enumNatFromRep
>
> == { definition of toNat (or inline toNat + case of constant) }
>
>   [Ze] ||| map Su enumNatFromRep
>

Now let's see what the compiler generates. I'm using GHC-7.4.1. Let's
compile with -O1 and use -ddump-simpl to see the final simplifier output
(core code) for enumNatFromRep:

EnumAlone.enumNatFromRep :: [EnumAlone.Nat]
> [GblId,
>  Str=DmdType,
>  Unf=Unf{Src=<vanilla>, TopLvl=True, Arity=0, Value=False,
>          ConLike=False, Cheap=False, Expandable=False,
>          Guidance=IF_ARGS [] 30 0}]
> EnumAlone.enumNatFromRep =
>   GHC.Base.map
>     @ EnumAlone.RepNat
>     @ EnumAlone.Nat
>     EnumAlone.toNat
>     EnumAlone.enumRepNat
>
> EnumAlone.enumRepNat [Occ=LoopBreaker] :: [EnumAlone.RepNat]
> [GblId, Str=DmdType]
> EnumAlone.enumRepNat =
>   EnumAlone.|||
>     @ (Data.Either.Either () EnumAlone.Nat) lvl4_rvV lvl5_rvW
>

Ah, it didn't even inline enumRepNat because it made it a loop breaker. We
certainly want to inline it, so let's add a pragma:

{-# INLINE enumRepNat #-}
>

Recompiling, we get:

EnumAlone.enumRepNat [InlPrag=INLINE (sat-args=0)]
>   :: [EnumAlone.RepNat]
> [GblId,
>  Str=DmdType,
>  Unf=Unf{Src=InlineStable, TopLvl=True, Arity=0, Value=False,
>          ConLike=False, Cheap=False, Expandable=False,
>          Guidance=ALWAYS_IF(unsat_ok=False,boring_ok=False)
>          Tmpl= EnumAlone.enumEither
>                  @ () @ EnumAlone.Nat EnumAlone.enumU
> EnumAlone.enumNatFromRep}]
> EnumAlone.enumRepNat =
>   EnumAlone.|||
>     @ (Data.Either.Either () EnumAlone.Nat) lvl4_rvV lvl5_rvW
>
> EnumAlone.enumNatFromRep [Occ=LoopBreaker] :: [EnumAlone.Nat]
> [GblId, Str=DmdType]
> EnumAlone.enumNatFromRep =
>   GHC.Base.map
>     @ EnumAlone.RepNat
>     @ EnumAlone.Nat
>     EnumAlone.toNat
>     EnumAlone.enumRepNat
>

So no real difference in the generated code, other than the reassignment of
loop breakers. For some reason enumRepNat still doesn't get inlined.

*Question: why won't GHC inline enumRepNat, even when I tell it to do so
with an INLINE pragma?*

Well, let's inline it ourselves, then. We redefine enumNatFromRep to:

enumNatFromRep = map toNat (enumEither enumU enumNatFromRep)
>

This however doesn't help much. We get the following core:

lvl6_rw5 :: [Data.Either.Either () EnumAlone.Nat]
> [GblId]
> lvl6_rw5 =
>   EnumAlone.|||
>     @ (Data.Either.Either () EnumAlone.Nat) lvl4_rw3 lvl5_rw4
>
> EnumAlone.enumNatFromRep [Occ=LoopBreaker] :: [EnumAlone.Nat]
> [GblId, Str=DmdType]
> EnumAlone.enumNatFromRep =
>   GHC.Base.map
>     @ EnumAlone.RepNat @ EnumAlone.Nat EnumAlone.toNat lvl6_rw5
>

GHC is really keen on floating that (|||) out. Let's be very explicit about
inlining:

{-# INLINE toNat #-}
> {-# INLINE enumU #-}
> {-# INLINE enumEither #-}
>

Also, maybe it's floating it out because it doesn't have anything else to
do to it. Let's add the free theorem of (|||) as a rule:

{-# RULES "ft |||" forall f a b. map f (a ||| b) = map f a ||| map f b #-}
>

We needed this in our manual derivation, so GHC should need it too.
Recompiling, we see we've made some progress:

lvl5_ryv :: [Data.Either.Either () EnumAlone.Nat]
> [GblId]
> lvl5_ryv =
>   GHC.Base.map
>     @ EnumAlone.Nat
>     @ (Data.Either.Either () EnumAlone.Nat)
>     (Data.Either.Right @ () @ EnumAlone.Nat)
>     EnumAlone.enumNatFromRep
>
> lvl6_ryw :: [EnumAlone.Nat]
> [GblId]
> lvl6_ryw =
>   GHC.Base.map
>     @ (Data.Either.Either () EnumAlone.Nat)
>     @ EnumAlone.Nat
>     EnumAlone.toNat
>     lvl5_ryv
>
> EnumAlone.enumNatFromRep [Occ=LoopBreaker] :: [EnumAlone.Nat]
> [GblId, Str=DmdType]
> EnumAlone.enumNatFromRep =
>   EnumAlone.||| @ EnumAlone.Nat lvl4_ryu lvl6_ryw
>

enumNatFromRep finally starts with (|||) directly. But its second argument,
lvl6_ryw, is a map of lvl5_ryv, which is itself a map! At this stage I
expected GHC to be aware of the fusion law for map, but it seems that it
isn't.

*Question: why is map fusion not happening automatically?*

Let's add it as a rule:

{-# RULES "map/map1" forall f g l. map f (map g l) = map (f . g) l #-}
>

And now we're in a much better situation:

lvl3_ryD :: [Data.Either.Either () EnumAlone.Nat]
> [GblId, Caf=NoCafRefs]
> lvl3_ryD =
>   GHC.Types.:
>     @ (Data.Either.Either () EnumAlone.Nat)
>     EnumAlone.fromNat1
>     (GHC.Types.[] @ (Data.Either.Either () EnumAlone.Nat))
>
> lvl4_ryE :: [EnumAlone.Nat]
> [GblId]
> lvl4_ryE =
>   GHC.Base.map
>     @ (Data.Either.Either () EnumAlone.Nat)
>     @ EnumAlone.Nat
>     EnumAlone.toNat
>     lvl3_ryD
>
> lvl5_ryF :: [EnumAlone.Nat]
> [GblId]
> lvl5_ryF =
>   GHC.Base.map
>     @ EnumAlone.Nat
>     @ EnumAlone.Nat
>     EnumAlone.Su
>     EnumAlone.enumNatFromRep
>
> EnumAlone.enumNatFromRep [Occ=LoopBreaker] :: [EnumAlone.Nat]
> [GblId, Str=DmdType]
> EnumAlone.enumNatFromRep =
>   EnumAlone.||| @ EnumAlone.Nat lvl4_ryE lvl5_ryF
>

Note how toNat is entirely gone from the second part of the enumeration
(lvl5_ryF). Strangely enough, the enumerator for Ze (lvl4_ryE) is still
very complicated: map toNat ([Left ()]). Why doesn't GHC simplify this to
just [Ze]? Apparently because GHC doesn't simplify map over a single
element list.

*Question: why doesn't GHC optimise map f [x] to [f x]?*

Let's tell it to do so:

{-# RULES "map/map2" forall f x. map f (x:[]) = (f x):[] #-}
>

Now we're finally where we wanted:

lvl_ryA :: [EnumAlone.Nat]
> [GblId, Caf=NoCafRefs]
> lvl_ryA =
>   GHC.Types.:
>     @ EnumAlone.Nat EnumAlone.Ze (GHC.Types.[] @ EnumAlone.Nat)
>
> lvl3_ryD :: [EnumAlone.Nat]
> [GblId]
> lvl3_ryD =
>   GHC.Base.map
>     @ EnumAlone.Nat
>     @ EnumAlone.Nat
>     EnumAlone.Su
>     EnumAlone.enumNatFromRep
>
> EnumAlone.enumNatFromRep [Occ=LoopBreaker] :: [EnumAlone.Nat]
> [GblId, Str=DmdType]
> EnumAlone.enumNatFromRep =
>   EnumAlone.||| @ EnumAlone.Nat lvl_ryA lvl3_ryD
>

This is what I wanted: no more representation types (Either or ()), and the
code looks exactly like what is generated for the handwritten enumNat.

More realistic generic programming

Now let's see if we can transport this to the setting of a generic
programming library. I'll use a bare-bones version of GHC.Generics:

infixr 5 :+:
> infixr 6 :*:
>
> data U          = U              deriving (Show, Read)
> data a :+: b    = L a | R b      deriving (Show, Read)
> data a :*: b    = a :*: b        deriving (Show, Read)
> newtype Var a   = Var a          deriving (Show, Read)
> newtype Rec a   = Rec a          deriving (Show, Read)
>
> class Representable a where
>   type Rep a
>   to   :: Rep a -> a
>   from :: a -> Rep a
>

Let's represent Nat in this library:

instance Representable Nat where
>   type Rep Nat = U :+: (Rec Nat)
>   from Ze            = L U
>   from (Su n)        = R (Rec n)
>   to (L U)       = Ze
>   to (R (Rec n)) = Su n
>

(Note, in particular, that we do not need INLINE pragmas on the from/to
methods. This might just be because GHC thinks these are small and inlines
them anyway, but in general we want to make sure they are inlined, so we
typically use pragmas there.)

Now we need to implement enumeration generically. We do this by giving an
instance for each representation type:

class GEnum' a where
>   genum' :: [a]
>
> instance GEnum' U where
>   {-# INLINE genum' #-}
>   genum' = [U]
>
> instance (GEnum a) => GEnum' (Rec a) where
>   {-# INLINE genum' #-}
>   genum' = map Rec genum
>
> instance (GEnum a) => GEnum' (Var a) where
>   {-# INLINE genum' #-}
>   genum' = map Var genum
>
> instance (GEnum' f, GEnum' g) => GEnum' (f :+: g) where
>   {-# INLINE genum' #-}
>   genum' = map L genum' ||| map R genum'
>
> instance (GEnum' f, GEnum' g) => GEnum' (f :*: g) where
>   {-# INLINE genum' #-}
>   --genum' = diag [ [ x :*: y | y <- genum' ] | x <- genum' ]
>   genum' = diag (map (\x -> map (\y -> x :*: y) genum') genum')
>

We explicitly tell GHC to inline each case, as before. Note that for
products I'm not using the more natural list comprehension syntax because I
don't quite understand how that gets translated into core.

In the cases for Var and Rec we use genum from the GEnum class:

class GEnum a where
>   genum :: [a]
>   {-# INLINE genum #-}
>   default genum :: (Representable a, GEnum' (Rep a)) => [a]
>   genum = map to genum'
>

GEnum' is the class used for instantiating the generic representation
types, and GEnum is used for user types. We use a default signature to
provide a default method that can be used when we have a Representable
instance for the type in question. This makes instantiating Nat very easy:

instance GEnum Nat
>

Unfortunately, the core code generated in this situation (with the same
RULES as before) is not nice at all:

Main.$fGEnumNat_$cgenum [Occ=LoopBreaker] :: [Base.Nat]
> [GblId, Str=DmdType]
> Main.$fGEnumNat_$cgenum =
>   GHC.Base.map
>     @ (Base.Rep Base.Nat)
>     @ Base.Nat
>     Base.$fRepresentableNat_$cto
>     (lvl37_r79y
>      `cast` (Sym (GEnum.NTCo:GEnum') <Base.U Base.:+: (Base.Rec Base.Nat)>
> ;
>                                        (GEnum.GEnum' (Sym
> (Base.TFCo:R:RepNat)) ;
>                                          GEnum.NTCo:GEnum' <Base.Rep
> Base.Nat>)
>              :: [Base.C Base.Nat_Ze_ Base.U
>                  Base.:+: Base.C Base.Nat_Su_ (Base.Rec Base.Nat)]
>                   ~#
>                 [Base.Rep Base.Nat]))
>

We see a map of the `to` function, which is definitely not what we want.
Oddly enough, if we give an explicit definition of genum for Nat, with the
inlined default...

instance GEnum Nat where genum = map to genum'
>

... then we get the optimised code we want:

lvl34_r79p :: [Base.Nat]
> [GblId, Caf=NoCafRefs]
> lvl34_r79p =
>   GHC.Types.: @ Base.Nat Base.Ze (GHC.Types.[] @ Base.Nat)
>
> lvl35_r79q :: [Base.Nat]
> [GblId]
> lvl35_r79q =
>   GHC.Base.map @ Base.Nat @ Base.Nat Base.Su Main.$fGEnumNat_$cgenum
>
> Main.$fGEnumNat_$cgenum [Occ=LoopBreaker] :: [Base.Nat]
> [GblId, Str=DmdType]
> Main.$fGEnumNat_$cgenum =
>   GEnum.||| @ Base.Nat lvl34_r79p lvl35_r79q
>

Again, no representation types, no `to`, just the same code that is
generated for enumNat. Perfect. But we had to avoid using the default
definition, which is a pity.

*Question: why won't GHC inline the default method of a class, even when I
have a pragma telling it to do so?*

Let's look at one more datatype, because Nat does not use products. So
let's consider trees:

data Tree a = Leaf | Bin a (Tree a) (Tree a)
>
> instance Representable (Tree a) where
>   type Rep (Tree a) = U :+: (Var a :*: Rec (Tree a) :*: Rec (Tree a))
>   from (Bin x l r) = R (Var x :*: Rec l :*: Rec r)
>   from Leaf        = L U
>   to (R (Var x :*: (Rec l) :*: (Rec r))) = Bin x l r
>   to (L U)                               = Leaf
>

We give a GEnum instance using the same trick as before:

instance GEnum (Tree Int) where genum = map to genum'
>

(For simplicity only for trees of integers.) The generated code for trees
is unfortunately not as nice:

a2_r79M
>   :: [Base.Rec (Base.Tree GHC.Types.Int)
>       Base.:*: Base.Rec (Base.Tree GHC.Types.Int)]
> [GblId, Str=DmdType]
> a2_r79M =
>   GEnum.diag
>     @ (Base.Rec (Base.Tree GHC.Types.Int)
>        Base.:*: Base.Rec (Base.Tree GHC.Types.Int))
>     lvl8_r79L
>
> lvl9_r79N :: [Base.Tree GHC.Types.Int]
> [GblId]
> lvl9_r79N =
>   GHC.Base.map
>     @ (Base.Rec (Base.Tree GHC.Types.Int)
>        Base.:*: Base.Rec (Base.Tree GHC.Types.Int))
>     @ (Base.Tree GHC.Types.Int)
>     lvl5_r79H
>     a2_r79M
>
> Main.$fGEnumTree_$cgenum [Occ=LoopBreaker]
>   :: [Base.Tree GHC.Types.Int]
> [GblId, Str=DmdType]
> Main.$fGEnumTree_$cgenum =
>   GEnum.||| @ (Base.Tree GHC.Types.Int) lvl4_r79G lvl9_r79N
>

Note how lvl9_r79N is a map over a2_r79M, and a2_r79M is a `diag`. Ok, we
need the free theorem of `diag` to tell GHC how to commute the `diag` with
map:

{-# RULES "ft/diag" forall f l. map f (diag l) = diag (map (map f) l) #-}
>

Unfortunately this doesn't change the generated core code. With some more
debugging looking at the generated code at each simplifier iteration, I
believe that this is because a2_r79M got lifted out too soon, prevent the
rule from applying. With some imagination I decided to try the
-fno-full-laziness flag to prevent let-floating. I'm not sure this is a
good idea in general, but in this particular case it gives much better
results:

a1_r72i :: [Base.Rec (Base.Tree GHC.Types.Int)]
> [GblId, Str=DmdType]
> a1_r72i =
>   GHC.Base.map
>     @ (Base.Tree GHC.Types.Int)
>     @ (Base.Rec (Base.Tree GHC.Types.Int))
>     ((\ (tpl_B1 :: Base.Tree GHC.Types.Int) -> tpl_B1)
>      `cast` (<Base.Tree GHC.Types.Int>
>              -> Sym (Base.NTCo:Rec <Base.Tree GHC.Types.Int>)
>              :: (Base.Tree GHC.Types.Int -> Base.Tree GHC.Types.Int)
>                   ~#
>                 (Base.Tree GHC.Types.Int -> Base.Rec (Base.Tree
> GHC.Types.Int))))
>     Main.$fGEnumTree_$cgenum
>
> Main.$fGEnumTree_$cgenum [Occ=LoopBreaker]
>   :: [Base.Tree GHC.Types.Int]
> [GblId, Str=DmdType]
> Main.$fGEnumTree_$cgenum =
>   GEnum.|||
>     @ (Base.Tree GHC.Types.Int)
>     (GHC.Types.:
>        @ (Base.Tree GHC.Types.Int)
>        (Base.Leaf @ GHC.Types.Int)
>        (GHC.Types.[] @ (Base.Tree GHC.Types.Int)))
>     (GEnum.diag
>        @ (Base.Tree GHC.Types.Int)
>        (GHC.Base.map
>           @ (Base.Rec (Base.Tree GHC.Types.Int))
>           @ [Base.Tree GHC.Types.Int]
>           (\ (x_a1yQ :: Base.Rec (Base.Tree GHC.Types.Int)) ->
>              GHC.Base.map
>                @ (Base.Rec (Base.Tree GHC.Types.Int))
>                @ (Base.Tree GHC.Types.Int)
>                (\ (x1_X1zB :: Base.Rec (Base.Tree GHC.Types.Int)) ->
>                   Base.Bin
>                     @ GHC.Types.Int
>                     a_r72h
>                     (x_a1yQ
>                      `cast` (Base.NTCo:Rec <Base.Tree GHC.Types.Int>
>                              :: Base.Rec (Base.Tree GHC.Types.Int) ~#
> Base.Tree GHC.Types.Int))
>                     (x1_X1zB
>                      `cast` (Base.NTCo:Rec <Base.Tree GHC.Types.Int>
>                              :: Base.Rec (Base.Tree GHC.Types.Int) ~#
> Base.Tree GHC.Types.Int)))
>                a1_r72i)
>           a1_r72i))
>

Note how our enum is now of the shape `[Leaf] ||| diag y`, which is good.
The only catch is that there are some `Rec`s still laying around, with
their associated newtype coercions, and a function a1_r72i that basically
wraps the recursive enumeration in a Rec, only to be unwrapped in the body
of `$fGEnumTree_$cgenum`. I don't know how to get GHC to simplify this code
any further.

*Question: why do I need -fno-full-laziness for the ft/diag rule to apply?*

*Question: why is GHC not getting rid of the Rec newtype in this case?*

I have also played with -O2, in particular because of the SpecConstr
optimisation, but found that it does not affect these particular examples
(perhaps it only becomes important with larger datatypes). I have also
experimented with phase control in the rewrite rules and the inline
pragmas, but didn't find it necessary for this example. In general, anyway,
my experience with the inliner is that it is extremely fragile, especially
across different GHC versions, and it's hard to get any guarantees of
optimisation. I have also played with the -funfolding-* options before,
with mixed results. [1] It's also a pity that certain flags are not
explained in detail in the user's manual [2,3], like -fliberate-case, and
-fspec-constr-count and threshold, for instance.

Thank you for reading this. Any insights are welcome. In particular, I'm
wondering if I might be missing some details regarding strictness.

Cheers,
Pedro

[1] http://dreixel.net/research/pdf/ogie.pdf
[2]
http://www.haskell.org/ghc/docs/latest/html/users_guide/flag-reference.html
[3]
http://www.haskell.org/ghc/docs/latest/html/users_guide/options-optimise.html#options-f
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.haskell.org/pipermail/generics/attachments/20120222/ea5d0c3f/attachment-0001.htm>