[Haskell-cafe] vector stream fusion, inlining and compilation time

Fri Mar 5 11:10:10 EST 2010

> This is a general problem when working with RULES-based
> optimisations. Here is an example of what happens: suppose we have
> 
> foo :: Vector Int -> Vector Int
> foo xs = map (+1) xs
> 
> Now, GHC will generate a nice tight loop for this but if in a
> different module, we have something like this:
> 
> bar xs = foo (foo xs)
> 
> then this won't fuse because (a) foo won't be inlined and (b) even if
> GHC did inline here, it would inline the nice tight loop which can't
> possibly fuse instead of the original map which can. By slapping an
> INLINE pragma on foo, you're telling GHC to (almost) always inline the
> function and to use the original definition for inlining, thus giving
> it a chance to fuse.

thanks for the insight, roman!

>> the downside after adding the INLINE pragmas is that now some of my modules take
>> _really_ long to compile (up to a couple of minutes); any ideas where i can
>> start looking to bring the compilation times down again?
> 
> Alas, stream fusion (and fusion in general, I guess) requires what I
> would call whole loop compilation - you need to inline everything into
> loops. That tends to be slow. I don't know what your code looks like
> but you could try to control inlining a bit more. For instance, if you
> have something like this:
> 
> foo ... = ... map f xs ...
>   where
>     f x = ...
> 
> you could tell GHC not to inline f until fairly late in the game by adding
> 
>   {-# INLINE [0] f #-}
> 
> to the where clause. This helps sometimes.

thanks, i'll check it out.

> I'm surprised -Odph doesn't produce faster code than -O2. In any
> case, you could try turning these flags on individually (esp.
> -fno-method-sharing and the spec-constr flags) to see how they affect
> performance and compilation times.

in the end it turned out that i had forgotten another INLINE pragma and in my
crude benchmarks -O2 and -Odph give basically the same results, -O2 being a
little faster. i hope i'll have time next week to do proper benchmarks, and i
also want to try ghc HEAD with the llvm patches.

        conv_1  conv_2  conv_3
-Odph   1.004   2.715   1.096
-O2     1.000   2.710   1.097

i'm still curious, though, why my three versions of direct convolution perform
so differently (see attached file). in particular, i somehow expected conv_3 to
be the slowest and conv_2 to perform similar to conv_1. any ideas? i haven't had
a look at the core yet, mainly because i'm lacking the expertise ...

<sk>
-------------- next part --------------
import           Data.Vector.Generic (Vector, (!))
import qualified Data.Vector.Generic as V

conv_1, conv_2, conv_3 :: (Num a, Vector v a) => v a -> v a -> v a
{-# INLINE conv_1 #-}
conv_1 h x = V.generate (l+m) f
    where
        m = V.length h - 1
        l = V.length x
        {-# INLINE f #-}
        f n = g 0 n (max 0 (n-l+1)) (min n m)
        g y n m k = if m <= k
                    then let y' = y + (h ! m) * (x ! (n-m))
                         in y' `seq` g y' n (m+1) k
                    else y
{-# INLINE conv_2 #-}
conv_2 h x = V.generate (l+m) f
    where
        l = V.length x
        m = V.length h - 1
        {-# INLINE f #-}
        f n = let j = max 0 (n-l+1)
                  k = (min n m) - j + 1
              in V.sum (V.zipWith (*) (V.slice j k h) (V.reverse (V.slice (n - j - k + 1) k x)))
{-# INLINE conv_3 #-}
conv_3 h x = V.generate (l+m-1) f
    where
        m   = V.length h
        l   = V.length x
        p   = V.replicate (m-1) 0
        x'  = p ++ x ++ p
        {-# INLINE f #-}
        f i = V.sum (V.zipWith (*) (V.reverse h) (V.slice i m x'))