[Haskell-cafe] vector stream fusion, inlining and compilation time
stefan kersten
sk at k-hornz.de
Fri Mar 5 11:10:10 EST 2010
> This is a general problem when working with RULES-based
> optimisations. Here is an example of what happens: suppose we have
>
> foo :: Vector Int -> Vector Int
> foo xs = map (+1) xs
>
> Now, GHC will generate a nice tight loop for this but if in a
> different module, we have something like this:
>
> bar xs = foo (foo xs)
>
> then this won't fuse because (a) foo won't be inlined and (b) even if
> GHC did inline here, it would inline the nice tight loop which can't
> possibly fuse instead of the original map which can. By slapping an
> INLINE pragma on foo, you're telling GHC to (almost) always inline the
> function and to use the original definition for inlining, thus giving
> it a chance to fuse.
thanks for the insight, roman!
>> the downside after adding the INLINE pragmas is that now some of my modules take
>> _really_ long to compile (up to a couple of minutes); any ideas where i can
>> start looking to bring the compilation times down again?
>
> Alas, stream fusion (and fusion in general, I guess) requires what I
> would call whole loop compilation - you need to inline everything into
> loops. That tends to be slow. I don't know what your code looks like
> but you could try to control inlining a bit more. For instance, if you
> have something like this:
>
> foo ... = ... map f xs ...
> where
> f x = ...
>
> you could tell GHC not to inline f until fairly late in the game by adding
>
> {-# INLINE [0] f #-}
>
> to the where clause. This helps sometimes.
thanks, i'll check it out.
> I'm surprised -Odph doesn't produce faster code than -O2. In any
> case, you could try turning these flags on individually (esp.
> -fno-method-sharing and the spec-constr flags) to see how they affect
> performance and compilation times.
in the end it turned out that i had forgotten another INLINE pragma and in my
crude benchmarks -O2 and -Odph give basically the same results, -O2 being a
little faster. i hope i'll have time next week to do proper benchmarks, and i
also want to try ghc HEAD with the llvm patches.
conv_1 conv_2 conv_3
-Odph 1.004 2.715 1.096
-O2 1.000 2.710 1.097
i'm still curious, though, why my three versions of direct convolution perform
so differently (see attached file). in particular, i somehow expected conv_3 to
be the slowest and conv_2 to perform similar to conv_1. any ideas? i haven't had
a look at the core yet, mainly because i'm lacking the expertise ...
<sk>
-------------- next part --------------
import Data.Vector.Generic (Vector, (!))
import qualified Data.Vector.Generic as V
conv_1, conv_2, conv_3 :: (Num a, Vector v a) => v a -> v a -> v a
{-# INLINE conv_1 #-}
conv_1 h x = V.generate (l+m) f
where
m = V.length h - 1
l = V.length x
{-# INLINE f #-}
f n = g 0 n (max 0 (n-l+1)) (min n m)
g y n m k = if m <= k
then let y' = y + (h ! m) * (x ! (n-m))
in y' `seq` g y' n (m+1) k
else y
{-# INLINE conv_2 #-}
conv_2 h x = V.generate (l+m) f
where
l = V.length x
m = V.length h - 1
{-# INLINE f #-}
f n = let j = max 0 (n-l+1)
k = (min n m) - j + 1
in V.sum (V.zipWith (*) (V.slice j k h) (V.reverse (V.slice (n - j - k + 1) k x)))
{-# INLINE conv_3 #-}
conv_3 h x = V.generate (l+m-1) f
where
m = V.length h
l = V.length x
p = V.replicate (m-1) 0
x' = p ++ x ++ p
{-# INLINE f #-}
f i = V.sum (V.zipWith (*) (V.reverse h) (V.slice i m x'))
More information about the Haskell-Cafe
mailing list