[Haskell-cafe] Known Unknowns

Wed Feb 1 00:00:04 EST 2006

joelkoerwer:
> 
>    Thanks Chris. I was actually asking about analyzing Core
>    output in general. I'm well aware of the problems we're
>    having with the nbody entry.
>    I'm convinced my list based version can go faster than it is
>    now. That's why I was asking if Don could put together a few
>    notes on how to optimize inner loops using -ddump-simpl and
>    the resulting Core code.

Here's a brief introduction. I intend to write up (on the performance page on
the wiki) a list of things we've done to improve the shootout entries. N.B
we're now the 3rd *fastest* language, behind C and only a little behind D (a C
varient) !!

Consider the partial sums problem: 
    wiki: http://www.haskell.org/hawiki/PartialSumsEntry
    site: http://shootout.alioth.debian.org/gp4/benchmark.php?test=partialsums&lang=ghc&id=2

What follows is a discussion of the steps I took to improve the
performance of this code.

Here's the naive translation of the Clean entry (which was fairly quick):
Lots of math in a tight loop.

> import System; import Numeric
> 
> main = do n <- getArgs >>= readIO . head
>           let sums     = loop 1 n 1 0 0 0 0 0 0 0 0 0
>               fn (s,t) = putStrLn $ (showFFloat (Just 9) s []) ++ "\t" ++ t
>           mapM_ (fn :: (Double, String) -> IO ()) (zip sums names)
> 
> names = ["(2/3)^k", "k^-0.5", "1/k(k+1)", "Flint Hills", "Cookson Hills"
>         , "Harmonic", "Riemann Zeta", "Alternating Harmonic", "Gregory"]
> 
> loop k n alt a1 a2 a3 a4 a5 a6 a7 a8 a9
>     | k > n     = [ a1, a2, a3, a4, a5, a6, a7, a8, a9 ]
>     | otherwise = loop (k+1) n (-alt)
>                        (a1 + (2/3) ** (k-1))
>                        (a2 + k ** (-0.5))
>                        (a3 + 1 / (k * (k + 1)))
>                        (a4 + 1 / (k*k*k * sin k * sin k))
>                        (a5 + 1 / (k*k*k * cos k * cos k))
>                        (a6 + 1 / k)
>                        (a7 + 1 / (k*k))
>                        (a8 + alt / k)
>                        (a9 + alt / (2 * k - 1))

Compiled with "-O2". However, the performance is _really_ bad :/ Somewhere
greater than 128M heap, in fact eventually running out of memory on my laptop.
A classic space leak.

(2) So look at the generated core. "ghc -o naive Naive.hs -O2  -ddump-simpl | less"
And we find that our loop has the following type:

>     $sloop_r2U6 :: GHC.Prim.Double#
>                    -> GHC.Float.Double
>                    -> GHC.Float.Double
>                    -> GHC.Float.Double
>                    -> GHC.Float.Double
>                    -> GHC.Float.Double
>                    -> GHC.Float.Double
>                    -> GHC.Float.Double
>                    -> GHC.Float.Double
>                    -> GHC.Float.Double
>                    -> GHC.Float.Double
>                    -> GHC.Prim.Double#
>                    -> [GHC.Float.Double]

Hmm. Ok, I certainly don't want boxed doubles in such a tight loop.

(3) My next step is to encourage GHC to unbox this loop, by providing some
strictness annotations. Now the loop looks like this:

> loop k n alt a1 a2 a3 a4 a5 a6 a7 a8 a9
>     | () !k !n !False = undefined
>     | k > n     = [ a1, a2, a3, a4, a5, a6, a7, a8, a9 ]
>     | otherwise = loop (k+1) n (-alt)
>                        (a1 + (2/3) ** (k-1))
>                        (a2 + k ** (-0.5))
>                        (a3 + 1 / (k * (k + 1)))
>                        (a4 + 1 / (k*k*k * sin k * sin k))
>                        (a5 + 1 / (k*k*k * cos k * cos k))
>                        (a6 + 1 / k)
>                        (a7 + 1 / (k*k))
>                        (a8 + alt / k)
>                        (a9 + alt / (2 * k - 1)) where x ! y = x `seq` y

I've played a little game here, using ! for `seq`, reminiscent of the new
!-pattern proposal for strictness. Let's see how this compiles. Here's the Core:

> $sloop_r2Vh :: GHC.Prim.Double#
>                -> GHC.Float.Double
>                -> GHC.Float.Double
>                -> GHC.Float.Double
>                -> GHC.Float.Double
>                -> GHC.Float.Double
>                -> GHC.Float.Double
>                -> GHC.Float.Double
>                -> GHC.Float.Double
>                -> GHC.Float.Double
>                -> GHC.Float.Double
>                -> GHC.Prim.Double#
>                -> [GHC.Float.Double]

Ok, so it unboxed one extra argument. Let's see if we can get them all unboxed.
Strictify all args, and GHC produces an inner loop of:

> $sloop_r2WS :: GHC.Prim.Double#
>                -> GHC.Prim.Double#
>                -> GHC.Prim.Double#
>                -> GHC.Prim.Double#
>                -> GHC.Prim.Double#
>                -> GHC.Prim.Double#
>                -> GHC.Prim.Double#
>                -> GHC.Prim.Double#
>                -> GHC.Prim.Double#
>                -> GHC.Prim.Double#
>                -> GHC.Prim.Double#
>                -> GHC.Prim.Double#
>                -> [GHC.Float.Double]

Ah! perfect. Let's see how that runs:
    $ ghc Naive.hs -O2 -no-recomp                   
    $ time ./a.out 2500000    
    3.000000000     (2/3)^k
    3160.817621887  k^-0.5
    0.999999600     1/k(k+1)
    30.314541510    Flint Hills
    42.995233998    Cookson Hills
    15.309017155    Harmonic
    1.644933667     Riemann Zeta
    0.693146981     Alternating Harmonic
    0.785398063     Gregory
    ./a.out 2500000  4.45s user 0.02s system 99% cpu 4.482 total

(4) Not too bad. No space leak and quite zippy. But let's see what more can be done.
First, double arithmetic usually benefits from -fexcess-precision, and cranking
up the flags to gcc:
    paprika$ ghc Naive.hs -O2 -fexcess-precision -optc-O3 -optc-ffast-math -no-recomp
    paprika$ time ./a.out 2500000                                                    
    3.000000000     (2/3)^k
    3160.817621887  k^-0.5
    0.999999600     1/k(k+1)
    30.314541510    Flint Hills
    42.995233998    Cookson Hills
    15.309017155    Harmonic
    1.644933667     Riemann Zeta
    0.693146981     Alternating Harmonic
    0.785398063     Gregory
    ./a.out 2500000  3.71s user 0.01s system 99% cpu 3.726 total

Even better. Now, let's dive into the Core to see if there are any optimisation
opportunites that GHC missed. So add -ddump-simpl and peruse the output.

(5) Looking at the Core, I see firstly that some of the common subexpressions haven't been
factored out:
        case [GHC.Float.Double] GHC.Prim./## 1.0
                      (GHC.Prim.*## (GHC.Prim.*##
                                        (GHC.Prim.*## (GHC.Prim.*## sc10_s2VS sc10_s2VS) sc10_s2VS)
                                        (GHC.Prim.sinDouble# sc10_s2VS))
                                     (GHC.Prim.sinDouble# sc10_s2VS))

Multiple calls to sin. Hmm  :/ And similar for cos, as well as k*k. Not sure
why GHC isn't removing these (SimonM?), so let's do that by hand, and the inner loop
looks like:

> loop k n alt a1 a2 a3 a4 a5 a6 a7 a8 a9
>     | () !k !n !alt !a1 !a2 !a3 !a4 !a5 !a6 !a7 !a8 !a9 !False = undefined
>     | k > n     = [ a1, a2, a3, a4, a5, a6, a7, a8, a9 ]
>     | otherwise = loop (k+1) n (-alt)
>                        (a1 + (2/3) ** (k-1))
>                        (a2 + k ** (-0.5))
>                        (a3 + 1 / (k * (k + 1)))
>                        (a4 + 1 / (k3 * sk * sk))
>                        (a5 + 1 / (k3 * ck * ck))
>                        (a6 + 1 / k)
>                        (a7 + 1 / k2)
>                        (a8 + alt / k)
>                        (a9 + alt / (2 * k - 1))
>     where sk = sin k ; ck = cos k; k2 = k * k; k3 = k2 * k; x ! y = x `seq` y

looking at the Core shows the sins are now allocated and shared:

>                   let a9_s2MI :: GHC.Prim.Double#
>                       a9_s2MI = GHC.Prim.sinDouble# sc10_s2Xa

So the common expressions are floated out, and it now runs:
    paprika$ time ./a.out 2500000                                                                         3.000000000     (2/3)^k      
    3160.817621887  k^-0.5
    0.999999600     1/k(k+1)
    30.314541510    Flint Hills
    42.995233998    Cookson Hills
    15.309017155    Harmonic
    1.644933667     Riemann Zeta
    0.693146981     Alternating Harmonic
    0.785398063     Gregory
    ./a.out 2500000  3.29s user 0.00s system 99% cpu 3.290 total

Faster. So we gained 12% by floating out those common expressions.

(6) Finally, another trick. When I checked the C entry, it used an integer for
the k parameter to the loop, and cast it to a double for the math each time
around, so perhaps we can make it an Int parameter. And secondly, the alt
parameter only has it's sign flipped each time, so perhaps we can factor out
the alt / k arg (it's either 1 / k or -1 on k), saving a division. So the final
loop looks like:

> loop i n alt a1 a2 a3 a4 a5 a6 a7 a8 a9
>     | () !i !n !alt !a1 !a2 !a3 !a4 !a5 !a6 !a7 !a8 !a9 !False = undefined
>     | k > n     = [ a1, a2, a3, a4, a5, a6, a7, a8, a9 ]
>     | otherwise = loop (i+1) n (-alt)
>                        (a1 + (2/3) ** (k-1))
>                        (a2 + k ** (-0.5))
>                        (a3 + 1 / (k * (k + 1)))
>                        (a4 + 1 / (k3 * sk * sk))
>                        (a5 + 1 / (k3 * ck * ck))
>                        (a6 + dk)
>                        (a7 + 1 / k2)
>                        (a8 + alt * dk)
>                        (a9 + alt / (2 * k - 1)) 
>     where k = fromIntegral i
>           dk = 1/k
>           sk = sin k
>           ck = cos k
>           k2 = k * k
>           k3 = k2 * k
>           x ! y = x `seq` y

Checking the generated C code shows that the same operations are generated as the C entry.
And it runs:
    $ time ./a.out 2500000
    3.000000000     (2/3)^k
    3160.817621887  k^-0.5
    0.999999600     1/k(k+1)
    30.314541510    Flint Hills
    42.995233998    Cookson Hills
    15.309017155    Harmonic
    1.644933667     Riemann Zeta
    0.693146981     Alternating Harmonic
    0.785398063     Gregory
    ./a.out 2500000  3.17s user 0.01s system 99% cpu 3.206 total

This entry in fact runs faster than the original (though not the new vectorised
entry) optimised C entry (and faster than all other languages):
    http://shootout.alioth.debian.org/gp4/benchmark.php?test=partialsums&lang=all

So, by carefully tweaking things, we first squished a space leak, and then gained another
30%.

In summary:
  * Check the Core that is generated
  * Watch out for optimisations that are missed
  * Read the generated C for the tight loops.
  * Make sure tight loops are unboxed
  * Use -fexcess-precision and -optc-ffast-math for doubles

This is roughly the process I used for the other shootout entries.

Cheers,
  Don