optimization and rewrite rules questions

Claus Reinke claus.reinke at talk21.com
Wed Feb 25 19:01:11 EST 2009


Okay, I've found a combination of incantations that happens to
work, but only for this particular example. So this does not solve
the original questions, and I'm still interested in suggestions. But it 
does give a concrete example of what I'd like to be able to do (or
better, what GHC should be doing, and what GCC apparently does),
in the hope that this helps the discussion along.

This message is a bit lengthy, as I try to summarize the approach
for those here who didn't follow the earlier haskell-cafe thread
(so this should be fairly self-contained) and include timings.

1. Old things first, the TH macro for syntactic replication of calls
    to 'f', each call getting its own counter 'i' and threading the accumulator 'x':

------------------------------------------------------------------
{-# LANGUAGE TemplateHaskell #-}
module Apply where
import Language.Haskell.TH.Syntax
-- After splicing and beta-reduction, $(apply i bound) behaves like
--   \f x -> f (bound-1) (... (f (i+1) (f i x)) ...)
-- i.e. (bound-i) syntactic calls to 'f', threading the accumulator 'x'.
apply :: Int -> Int -> Q Exp
apply i bound | i<bound   = [| \f x -> $(apply (i+1) bound) f (f i x) |]
              | otherwise = [| \f x -> x |]
------------------------------------------------------------------
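
As a quick sanity check (this test module is my own addition, not part
of the original setup or the timings below; a separate module is needed
because of the TH stage restriction), the splice for a small bound can
be tested directly:

------------------------------------------------------------------
{-# LANGUAGE TemplateHaskell #-}
module ApplyTest where
import Apply

-- $(apply 0 3) beta-reduces to \f x -> f 2 (f 1 (f 0 x)),
-- so summing the counters 0..2 starting from 0 should give 3.
test3 :: Int
test3 = $(apply (0::Int) 3) (\i acc -> i + acc) 0   -- expected: 3
------------------------------------------------------------------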

2. Next, we use this to define an unrolling 'loop' combinator on
    'Int', which counts from 'i' to 'max', applying 'body' to the count 'i' 
    and an accumulator 'acc'.

    Thanks to 'apply' and TH, we can parameterize this code over
    the number of unrollings 'N' (N=8 here), without having to write
    out the syntactic unrolling ourselves. This is the only use of TH
    here: unlike earlier approaches, we do not implement loop unrolling
    by any TH-based syntactic analysis; we just syntactically generate
    a function that does the unrolling semantically.

    'loop' is marked INLINE and written in worker/wrapper form, so
    that the wrapper is inlined at the call site, bringing the 'body'
    parameter into scope for the worker and allowing GHC to apply
    further simplifications to the unfolded 'body' applications.

------------------------------------------------------------------
{-# LANGUAGE CPP #-}
{-# LANGUAGE TemplateHaskell #-}
{-# LANGUAGE BangPatterns #-}
{-# OPTIONS_GHC -DN=8 -ddump-splices #-}
module Loop(loop) where
import Apply

{-# INLINE loop #-}
loop :: Int -> Int -> (Int -> Int -> Int) -> Int -> Int
loop i max body acc = loopW i acc
  where
  -- worker: the first clause does N iterations per recursive call
  -- (unrolled via the 'apply' splice), the second mops up the remainder
  loopW :: Int -> Int -> Int
  loopW !i !acc | i+N<=max  = loopW (i+N) ($(apply (0::Int) N) (\j acc->body (i+j) acc) acc)
  loopW !i !acc | i<=max    = loopW (i+1) (body i acc)
                | otherwise = acc
------------------------------------------------------------------
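
For reference (my own addition, not part of the measured code):
ignoring unrolling and strictness, 'loop i max body acc' computes the
same result as a plain left fold over the index range, so a non-unrolled
specification could be written as:

------------------------------------------------------------------
import Data.List (foldl')

-- hypothetical reference implementation, useful for testing 'loop'
loopRef :: Int -> Int -> (Int -> Int -> Int) -> Int -> Int
loopRef i max body acc = foldl' (flip body) acc [i..max]
------------------------------------------------------------------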

3. Finally, the main program, using 'loop' to calculate a simple sum:

------------------------------------------------------------------
{-# LANGUAGE MagicHash #-}
module Main where
import Loop
import GHC.Prim

{-# RULES 
"re-assoc" [~1] forall v x y z. (x+#y)+#(z+#v) = ((x+#y)+#z)+#v
"collect1" [1]  forall x y.     (x+#y)+#x      = (2#*#x)+#y
"collect2" [1]  forall n x y.   ((n*#x)+#y)+#x = ((n+#1#)*#x)+#y
"collect3" [1]  forall n x y z. ((n*#x)+#y)+#z = (n*#x)+#(y+#z)
  #-}

main = print $ loop 1 (10^9) body 0

body :: Int -> Int -> Int
body i acc = i+acc
------------------------------------------------------------------

4. If we ignore the RULES, we note the following:
    - GHC nicely unrolls the 'loop' and simplifies both the calls to
        'body' and the overhead introduced by our use of 'apply' out of
        the way, which gives a first improvement in speed, but
        -fasm is noticeably slower than -fvia-C, so GCC does further
        optimizations;
    - likely candidates are arithmetic laws (associativity/commutativity
        of '+'), which enable constant folding;
    - of the two purposes of loop unrolling, (a) less admin overhead
        and more straightforward code, and (b) enabling further optimizations
        (just like any other unfolding/inlining, but on recursive code),
        GHC -fasm, even with help from TH, only achieves the first.
        
The interesting bit of core is the one corresponding to the call to
'apply' (for N=8) in the definition of 'loopW' ('ww1_s18W' is the
counter variable, 'ww2_s190' is the accumulator):

            $wloopW_s19d
              (GHC.Prim.+# ww1_s18W 8)
              (GHC.Prim.+#
                 (GHC.Prim.+# ww1_s18W 7)
                 (GHC.Prim.+#
                    (GHC.Prim.+# ww1_s18W 6)
                    (GHC.Prim.+#
                       (GHC.Prim.+# ww1_s18W 5)
                       (GHC.Prim.+#
                          (GHC.Prim.+# ww1_s18W 4)
                          (GHC.Prim.+#
                             (GHC.Prim.+# ww1_s18W 3)
                             (GHC.Prim.+#
                                (GHC.Prim.+# ww1_s18W 2)
                                (GHC.Prim.+#
                                   (GHC.Prim.+# ww1_s18W 1)
                                   (GHC.Prim.+# (GHC.Prim.+# ww1_s18W 0) ww2_s190))))))))

Note the various constants, interspersed with copies of the
counter variable. If we could sort those into two separate
groups, we could fold the constants (0+1+..+7 = 28) and collapse
the repeated additions of the counter variable into a single
multiplication (8*ww1_s18W). That is the bit I don't know how
to do in a general fashion.

But here's the non-general hack, using RULES:

I   're-assoc' reassociates the '+#' applications from a tree into
    a simple linear chain;
II  the 'collectN' rules perform the grouping into counter variables and
    constants (by dead reckoning rather than proper analysis :-():
    'collect1' and 'collect2' collect the counter variables,
    'collect3' collects the constants;
III phase control (rather unintuitive, imho) is used to keep
    the 're-assoc' and 'collectN' rules from interfering with each
    other, and rule order is used to limit the applicability of 'collect3'
    to expressions which do not match 'collect2''s left-hand side
    (I think?).

II is where I'd like to be able to distinguish variables, constants,
and complex expressions in the left-hand sides of RULES, and
I and III are where I'd like control over the rewrite strategy, as 
in strategy combinators.
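
To make the intended rewrite sequence concrete, here is a hand trace
for N=2 (my own reconstruction, written with boxed 'Int' so it compiles
as ordinary Haskell; GHC's actual firing order on the unboxed core may
differ):

------------------------------------------------------------------
step0, step1, step2, step3, step4 :: Int -> Int -> Int
step0 i acc = (i + 1) + ((i + 0) + acc)   -- as generated by the splice
step1 i acc = (((i + 1) + i) + 0) + acc   -- after "re-assoc" (applied twice)
step2 i acc = (((2 * i) + 1) + 0) + acc   -- after "collect1"
step3 i acc = ((2 * i) + (1 + 0)) + acc   -- after "collect3"
step4 i acc = (2 * i) + (1 + acc)         -- after constant folding and "collect3"
------------------------------------------------------------------

For N=8, the same process should produce the '(8 *# i) +# (28 +# acc)'
shape shown below.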

The resulting core, for the same position as before ('ww1_s193'
is the counter variable, 'ww2_s197' is the accumulator):

            $wloopW_s19k
              (GHC.Prim.+# ww1_s193 8)
              (GHC.Prim.+# (GHC.Prim.*# 8 ww1_s193) (GHC.Prim.+# 28 ww2_s197))

Nice. But I'd like to get there without cheating, with generally
applicable RULES. How?

Claus
=================================================
PS. Some numbers for general entertainment (times are best of 5,
    GHC is ghc-6.11.20090118, GCC is 3.4.2 (mingw-special)):

------------------------------------------------------------------
$ ghc --make -O2 -fasm Main.hs -ddump-simpl -fno-enable-rewrite-rules >loop.simpl
[3 of 3] Compiling Main             ( Main.hs, Main.o )
Linking Main.exe ...

$ time ./Main.exe
-243309312

real    0m1.080s
user    0m0.015s
sys     0m0.031s
------------------------------------------------------------------
$ ghc --make -O2 -fasm Main.hs  -ddump-simpl >loop.simpl
[3 of 3] Compiling Main             ( Main.hs, Main.o )
Linking Main.exe ...

$ time ./Main.exe
-243309312

real    0m0.569s
user    0m0.015s
sys     0m0.031s
------------------------------------------------------------------
$ ghc --make -O2 -fvia-C Main.hs -ddump-simpl -fno-enable-rewrite-rules >loop.simpl
[3 of 3] Compiling Main             ( Main.hs, Main.o )
Linking Main.exe ...

$ time ./Main.exe
-243309312

real    0m0.641s
user    0m0.031s
sys     0m0.015s
------------------------------------------------------------------

Oh, and while '-fvia-C' does seem to pass '-O2' to GCC, that does
not imply '-funroll-loops':

    The compiler does not perform loop unrolling or function inlining 
    when you specify -O2.

If we repeat the experiment with the unrolling branch in 'loopW'
commented out, we don't seem to get unrolling from GCC either
(is this a case of GHC-generated code not being palatable to GCC,
or am I doing something wrong?), but -fasm is still slower than -fvia-C:

------------------------------------------------------------------
$ ghc --make -O2 -fvia-C Main.hs -ddump-simpl >loop.simpl
[1 of 3] Compiling Apply            ( Apply.hs, Apply.o )
[2 of 3] Compiling Loop             ( Loop.hs, Loop.o )
[3 of 3] Compiling Main             ( Main.hs, Main.o )
Linking Main.exe ...

$ time ./Main.exe
-243309312

real    0m3.437s
user    0m0.015s
sys     0m0.015s
------------------------------------------------------------------
$ ghc --make -O2 -fvia-C -optc -funroll-loops Main.hs -ddump-simpl >loop.simpl
[1 of 3] Compiling Apply            ( Apply.hs, Apply.o )
[2 of 3] Compiling Loop             ( Loop.hs, Loop.o )
[3 of 3] Compiling Main             ( Main.hs, Main.o )
Linking Main.exe ...

$ time ./Main.exe
-243309312

real    0m3.428s
user    0m0.015s
sys     0m0.015s
------------------------------------------------------------------
$ ghc --make -O2 -fasm Main.hs  -ddump-simpl >loop.simpl
[1 of 3] Compiling Apply            ( Apply.hs, Apply.o )
[2 of 3] Compiling Loop             ( Loop.hs, Loop.o )
[3 of 3] Compiling Main             ( Main.hs, Main.o )
Linking Main.exe ...

$ time ./Main.exe
-243309312

real    0m4.065s
user    0m0.015s
sys     0m0.015s
------------------------------------------------------------------

So, unrolling gives the main benefit (4.1s -> 1.1s), but the arithmetic
optimizations are still profitable (1.1s -> 0.6s). The latter are somewhat
application-specific (if the loop operated on a Map, other RULES
would apply), but would seem useful (if they could be made to work)
independently of loop unrolling. And -fvia-C still tends to be faster than
-fasm.


