optimization and rewrite rules questions
Claus Reinke
claus.reinke at talk21.com
Wed Feb 25 19:01:11 EST 2009
Okay, I've found a combination of incantations that happens to
work, but only for this particular example. So this does not solve
the original questions, and I'm still interested in suggestions. But it
does give a concrete example of what I'd like to be able to do (or
better, what GHC should be doing, and GCC apparently does),
in the hope that this helps the discussion along.
This message a bit lengthy, as I try to summarize the approach,
for those here who didn't follow the haskell-cafe thread earlier
(so this should be fairly self-contained) and include timings.
1. Old things first, the TH macro for syntactic replication of calls
to 'f', each call getting its own counter 'i', and 'x' as the first parameter:
------------------------------------------------------------------
{-# LANGUAGE TemplateHaskell #-}
module Apply where
import Language.Haskell.TH.Syntax
apply i bound | i<bound = [| \f x -> $(apply (i+1) bound) f (f i x) |]
| otherwise = [| \f x -> x |]
------------------------------------------------------------------
2. Next, we use this to define an unrolling 'loop' combinator on
'Int', which counts from 'i' to 'max', applying 'body' to the count 'i'
and an accumulator 'acc'.
Thanks to 'apply' and TH, we can parameterize this code over
the number of unrollings 'N' (N=8 here), without having to write
out the syntactic unrolling ourselves (this is the only use of TH here,
so unlike earlier approaches, we do not implement loop unrolling
by any TH-based syntactic analysis, we just syntactically generate
a function that will do the unrolling semantically).
'loop' is marked INLINE, and is in a worker/wrapper form, so
that the wrapper will be inlined at the call site, bringing the 'body'
parameter into scope for the worker, in order to allow GHC to
apply further simplifications to the unfolded 'body' applications.
------------------------------------------------------------------
{-# LANGUAGE CPP #-}
{-# LANGUAGE TemplateHaskell #-}
{-# LANGUAGE BangPatterns #-}
{-# OPTIONS_GHC -DN=8 -ddump-splices #-}
module Loop(loop) where
import Apply
{-# INLINE loop #-}
loop :: Int -> Int -> (Int -> Int -> Int) -> Int -> Int
loop i max body acc = loopW i acc
where
loopW :: Int -> Int -> Int
loopW !i !acc | i+N<=max = loopW (i+N) ($(apply (0::Int) N) (\j acc->body (i+j) acc) acc)
loopW !i !acc | i<=max = loopW (i+1) (body i acc)
| otherwise = acc
------------------------------------------------------------------
3. Finally, the main program, using 'loop' to calculate a simple sum:
------------------------------------------------------------------
{-# LANGUAGE MagicHash #-}
module Main where
import Loop
import GHC.Prim
{-# RULES
"re-assoc" [~1] forall v x y z. (x+#y)+#(z+#v) = ((x+#y)+#z)+#v
"collect1" [1] forall x y. (x+#y)+#x = (2#*#x)+#y
"collect2" [1] forall n x y. ((n*#x)+#y)+#x = ((n+#1#)*#x)+#y
"collect3" [1] forall n x y z. ((n*#x)+#y)+#z = (n*#x)+#(y+#z)
#-}
main = print $ loop 1 (10^9) body 0
body :: Int -> Int -> Int
body i acc = i+acc
------------------------------------------------------------------
4. If we ignore the RULES, we note the following:
- GHC nicely unrolls the 'loop' and simplifies the both the calls to
'body' and the overhead introduced by our use of 'apply' out of
the way, which gives some first improvements in speed, but
-fasm is noticeable slower than -fvia-C, so GCC does further
optimizations;
- a likely candidate are arithmetic laws (associativity/commutativity
of '+'), in order to enable constant folding
- of the two purposes of loop unrolling, (a) less admin overhead
and more straightforward code, (b) enable further optimizations
(just like any other unfolding/inlining, but on recursive code),
GHC -fasm, even with help from TH, only achieves the first.
The interesting bit of core is the one corresponding to the call to
'apply' (for N=8) in the definition of 'loopW' ('ww1_s18W' is the
counter variable, 'ww2_s190' is the accumulator):
$wloopW_s19d
(GHC.Prim.+# ww1_s18W 8)
(GHC.Prim.+#
(GHC.Prim.+# ww1_s18W 7)
(GHC.Prim.+#
(GHC.Prim.+# ww1_s18W 6)
(GHC.Prim.+#
(GHC.Prim.+# ww1_s18W 5)
(GHC.Prim.+#
(GHC.Prim.+# ww1_s18W 4)
(GHC.Prim.+#
(GHC.Prim.+# ww1_s18W 3)
(GHC.Prim.+#
(GHC.Prim.+# ww1_s18W 2)
(GHC.Prim.+#
(GHC.Prim.+# ww1_s18W 1)
(GHC.Prim.+# (GHC.Prim.+# ww1_s18W 0) ww2_s190))))))))
Note the various constants, interspersed with copies of the
counter variable. If we could sort those into two separate
groups, we could evaluate the sum of constants, and simplify
the repeated additions of the counter variable. Which is the
bit which I don't know how to do in general fashion.
But here's the non-general hack, using RULES:
I 'reassoc' reassociates the '+' applications from a tree to
a simple linear list
II 'collectN' perform the grouping into counter variable and
constants (by dead reckoning rather than proper analysis:-(
'collect1' and 'collect2' collect the counter variables,
'collect3' collects the constants
III phase control (rather unintuitive, imho) is used to keep
the 'reassoc' and 'collectN' rules from interfering with each
other, rule order is used to limit the applicability of 'collect3'
to expressions which do not match 'collect2''s left-hand side
(I think?)
II is where I'd like to be able to distinguish variables, constants,
and complex expressions in the left-hand sides of RULES, and
I and III are where I'd like control over the rewrite strategy, as
in strategy combinators.
The resulting core, for the same position as before ('ww1_s193'
is the counter variable, 'ww2_s197' is the accumulator):
$wloopW_s19k
(GHC.Prim.+# ww1_s193 8)
(GHC.Prim.+# (GHC.Prim.*# 8 ww1_s193) (GHC.Prim.+# 28 ww2_s197))
Nice. But I'd like to get there without cheating, with generally
applicable RULES. How?
Claus
=================================================
PS. Some numbers for general entertainment (times are best of 5,
GHC is ghc-6.11.20090118, GCC is 3.4.2 (mingw-special)):
------------------------------------------------------------------
$ ghc --make -O2 -fasm Main.hs -ddump-simpl -fno-enable-rewrite-rules >loop.simpl
[3 of 3] Compiling Main ( Main.hs, Main.o )
Linking Main.exe ...
$ time ./Main.exe
-243309312
real 0m1.080s
user 0m0.015s
sys 0m0.031s
------------------------------------------------------------------
$ ghc --make -O2 -fasm Main.hs -ddump-simpl >loop.simpl
[3 of 3] Compiling Main ( Main.hs, Main.o )
Linking Main.exe ...
$ time ./Main.exe
-243309312
real 0m0.569s
user 0m0.015s
sys 0m0.031s
------------------------------------------------------------------
$ ghc --make -O2 -fvia-C Main.hs -ddump-simpl -fno-enable-rewrite-rules >loop.simpl
[3 of 3] Compiling Main ( Main.hs, Main.o )
Linking Main.exe ...
$ time ./Main.exe
-243309312
real 0m0.641s
user 0m0.031s
sys 0m0.015s
------------------------------------------------------------------
Oh, and that '-fvia-C' does seem to pass '-O2' to GCC, but that does
not imply '-funroll-loops':
The compiler does not perform loop unrolling or function inlining
when you specify -O2.
If we repeat the experiment with the unrolling branch in 'loopW'
commented out, we don't seem to get unrolling from GCC either
(is this a case of GHC-generated code not being palatable to GCC,
or am I doing something wrong?), but -fasm is still slower than -fvia-C:
------------------------------------------------------------------
$ ghc --make -O2 -fvia-C Main.hs -ddump-simpl >loop.simpl
[1 of 3] Compiling Apply ( Apply.hs, Apply.o )
[2 of 3] Compiling Loop ( Loop.hs, Loop.o )
[3 of 3] Compiling Main ( Main.hs, Main.o )
Linking Main.exe ...
$ time ./Main.exe
-243309312
real 0m3.437s
user 0m0.015s
sys 0m0.015s
------------------------------------------------------------------
$ ghc --make -O2 -fvia-C -optc -funroll-loops Main.hs -ddump-simpl >loop.simpl
[1 of 3] Compiling Apply ( Apply.hs, Apply.o )
[2 of 3] Compiling Loop ( Loop.hs, Loop.o )
[3 of 3] Compiling Main ( Main.hs, Main.o )
Linking Main.exe ...
$ time ./Main.exe
-243309312
real 0m3.428s
user 0m0.015s
sys 0m0.015s
------------------------------------------------------------------
$ ghc --make -O2 -fasm Main.hs -ddump-simpl >loop.simpl
[1 of 3] Compiling Apply ( Apply.hs, Apply.o )
[2 of 3] Compiling Loop ( Loop.hs, Loop.o )
[3 of 3] Compiling Main ( Main.hs, Main.o )
Linking Main.exe ...
$ time ./Main.exe
-243309312
real 0m4.065s
user 0m0.015s
sys 0m0.015s
------------------------------------------------------------------
So, unrolling gives the main benefit (4.1 -> 1.1), but arithmetic
optimizations are still profitable (1.1 -> 0.6). The latter are somewhat
application-specific (if the loop operated on a Map, other RULES
would apply), but would seem useful (if they could be made to work)
independent of loop unrolling. And -fvia-C still tends to be faster than
-fasm.
More information about the Glasgow-haskell-users
mailing list