Loop unrolling + fusion ?

Sun Mar 1 16:17:00 EST 2009

> Yes - this is why my use of a kind of unrolling fixes concatMap for
> streams, because GHC is able to specialise the "unrolled" function
> body on a particular lambda abstraction. However, this is really a
> somewhat seperate issue than plain unrolling, as we just want to be
> able to /specialise/ recursive functions on particular arguments
> rather than reduce loop overhead / reassociate arithmetic over several
> iterations.

What is the issue with concatMap? It sounds like your specialization
is based on the recursion equivalent to loop peeling (unrolling the 
"non-recursive" calls, the entry points into the recursion), a close 
variant of loop unrolling (unrolling the recursive calls, inside the loop
body). If followed by the static argument transformation, that might 
cover the majority of hand-written worker-wrapper pairs (replacing 
manual by compiler optimization is always nice). 

So, instead of splitting recursive 'f' into 'fWorker' and 'fWrapper', 
with an INLINE pragma for 'fWrapper', one might in future be able 
just to say '{-# INLINE f PEEL 1 UNROLL 0 #-}' or, if unrolling
is also desirable '{-# INLINE f PEEL 1 UNROLL 8 #-}'? And 
GHC would do the work, by unfolding the entry points once (the
inlining of the wrapper), unfolding the recursive calls 8 times (the
loop unrolling), and taking the INLINE PEEL pragma also as a 
hint to try the static argument transformation.

> This is why the static argument transformation is such a big win (as
> I've mentioned before, 12% decrease in nofib runtime if you use it) -
> because it finds instances of recursive definitions where it's a
> REALLY GOOD idea to specialise on a particular argument (since that
> argument is actually /invariant/) and gives GHC the opportunity to
> specialise on it by creating a nonrecursive wrapper around the
> recursive worker loop.
> 
> In general, the compiler wants to be able to determine the structure
> of the argument of a loop body in a more fine grained way than just
> "invariant vs non-invariant" as SAT does. A particularly tempting
> example of an optimisation you could do if we dealt with recursive
> functions better is strength reduction. This is part of what I'm
> looking at implementing for GHC currently.

It seems that strength reduction could be seen as loop restructuring
in the small: a multiplication, if seen as a repeated addition, can be
unrolled, or the implicit adding loop can be fused with the explicit 
loop in which multiplication is called on (eg figure 7 in the ACM 
survey paper I mentioned). That way, no separate framework
would be needed for strength reduction. Btw, is that also what 
happened in the -fvia-C path of Don's initial example in this 
thread (I don't know how to read those leaqs, but the imulq is gone)?

But all these follow-on optimizations enabled by unfolding recursive
definitions seem to require further thought and design, whereas
user-controlled recursion unfolding (both peel and unroll) seems
to offer immediate benefits. Is that part of your current work?
Do you forsee any problems with the implementation, or with
the API I suggested above (adding PEEL and UNROLL options
to INLINE pragmas, to make them effective on recursive
definitions as well)?

Claus