Loop unrolling + fusion ?

Sun Mar 1 07:55:28 EST 2009

>> its loop unroller on this guy haven't succeeded. -funroll-loops and
>> -funroll-all-loops doesn't  touch it,
>
> That's because the C produced by GHC doesn't look like a loop to GCC.  This can be fixed but given 
> that we are moving away from -fvia-C  anyway, it probably isn't worth doing.

That was one of my questions in the optimization and rewrite rules
thread: shouldn't -fvia-C be supported (as a non-default option)
for at least as long as the alternative isn't a clear win in all cases?

If there are small changes that could make GHC-generated code
more palatable to GCC's optimizer, wouldn't that be worth doing?
Once -fvia-C is allowed to bitrot to the level of unoptimized
bootstraps only, we might never get the good performance route
back, so why not keep it in good shape as long as it offers real benefits?

> The problem with low-level loop optimisations is that in general, they  should be done at a low 
> level. Core is much too early for this. To  find out whether and how much to unroll a particular 
> loop, you must  take things like register pressure and instruction scheduling into  account. IMO, 
> the backend is the only reasonable place to do these  optimisations.

[1] is one example reference for this argument (late unrolling). And
since most compiler textbooks are oddly quiet about optimizations
and their interactions, the survey [2] might also be helpful (and [3]
has some pointers to more recent work).

However, I'd like to note that Core is rather different from
conventional language source-level code, so I would expect
benefits from source-level "unrolling", too: Core retains much
more of the high-level semantics, so both identifying loops and
applying library-specific optimizations after unrolling are much
easier here than at the basic block level in the backend.

After all, it is just the normal unfolding/inlining that forms the
starting point for so many of GHC's optimizations, which just
happens to be blind to recursive definitions at the moment.
Recursive definitions are quite widely used in Haskell code,
so this blindspot can't be good for the overall effectiveness
of GHC's optimizer. If one could mark recursive bindings
with a counter, to limit unfoldings according to a compiler
option, generalised loop unrolling would just be a consequence
of what GHC does anyway, right?

That doesn't change the point that, at the lower level, loop
unrolling interacts strongly with the target architecture, and
that some relevant information is not available at Core level.

But it might be better to do both Core-level unfolding (to
enable further Core2Core optimizations, independent of
platform, that might no longer be visible at backend level)
and backend-level unfolding and re-folding (to massage the
low-level flow graph into a shape suited for the target
architecture, about which the Core level has no information).

One might also expect that Core-level transformations
are affected by compiler flags which users select according
to their target architecture (common practice at the moment),
so Core2Core isn't entirely out of that loop, either;-)

It is worth noting that there is a third level of optimizations,
even after the backend, in modern hardware, as these notes
[4] for an Intel compiler's unrolling option document. And
since I'm collecting references, there's also Ian's TH [5] for
doing the unrolling even before Core.

Claus

[1] An Aggressive Approach to Loop Unrolling, 1995
    http://citeseer.ist.psu.edu/old/620489.html
[2] Compiler Transformations for High-Performance Computing,
    ACM Computing Surveys, 1994
    http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.41.4885
[3] http://en.wikipedia.org/wiki/Loop_transformation
[4] 
http://www.intel.com/software/products/compilers/flin/docs/main_for/mergedprojects/optaps_for/common/optaps_hlo_unrl.htm
[5] Unrolling and simplifying expressions with Template Haskell,
    Ian Lynagh, 2003
    http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.5.9813