Recursive functions and constant parameter closures (inlining/strictness analyzer question)

Fri Jun 20 17:04:46 EDT 2008

On Friday 30 May 2008, Duncan Coutts wrote:
> This is for two reasons. One is because your second foldl' is directly
> recursive so does not get inlined. The static argument transformation it
> what you're doing manually to turn the latter into the former. The SAT
> is implemented in ghc 6.9 (though it seems to be having integration
> problems).

Apologies for replying to this thread after around a month, but I've been 
looking at performance of code with and without this transformation a bit 
lately, and I'm rather curious what is being planned to go on in GHC 6.10 
with regard to it.

Specifically, will GHC simply always perform the static argument transform, or 
will it have some kind of heuristic to decide when it's useful? It seems, 
according to some tests I've done, that it's not always a win, and is 
sometimes a loss (and not just with regard to code duplication). I've been 
hard pressed to come up with a definite rule for when it's a benefit, other 
than simply testing and seeing what works.

For instance, consider some code from a recent benchmark I posted (I'm running 
this all under 6.8.2):

    bench :: Int -> Int -> ST s ()
    bench (I# k) (I# n) = ST go
     where
     go s = case sizeOf (0 :: Int)        of { I# w ->
            case newByteArray# (n *# w) s of { (# s, arr #) ->
            case fill arr n s             of { s ->
            go' arr k s } } }
     go' arr 0# s = (# s, () #)
     go' arr k  s = case reverse arr 0# (n -# 1#) s of { s ->
                    go' arr (k -# 1#) s }

    reverse :: MutableByteArray# s -> Int# -> Int# -> State# s -> State# s
    reverse arr i j s
      | i <# j    = case readIntArray#  arr i    s of { (# s, ei #) ->
                    case readIntArray#  arr j    s of { (# s, ej #) ->
                    case writeIntArray# arr j ei s of { s ->
                    case writeIntArray# arr i ej s of { s ->
                    reverse arr (i +# 1#) (j -# 1#) s } } } }
      | otherwise = s

The above is without the SAT on 'arr', obviously. This code runs fast. Doing 
SAT on reverse, but not on go' causes lots of heap allocation (for high 
iteration counts), slowing things way down. Doing it on go' but not reverse 
yields results about the same as not doing it at all. Doing it on both go' 
*and* reverse eliminates the heap allocation of the first case, *but* it 
results in overall slower code (which may, in fact, invalidate my bug about 
MBA# performance; I'll have to check. Just goes to show, even the most 
straightforward benchmark may not be measuring what you intend to measure).

Another example is this function from the MBA# version of the fannkuch 
benchmark:

    shift :: MutableByteArray# s -> Int# -> State# s -> State# s
    shift arr r s = case readIntArray# arr 0# s of { (# s, p0 #) ->
                    case go arr 0# r s of { s ->
                    writeIntArray# arr r p0 s } }
     where
     go arr i r s
       | i <# r    = case readIntArray# arr (i +# 1#) s of { (# s, e #) ->
                     case writeIntArray# arr i e s of { s ->
                     go arr (i +# 1#) r s } }
       | otherwise = s
    {-# INLINE shift #-}

Now, obviously, arr and r are static arguments for go (already in scope, no 
less), so one would likely write go without them (it's certainly what I did 
initially). However, eliminating either one of them causes heap allocation 
and, thus, slowdowns (eliminating both is even worse).

So, I suppose my question is: will GHC be better at this than I am? :) Will it 
know that performing the transform on shift above would cause extra heap 
allocation/slowdowns? Will it know that transforming reverse (and go') is 
slower than not doing so (I suppose it may be suspicious that this is all 
low-level code, but I've tried fiddling with higher-level code that I know 
gets compiled to this sort of code, and the above results seemed to hold)?

I'm also not yet skilled enough at reading assembly to figure out what exactly 
is causing all this. Are values getting kicked out of registers or some such?

Cheers,
-- Dan