[Haskell-cafe] Fusion for fun and profi (Was: newbie optimization question)

Sun Oct 28 16:25:19 EDT 2007

rendel:
> Prabhakar Ragde wrote:
> >divisors i = [j | j<-[1..i-1], i `mod` j == 0]
> >main = print [i | i<-[1..10000], i == sum (divisors i)]
> 
> Jerzy Karczmarczuk wrote:
> >My point didn't concern that point. Haskell compiler cannot change an
> >algorithm using lists into something which deals with indexable arrays,
> >usually faster. Indexing may be faster than the indirection, and the
> >allocation of memory costs. And there is laziness...
> 
> This may be true, but it isn't relevant in this case, since the "obvious 
> c program" doesn't need any arrays, only two loops:
> 
> for (int i = 1; i <= 10000; i++) {
>   int sum = 0;
>   for (int j = 1; j < i; j++)
>     if (i % j == 0)
>       sum += i;
>   if (sum == i)
>     print(i);
> }
> 
> Loops can be expressed with lazy lists in Haskell. Therefore, the 
> presented Haskell program is perfectly equivalent to the "obvious c 
> program".

So what we would hope is that GHC could transform a set of composed lazy list functions
into a doubly nested strict loop in Int#... 

Let's see if we can get that result from GHC, using a bit of fusion.

First, to explain what is happening, let's first replace the `mod` call with
`rem`, which is faster, and then desugar the list comprehension and
enumerations syntax, to expose the underlying program:

    default(Int)

    divisors i   = filter (\j -> i `rem`j == 0)         (enumFromTo 1 (i-1))
    main = print $ filter (\i -> i == sum (divisors i)) (enumFromTo 1 10000)

Looking at this we see some good chances for fusion to take place: the
enumFromTo should fuse twice with 'filter', using build/foldr fusion.

And with stream fusion, the left fold 'sum' should also fuse with pipeline that
results from divisors. So my prediction would be that this program would run
slightly faster with stream fusion. Let's see...

Compiling with -O2 and ghc 6.8 snapshot, with build/foldr fusion, we see
two fusion sites, as expected, and a spec-constr of 'sum:

    $ ghc-6.9.20070916 A.hs -O2 -ddump-simpl-stats
    RuleFired
        1 SPEC Data.List.sum
        2 fold/build

Good, running this:

    $ time ./A-stream
    [6,28,496,8128]
    ./A-stream  1.29s user 0.02s system 99% cpu 1.319 total

Now we can try with stream fusion, using the stream-fusible list library
here:
    http://www.cse.unsw.edu.au/~dons/code/streams/list/

To use these list functions in preference to the default, we have to
import:

    import Prelude hiding (filter,sum,enumFromTo)
    import Data.List.Stream

and since the base library doesn't include stream fusible enumFromTo,
we'll have to write our own definition in terms of stream:

    import Data.Stream (enumFromToInt,unstream)
    enumFromTo i j = unstream (enumFromToInt i j)

Ok, that's easy. Compiling this, we hope to see 3 fusion sites, and all
heap-allocated Ints removed:

    $ ghc-6.9.20070916 A.hs -O2 -ddump-simpl-stats -package list
    RuleFired
        2 filter -> fusible
        1 sumInt -> fusible
        1 sum spec Int
        3 STREAM stream/unstream fusion

Terrific! The 'sum' was specialised to Int, then translated to a stream
version, the two filters also were translated, then 3 fusion sites were found
and fused.  Our program should now run faster:

    $ time ./A-stream
    [6,28,496,8128]
    ./A-stream  1.23s user 0.01s system 99% cpu 1.251 total

And so it does, with no list allocated for the sum loop. In fact the entire program reduces
to a strict unboxed nested loop:

    unfold = Int# -> [Int]
    wloop_sum_sV5 :: Int# -> Int# -> Int#

So almost identical types to the C program (bar for the return [Int]). 

Finally, we can manually translate the C code into a confusing set of nested
loops with interleaved IO,

    main = loop 1 
      where
        loop !i | i > 10000 = return ()
                | otherwise = if i == go i 0 1 then print i >> loop (i+1) 
                                               else loop (i+1)

        go !i !s !j | j <= i-1 = if i `rem` j == 0 then go i (s+j) (j+1)
                                                else go i s (j+1)
                 | otherwise = s

And we get *no speed benefit* at all!

    time ./A-loop  1.24s user 0.00s system 98% cpu 1.256 total

So the lesson is: write in a high level style, and the compiler can do the work
for you. Or, GHC is pretty smart on high level code.

-- Don