[Haskell-cafe] Fusing foldr's

Sat Oct 27 19:09:55 EDT 2007

On 10/26/07, Dan Weston <westondan at imageworks.com> wrote:
> Thanks for letting me know about the Data.Strict library on Hackage. I
> will definitely make use of that! BTW, you left out an import
> Data.List(foldl') in your example.
>
Yes, Data.Strict can be pretty handy for getting the right strictness.
Sorry about the missing import.

> My timing test is an order of magnitude worse than yours. Do you have an
> extra zero in your list endpoint?
>
>  > I fed these functions to ghc with the -O2 and -threaded flags and
>  > timed them using the list [1..10000000]. The result (best times out of
>  > several runs):
>  > avg4: 284 ms
>  > avgS: 184 ms
>  > avgP: 248 ms
>
> Using ghc -threaded -O2 --make Avg.hs for ghc 6.6.1, I ran your tests on
> [1..10000000] and got the user times:
>
> avg4: 12.75 s
> avgS:  3.65 s
> avgP: 15.56 s
>
> The funny thing is that avg4/avgS = 3.5 for and only 1.5 for you. I
> understand that with only 1 processor my avgP time may be twice yours,
> but not the avgS or avg4.
>
Oooops.. My numbers are totally bogus. I had code that looked like the
following:
\begin{code}
main = do
	time avg4 [1..10000000]
	time avg4 [1..10000000]
	time avg4 [1..10000000]
	time avgS [1..10000000]
	time avgS [1..10000000]
	time avgS [1..10000000]
	time avgP [1..10000000]
	time avgP [1..10000000]
	time avgP [1..10000000]
\end{code}
Not very elegant I know but I thought it would do the job. Apparently
I was wrong. GHC does common subexpression elimination on all the
lists so they're all shared between the different calls. Of course,
the first function call would always take long time but I ignored it,
thinking it was some anomaly. Anyway, I was totally sure that GHC only
did cse on constructor expressions and not on arbitrary computations.
Guess I was wrong. A little searching revealed the following quote by
Simon PJ:

> GHC does a very simple form of CSE. If it sees
>        let x = e in ....e....
> it replaces the inner 'e' by x.  But that's all at the moment.

Lesson learned.

Less bogus timing:
avg4: 18.0s
avgS: 2.2s
avgP: 17.4s

OK, so these figures make an even stronger case for my conclusion :-)
Single traversal can be much faster than multiple traversals *when
done right*.

> I have the following machine:
>
> Main memory size: 2026 Mbytes
> Num Processors: 1
> Processor Type: Intel(R) Xeon(TM) CPU 2.80GHz x32
> Clock Speed: 2790 MHZ
>
In case you're still interested my machine looks like this:

Memory: 2026 Mbytes
Processor: AMD Turion(tm) 64 X2 Mobile Technology TL-56
Clock Speed: 1800MHz

All the best,

/Josef