[Haskell-cafe] GHC predictability

Mon May 12 22:18:49 EDT 2008

Hello,

> I offer up the following example:
> 
This is an instructive example.

>   mean xs = sum xs / length xs
> 
In order to type-check, I actually need to write something like:

    mean xs = sum xs / fromIntegral (length xs)

There are other ways of get the numeric types to match correctly, but this 
is fairly general. 

Then, I immediately blow my stack if I try something like: 

    mean [1..1000000000].

The culprit is actually sum which is defined in the base libraries as 
either a foldl or a direct recursion depending on a compiler flag. In 
either case, the code is not strict enough; just trying to compute:

     sum [1..10000000] 

blows the stack. This can be easily fixed by defining a suitable strict 
sum:

    sum' = foldl' (+) 0

and now sum' has constant space. We could try to redefine mean using sum':

    mean1 xs = sum' xs / fromIntegral (length xs)

but this still gobbles up memory. The reason is that xs is used twice and 
cannot be discarded as it is generated. So we must move to a direct fold, 
as you did, to get a space efficient mean.

> If we now rearrange this to
> 
>   mean = (\(s,n) -> s / n) . foldr (\x (s,n) -> let s' = s+x; n' = n+1 
> in s' `seq` n' `seq` (s', n')) (0,0)
> 
> and run the same example, and watch it run in constant space.
> 
This code actually blows the stack on my machine just like the first naive 
mean. Foldl is perhaps more intuitive  to use here, since we are summing 
the numbers as we encounter them while walking down the list, and there is 
a strict version, foldl', provided in the base libraries.

    mean2 = uncurry (/) . foldl' (\(s,n) x -> (s+x, n+1)) (0,0)

However, this still gobbles up memory... the reason is that pairs are 
lazy. So we need a way to force the (s+x) and (n+1). An easy, and 
unobtrusive way to do this is to use strict pattern matching:

    mean2 = uncurry (/) . foldl' (\(!s, !n) x -> (s+x, n+1)) (0,0)

Now we can run:

    mean2 [1..1000000000]

in constant space.

While using an explicit foldl is less readable than the direct version 
(especially to a beginner), it is a standard functional idiom. 
Furthermore, a good understanding of lazy evaluation should be enough to 
guide you to using the strict foldl' and then then strict patterns. Is 
this a reasonable analysis?

Also, we've made no attempt to address speed. However, I would argue that 
the code's performance time is predictable-- it grows linearly with the 
size of the list. Improving the performance time is another matter that 
requires knowing about the internal representation of the various types 
being used.

-Jeff

---

This e-mail may contain confidential and/or privileged information. If you 
are not the intended recipient (or have received this e-mail in error) 
please notify the sender immediately and destroy this e-mail. Any 
unauthorized copying, disclosure or distribution of the material in this 
e-mail is strictly forbidden.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.haskell.org/pipermail/haskell-cafe/attachments/20080512/bebb8e94/attachment.htm