[Haskell-cafe] Why Haskell?

Sun Jul 23 04:07:02 EDT 2006

Matthew Bromberg wrote:
>  > I used what I thought, initially was an elegant contruction technique in
>  > Haskell.  Something like this
>  > do
>  > ...
>  >     sequence $ [ reffill b s | s <- [0..(fi temits)-1], b <- [0..(fi 
> nc)-1]]
>  > ...(push list on to matrix stack)
> 
> Try the sequence_ (note the underscore) function, it should be a big win 
> here.
> Cheers,
> Spencer Janssen
> 

> Now thats interesting.  I can see that this function is more appropriate 
> since I do not need to retrieve data from the IO monad,
> but what I don't understand is why it's actually faster.  I will give it 
> a try and test it on a large set to see if things change.
> Thanks for the tip.

The best way I have to explain is to pedantically go through how I would try to 
understand why it is faster.  I hope this is something useful in this message, 
and nothing that is taken as condescending.

Thinking about memory usage and garbage collection in strict language like Java 
is tricky, and thinking about them in non-strict Haskell is another layer of 
consideration. ( But in this case it will be quite easy to understand from the 
code.)  I will look at the code:

1. See that sequence and sequence_ are exposed by Prelude
2. Since that is part of the Haskell 98 definition, google a copy of the 
"Haskell 98 report" at http://www.haskell.org/onlinereport/
3. Look at "8. Standard Prelude" at 
http://www.haskell.org/onlinereport/standard-prelude.html
4. Scroll down to sequence and sequence_ to see:

> sequence       :: Monad m => [m a] -> m [a] 
> sequence       =  foldr mcons (return [])
>                     where mcons p q = p >>= \x -> q >>= \y -> return (x:y)
> 
> 
> sequence_      :: Monad m => [m a] -> m () 
> sequence_      =  foldr (>>) (return ())

A more generally useful way to look up the actual ghc source code is:

1. In ghci do ":i sequence" to see it comes from Control.Monad
2. look at http://www.haskell.org/ghc/docs/latest/html/libraries/index.html
3. See "Control.Monad" is in the "base" packagage
4. Browse through http://haskell.org/ghc/ to "Developers(Wiki)" and "Getting the 
Sources" to http://hackage.haskell.org/trac/ghc/wiki/GhcDarcs
5. Follow the link to package "base" via http://darcs.haskell.org/packages/base/
6. Browse to "Control" and "Monad.hs" to 
http://darcs.haskell.org/packages/base/Control/Monad.hs
7. Scroll down to sequence and sequence_ to see:

> -- | Evaluate each action in the sequence from left to right,
> -- and collect the results.
> sequence       :: Monad m => [m a] -> m [a] 
> {-# INLINE sequence #-}
> sequence ms = foldr k (return []) ms
> 	    where
> 	      k m m' = do { x <- m; xs <- m'; return (x:xs) }
> 
> -- | Evaluate each action in the sequence from left to right,
> -- and ignore the results.
> sequence_        :: Monad m => [m a] -> m () 
> {-# INLINE sequence_ #-}
> sequence_ ms     =  foldr (>>) (return ()) ms

The implementation code is only 1 or 2 lines, and reveals it is just really 
useful shorthand for a right fold.

Right folds are notorious for being bad memory consumers when they are strict in 
the second argument of their accumulation function.  And indeed this is the 
problem in this case.

Looking at sequence_ first, since it is simpler:  It essentially says to put 
(>>) between all the elements of the list of IO actions which is equivalent to 
putting the actions one after another in "do" notation.  It never needs to 
remember the result of any of the actions, so the garbage collector will 
occasionally run and destroy the intermediate results.  Those intermediate 
results may pile up in memory as dead references, so the gc might clean them 
only after they (or something else) cause memory pressure.

Now look at sequence, and remember that the Monad m here is Monad IO.  The IO 
monad runs in a strict manner.  Consider " do { x <- m; xs <- m'; return (x:xs) 
}" which could have been written

sequence ms = foldr k (return []) ms
   where
     k m m' = do
       x <- m
       xs <- m'
       return (x:xs)

sequence [a,b,c] = foldr k (return []) [a,b,c]

can be expanded via the foldr definition and some syntactic sugar as

a `k` (b `k` (c `k` return []))

can be expanded via the `k` definition and some syntactic sugar as

do
   x <- a
   xs <- (b `k` (c `k` return []))
   return (x:xs)

So you can see the return value of a is x.  Then it goes and computes the rest 
of the sequence for b and c while holding onto the reference for x.  The "return 
(x:xs)" line is later and also refers to x, which means x is live instead of 
dead, so the garbage collector will not remove it.

The same analysis applies to the values returned by b and c.  All the 
intermediate values are live until the first `k` executes the "return" statement 
with all the values.  This is why the memory usage is maximal.

The problem was created by the IO Monad evaluating the b and c strictly once "xs 
<- (b `k` (c `k` return []))" was encountered.  The use of sequence with a 
different Monad which was lazy instead would have different evaluation order and 
memory usage.  In particular "Control.Monad.ST.Strict" and 
"Control.Monad.ST.Lazy" have opposite behaviors in this regard.

Other things I notice from the code:

sequence and sequence_ work with any Monad, which should make one curious about 
what they are good shorthand for in non-deterministic monads like List.  They 
are also just as handy in Maybe / Either / etc...

The Haddock formatted comments, which are where the documentation comes from:
http://www.haskell.org/ghc/docs/latest/html/libraries/base/Control-Monad.html#v%3Asequence

The GHC specific comment "INLINE" is being used.  This reliably turns sequence 
and sequence_ into shorthand for their foldr definitions during compilation, 
allowing further improvements such as deforestation when possible.

-- 
Chris