[Haskell-beginners] Re: Understanding Monadic I/O

Fri Jan 15 05:48:20 EST 2010

Stephen Blackheath wrote:
> LAZY EVALUATION (AS IMPLEMENTED BY GHC)

Thank you for this marvelous explanation!

I do have a few comments, however, in particular concerning nomenclature
and mental model.

> A value can either be a literal value, or it can be a thunk.  For
> example, 1 + 1 is a thunk, and assuming the compiler hasn't optimized
> it, when it's demanded, the thunk is replaced in memory by a literal
> value '2', and the original thunk can be garbage collected.

The standard nomenclature is *expression*, both  1 + 1  and  2  are
expressions. The difference is that  2  is in *normal form*, whereas  1
+ 1  is not fully evaluated yet. The standard model of evaluating
expression in Haskell is called *graph reduction*, you can find some
preliminary material in the wikibook

  http://en.wikibooks.org/wiki/Haskell/Graph_reduction

"Thunk" traditionally refers to the machine representation of
unevaluated expressions, usually as a pointer and usually in the context
of discussing low level details like indirections and overwriting.

I don't use the concept of "thunk" when reasoning about lazy evaluation
because it has lots of unnecessary mental baggage, for example it must
be a pointer that points to something. This is fine when discussing how
GHC implements lazy evaluation, but it's not necessary when discussing
how lazy evaluation itself works.

Also, "thunk" doesn't distinguish between the different normal forms. In
Haskell, the possible normal forms are *weak normal form* and *weak head
normal form* (WHNF).

> Now let's say we have a list 'let xs = [1..1000000]' which (due to
> previous IO) has become fully evaluated and therefore takes up memory.
> Later we get the expression 'length xs'.  This is a thunk, and this
> thunk contains a reference to xs, thus preventing it being garbage
> collected.  Its value is just an integer, but until it's evaluated, it
> takes up a large amount of space because of the reference to xs.  This
> is an example of a space leak.

It is much clearer here to say that  xs  is the expression

  xs = 1 : 2 : 3 : 4 : 5 : ... : 1000000 : []

Clearly, this takes a lot of space to write down and so does

  n = length (1 : 2 : 3 : ... : 1000000 : [])

Compare this to the expression

  [1..1000000] = enumFromTo 1 1000000

which takes just few bytes to write down.

Of course, the normal form of  n  is  1000000  which does not take much
space. Hence, it is desirable to evaluate  n  to WHNF as soon possible,
unless it is completely unneeded.

> Another example:  'length [1..1000000]' is a little different.  It's a
> thunk (length ..) referring to a thunk ([1..1000000]). It takes up only
> a tiny bit of memory, since [1..1000000] is really just a recipe to
> create a list, not actually a list.

This is where the notion of "thunk" becomes unwieldy. The expression
[1..1000000] is actually a list, just not in normal form.

> The space behaviour of this
> expression is good, because [1..1000000] will be lazily created as
> 'length' consumes it, and the numbers are discarded as it goes.

It would be more precise to say that "the space usage while evaluating
the expression to normal form is good".

> Now we come to constructors.  A newtype has no effect on laziness, e.g.
> 
> newtype MyNumber = MyNumber Int
> 
> This is strict and exactly equivalent to an Int.  However, constructors
> created with data, the list constructor : and tuples (x,y) are all lazy.

The standard terminology is that data types are only evaluated to weak
head normal form, i.e. only to the outermost constructor.

Concerning newtypes, there is a subtle difference between a newtype

   newtype MyNumber = MyNumber Int

and a data type with a strict field

   data    MyNumber = MyNumber !Int

Regards,
Heinrich Apfelmus

--
http://apfelmus.nfshost.com