[Haskell-cafe] Generating a random list

Sat Mar 1 16:43:38 EST 2008

The following is in ghci 6.8.2 with default options (e.g., default heap 
and stack). "G>" denotes the ghci prompt.

At some points ghci will use 500MB of memory. Be sure you have enough 
physical memory.

G> :m + Data.List System.Random
G> let f n = take n randoms (mkStdGen 0)) :: [Float]

I define f to take a parameter so that thunks are created fresh every 
time I use it.

G> sum (f 1000000)
*** Exception: stack overflow

This overflow is known, and its solution too:

G> foldl' (+) 0 (f 1000000)
499985.38

The more interesting part is when we also sort:

G> foldl' (+) 0 (sort (f 1000000))
*** Exception: stack overflow

At this point everyone hastens to assign blames according to his/her 
mental model. But some mental models are more mistaken than others. When 
a mental model is shown to be mistaken by another example, we have a 
name for the feeling that ensues: "my brain has exploded". And here is 
an example that likely causes your brain to explode (it also causes ghci 
to use 500MB of memory):

G> foldl' (+) 0 (sort (reverse (f 1000000)))
499992.0

(Don't worry about the different sum. We all know that re-ordering 
floating point numbers changes the sum.)

Some brains haven't exploded yet. They believe that reverse helps by 
forcing all 1000000 cons cells before sorting. To burst them, let's 
reverse twice:

G> foldl' (+) 0 (sort (reverse (reverse (f 1000000))))
*** Exception: stack overflow

You can also reverse 3 times, 4 times... In general, even number of 
times overflows, odd number of times works. It is now clear that order 
before sorting matters. But why should it?

I now give some hints and step down. I invite you to contemplate and 
discuss further.

* Suppose f 6 is [a,b,c,d,e,f]. What is the size of thunk f? Compare to 
the size of thunk a, b, etc. Also note how much is shared among them. If 
you want, the source code of System.Random is at

http://www.haskell.org/ghc/docs/latest/html/libraries/random/src/System-Random.html

If that is too complicated, a simpler model is available at

http://haskell.org/onlinereport/random.html

For our present purpose, the simpler model is accurate enough.

* Based on that knowledge, you're in a similar situation as

http://www.haskell.org/haskellwiki/Stack_overflow#Scans

Thus for example "last (f 1000000)" overflows, while "print (f 1000000)" 
works just fine (if you have the patience).

* Debug.Trace.trace is great for verifying relative order of evaluation. 
If I construct this list

[trace "0" a, trace "1" b, trace "2" c, ...]

and pass it to sort, I can see which item is forced in what order when 
sort compares items. This command does:

G> :m + Debug.Trace
G> sort (zipWith (trace . show) [0..] (f 6))

What do you see? What do you think? Bump up the number 6 to 10, 20... to 
verify what you guess.

* If you are interested in finding out in detail why sort does that, the 
source code is at

http://www.haskell.org/ghc/docs/latest/html/libraries/base/src/Data-List.html

Note the code of "merge":

merge cmp xs [] = xs
merge cmp [] ys = ys
merge cmp (x:xs) (y:ys) = ...

Suppose you re-order the first two lines, i.e., make it

merge cmp [] ys = ys
merge cmp xs [] = xs
merge cmp (x:xs) (y:ys) = ...

what will happen?

* Prove that the sorting algorithm is only responsible for O(n log n) 
time and O(log n) stack. Thus any extra stack pressure must come from 
thunks in the list items.