[Haskell-cafe] Generating a random list
Albert Y. C. Lai
trebla at vex.net
Sat Mar 1 16:43:38 EST 2008
The following is in ghci 6.8.2 with default options (e.g., default heap
and stack). "G>" denotes the ghci prompt.
At some points ghci will use 500MB of memory. Be sure you have enough
physical memory.
G> :m + Data.List System.Random
G> let f n = take n randoms (mkStdGen 0)) :: [Float]
I define f to take a parameter so that thunks are created fresh every
time I use it.
G> sum (f 1000000)
*** Exception: stack overflow
This overflow is known, and its solution too:
G> foldl' (+) 0 (f 1000000)
499985.38
The more interesting part is when we also sort:
G> foldl' (+) 0 (sort (f 1000000))
*** Exception: stack overflow
At this point everyone hastens to assign blames according to his/her
mental model. But some mental models are more mistaken than others. When
a mental model is shown to be mistaken by another example, we have a
name for the feeling that ensues: "my brain has exploded". And here is
an example that likely causes your brain to explode (it also causes ghci
to use 500MB of memory):
G> foldl' (+) 0 (sort (reverse (f 1000000)))
499992.0
(Don't worry about the different sum. We all know that re-ordering
floating point numbers changes the sum.)
Some brains haven't exploded yet. They believe that reverse helps by
forcing all 1000000 cons cells before sorting. To burst them, let's
reverse twice:
G> foldl' (+) 0 (sort (reverse (reverse (f 1000000))))
*** Exception: stack overflow
You can also reverse 3 times, 4 times... In general, even number of
times overflows, odd number of times works. It is now clear that order
before sorting matters. But why should it?
I now give some hints and step down. I invite you to contemplate and
discuss further.
* Suppose f 6 is [a,b,c,d,e,f]. What is the size of thunk f? Compare to
the size of thunk a, b, etc. Also note how much is shared among them. If
you want, the source code of System.Random is at
http://www.haskell.org/ghc/docs/latest/html/libraries/random/src/System-Random.html
If that is too complicated, a simpler model is available at
http://haskell.org/onlinereport/random.html
For our present purpose, the simpler model is accurate enough.
* Based on that knowledge, you're in a similar situation as
http://www.haskell.org/haskellwiki/Stack_overflow#Scans
Thus for example "last (f 1000000)" overflows, while "print (f 1000000)"
works just fine (if you have the patience).
* Debug.Trace.trace is great for verifying relative order of evaluation.
If I construct this list
[trace "0" a, trace "1" b, trace "2" c, ...]
and pass it to sort, I can see which item is forced in what order when
sort compares items. This command does:
G> :m + Debug.Trace
G> sort (zipWith (trace . show) [0..] (f 6))
What do you see? What do you think? Bump up the number 6 to 10, 20... to
verify what you guess.
* If you are interested in finding out in detail why sort does that, the
source code is at
http://www.haskell.org/ghc/docs/latest/html/libraries/base/src/Data-List.html
Note the code of "merge":
merge cmp xs [] = xs
merge cmp [] ys = ys
merge cmp (x:xs) (y:ys) = ...
Suppose you re-order the first two lines, i.e., make it
merge cmp [] ys = ys
merge cmp xs [] = xs
merge cmp (x:xs) (y:ys) = ...
what will happen?
* Prove that the sorting algorithm is only responsible for O(n log n)
time and O(log n) stack. Thus any extra stack pressure must come from
thunks in the list items.
More information about the Haskell-Cafe
mailing list