[Haskell-cafe] problem with IO, strictness, and "let"

Albert Y. C. Lai trebla at vex.net
Fri Jul 13 18:35:32 EDT 2007

Brandon Michael Moore wrote:
> Calling hClose after hGetContents is the root of the problem, but this is
> a good example for understanding seq better too.

To further this end, I'll "take issue" :) with the final version that 
has been tested to work, and show that it still won't work.

First, the program in question:

import System.IO
import System.Environment

process_file :: FilePath -> IO ()
process_file filename =
     do h <- openFile filename ReadMode
        c <- hGetContents h
        cs <- return $! lines c
        hClose h
        putStrLn $ show $ length cs

It will give a wrong answer for a large enough file. The short 
explanation is that seq, or its friend $!, does not evaluate its 
argument (lines c) entirely; it only evaluates "so much". The jargon is 
"weak head normal form", but concretely here are some examples: for Int, 
it evaluates until you have an actual number, which is all good and 
expected; for lists, it evaluates only until the first cons cell emerges 
(let's say we know the list will be non-empty). It will not hunt down 
the rest of the list. (Moreover, it will not even hunt down what's in 
the cons cell, e.g., the details of the first item of the list. But this 
is not too important for now.)
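
To make "so much" concrete, here is a small sketch of my own (the names partial and headless are made up for illustration, they are not from the program above). It shows that seq is satisfied by the first cons cell, while length walks the whole list:

```haskell
import Control.Exception (SomeException, evaluate, try)

-- Hypothetical example values, not part of the original program.
-- The first cons cell exists, but the tail is undefined.
partial :: [Int]
partial = 1 : undefined

-- The cons cell exists, but its first item is undefined.
headless :: [Int]
headless = [undefined]

main :: IO ()
main = do
  -- seq stops at the first cons cell, so the undefined tail is never touched.
  partial `seq` putStrLn "seq on partial: fine"
  -- seq does not look inside the cons cell either.
  headless `seq` putStrLn "seq on headless: fine"
  -- length walks every cons cell but never looks at the items.
  print (length headless)
  -- length on partial has to walk into the undefined tail, which throws.
  r <- try (evaluate (length partial)) :: IO (Either SomeException Int)
  putStrLn (either (const "length on partial: hit the undefined tail") show r)
```

So seq tells you "there is at least one cons cell", and nothing more.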

It still happens to give the right answer for a small enough file, thanks 
to buffering.

So here is a chronicle of execution, with confusing details - confusing 
because two wrongs conspire to make a right, almost:

0. open file, hGetContents. Remember that block buffering with a pretty 
large buffer is the default.

1. $! evaluates lines c for the first cons cell. To do that, latent code 
(the jargon is "thunk") installed by hGetContents is invoked and it 
reads something. It is in block buffering mode, so it reads a whole block 
at a time. The first cons cell will only emerge when the first line break 
is found, so it reads blocks until a block contains a line break. But it 
does not read more blocks.

Whatever has been read will be accessible to cs. Maybe not immediately 
in the form of lists of strings. Part of it is already in that form, the 
other part is in the form of buffer content plus a thunk to convert the 
buffer to lists of strings just in time. That thunk intermingles code 
from the lines function and hGetContents. Perhaps you don't need to know 
that much. The bottom line is that cs has access to one or more blocks' 
worth of stuff, which may or may not be the whole file. Exactly how much 
is defined by: as many blocks as it takes to contain the first line break.

2. close file. Henceforth no further reading is possible. cs still has 
access to whatever has been read in the above step; it is already in 
memory and can't be lost. But cs has no access to whatever is not in 
memory; as far as cs is concerned, it does not exist.

3. count the number of lines accessible to cs.

As examples, here are some scenarios:

A. The whole file fits into the buffer. You will get the correct count.

B. Five lines plus a little bit more fit into the buffer. The answer is six.

C. The first line is very long, or the buffer is very small. The answer 
is one or two, depending on whether the line break falls in the middle 
or at the boundary of the buffer.
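
The three scenarios can be reproduced with a pure toy model. This is only a sketch of the chronicle as I described it, not of the real I/O code: visibleLines and chunksOf are names I made up, the block size is assumed positive, and "reading" is just chopping a String into blocks.

```haskell
-- Toy model (hypothetical helper names): chop the contents into blocks,
-- keep reading blocks until one contains a line break, then count the
-- lines in what has been read so far.

chunksOf :: Int -> String -> [String]
chunksOf _ [] = []
chunksOf n xs = take n xs : chunksOf n (drop n xs)

visibleLines :: Int -> String -> Int
visibleLines blockSize contents = length (lines (concat readBlocks))
  where
    blocks         = chunksOf blockSize contents
    (before, rest) = break (elem '\n') blocks
    -- every block before the first line break, plus the block containing it
    readBlocks     = before ++ take 1 rest

main :: IO ()
main = do
  -- A. whole file fits into the buffer
  print (visibleLines 100 "a\nb\nc\n")
  -- B. five lines plus a little bit more fit into the buffer
  print (visibleLines 22 (concat (replicate 10 "abc\n")))
  -- C. long first line, line break in the middle of a block
  print (visibleLines 4 (replicate 10 'x' ++ "\nrest\n"))
  -- C. long first line, line break at the boundary of a block
  print (visibleLines 4 (replicate 7 'x' ++ "\nrest\n"))
```

Under this model the four calls give 3, 6, 2 and 1 respectively.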

To test for these scenarios, you can fudge the buffer size and have fun:

process_file :: FilePath -> IO ()
process_file filename =
     do h <- openFile filename ReadMode
        hSetBuffering h (BlockBuffering (Just 20))
        c <- hGetContents h
        cs <- return $! lines c
        hClose h
        putStrLn $ show $ length cs

There are two conclusions you can draw:

For a task satisfied by a single pass, and the task traverses the whole 
file unconditionally: let go of control. Use hGetContents and don't 
bother to hClose yourself (it will be closed just in time).
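
As a minimal sketch of that first conclusion (countLines and the main below are my own illustration; the original program's main was not shown):

```haskell
import System.Environment (getArgs)
import System.IO

-- Count lines in a single pass. There is deliberately no hClose:
-- length (lines c) consumes the file to the end, and hGetContents
-- closes the handle itself once the end is reached.
countLines :: FilePath -> IO Int
countLines filename = do
  h <- openFile filename ReadMode
  c <- hGetContents h
  -- The result is an Int, so $! really does evaluate it completely.
  return $! length (lines c)

main :: IO ()
main = do
  args <- getArgs
  mapM_ (\f -> countLines f >>= print) args
```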

For a task requiring several passes, and you want the whole file read 
"here and now": seq won't cut it. Some people use "return $! length c" 
for that. There are also other ways. Consider Data.ByteString.
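
Here is a sketch of both options, assuming the bytestring package is available (it ships with GHC). The function names readAllLines and readAllLinesBS are hypothetical.

```haskell
import qualified Data.ByteString.Char8 as B
import System.Environment (getArgs)
import System.IO

-- Option 1: force the whole String with length before closing.
readAllLines :: FilePath -> IO [String]
readAllLines filename = do
  h <- openFile filename ReadMode
  c <- hGetContents h
  -- length must walk to the end of c, so the entire file gets read here.
  _ <- return $! length c
  hClose h
  -- c is fully in memory now, so any number of passes is safe.
  return (lines c)

-- Option 2: a strict ByteString is read in one go, no laziness involved.
readAllLinesBS :: FilePath -> IO [B.ByteString]
readAllLinesBS filename = fmap B.lines (B.readFile filename)

main :: IO ()
main = do
  args <- getArgs
  mapM_ (\f -> do ls <- readAllLines f
                  bs <- readAllLinesBS f
                  print (length ls, length bs)) args
```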

What about a task that is satisfied by a single pass but does not 
necessarily traverse the whole file? Automatic close won't kick in. You 
will have to hClose yourself, but where to put it is a long story.
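
Without telling the whole long story, one possible arrangement (only a sketch; firstLines is a made-up example task) is to let withFile do the closing and to force everything you intend to keep before leaving its scope:

```haskell
import Control.Exception (evaluate)
import System.Environment (getArgs)
import System.IO

-- Hypothetical task: return only the first n lines of a file.
firstLines :: Int -> FilePath -> IO [String]
firstLines n filename =
  withFile filename ReadMode $ \h -> do
    c <- hGetContents h
    let ls = take n (lines c)
    -- Walk every character of the lines we keep while the handle is
    -- still open; seq or $! on ls alone would stop at the first cons cell.
    _ <- evaluate (length (concat ls))
    return ls
    -- withFile closes the handle here, after the forcing above.

main :: IO ()
main = do
  args <- getArgs
  mapM_ (\f -> firstLines 3 f >>= mapM_ putStrLn) args
```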
