[Haskell-cafe] problem with IO, strictness, and "let"
Albert Y. C. Lai
trebla at vex.net
Fri Jul 13 18:35:32 EDT 2007
Brandon Michael Moore wrote:
> Calling hClose after hGetContents is the root of the problem, but this is
> a good example for understanding seq better too.
To further this end, I'll "take issue" :) with the final version that
has been tested to work, and show that it still won't work.
First, the program in question:
import System.IO
import System.Environment

process_file :: FilePath -> IO ()
process_file filename =
    do h <- openFile filename ReadMode
       c <- hGetContents h
       cs <- return $! lines c
       hClose h
       putStrLn $ show $ length cs
It will give a wrong answer for a large enough file. The short
explanation is that seq, or its friend $!, does not evaluate its
argument (lines c) entirely; it only evaluates "so much". The jargon is
"weak head normal form", but concretely here are some examples: for Int,
it evaluates until you have an actual number, which is all good and
expected; for lists, it evaluates only until the first cons cell emerges
(let's say we know the list will be non-empty). It will not hunt down
the rest of the list. (Moreover, it will not even hunt down what's in
the cons cell, e.g., the details of the first item of the list. But this
is not too important for now.)
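
To make "only so much" concrete, here is a small sketch. The helper
survives is my own name for illustration, not a library function; it
reports whether evaluating a value to weak head normal form completes
without an exception. A list whose tail is undefined passes through seq
untouched, while length, which must walk every cons cell, trips over it.

```haskell
import Control.Exception (SomeException, evaluate, try)

-- Hypothetical helper: True if evaluating x to weak head normal form
-- finishes without throwing.
survives :: a -> IO Bool
survives x = do
  r <- try (evaluate x)
  return (case r of
            Left e  -> const False (e :: SomeException)
            Right _ -> True)

main :: IO ()
main = do
  let xs = 1 : undefined :: [Int]   -- first cons cell is fine, the tail is a bomb
      ys = [undefined] :: [Int]     -- the spine is fine, the only item is a bomb
  -- seq stops at the first cons cell, so the undefined tail is never touched.
  survives (xs `seq` ()) >>= print   -- True
  -- length walks the whole spine and hits the undefined tail.
  survives (length xs) >>= print     -- False
  -- length does not look inside the items, so an undefined item is harmless.
  survives (length ys) >>= print     -- True
```

The same reasoning applies to lines c in the program above: $! stops as
soon as the first cons cell exists.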
It still happens to give the right answer for a small enough file, thanks
to buffering.
So here is a chronicle of execution, with confusing details - confusing
because two wrongs conspire to make a right, almost:
0. open file, hGetContents. Remember that block buffering with a pretty
large buffer is the default.
1. $! evaluates lines c for the first cons cell. To do that, latent code
(the jargon is "thunk") installed by hGetContents is invoked and it
reads something. It is in block buffering mode, so it reads a blockful.
The first cons cell will only emerge when the first line break is found,
so it reads blocks until a block contains a line break. But it does not
read more blocks.
Whatever has been read will be accessible to cs. Maybe not immediately
in the form of lists of strings. Part of it is already in that form, the
other part is in the form of buffer content plus a thunk to convert the
buffer to lists of strings just in time. That thunk intermingles code
from the lines function and hGetContents. Perhaps you don't need to know
that much. The bottom line is that cs has access to one or more blocks'
worth of stuff, which may or may not be the whole file. Exactly how much
is defined by: as many blocks as it takes to contain the first line break.
2. close file. Henceforth no further reading is possible. cs still has
access to whatever has been done in the above step; it is already in
memory and can't be lost. But cs has no access to whatever is not in
memory; it does not exist.
3. count the number of lines accessible to cs.
As examples here are some scenarios:
A. The whole file fits into the buffer. You will get the correct count.
B. Five lines plus a little bit more fit into the buffer. The answer is six.
C. The first line is very long, or the buffer is very small. The answer
is one or two, depending on whether the line break falls in the middle
or at the boundary of the buffer.
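
The three scenarios can be reproduced with a pure model of the chronicle
above, with no real I/O involved. This is only a sketch under
simplifying assumptions: visibleLines and chunksOf are names I made up,
every read is assumed to deliver exactly one full block, and the block
size is assumed to be positive. It models the behaviour described in
this message, not any particular GHC release.

```haskell
-- Split a string into blocks of n characters (n is assumed to be positive).
chunksOf :: Int -> String -> [String]
chunksOf _ [] = []
chunksOf n xs = let (a, b) = splitAt n xs in a : chunksOf n b

-- Model of steps 1 to 3: read whole blocks until one of them contains a
-- line break, stop reading, then count the lines in what has been read.
visibleLines :: Int -> String -> Int
visibleLines blockSize contents = length (lines (concat readBlocks))
  where
    blocks         = chunksOf blockSize contents
    (before, rest) = break (elem '\n') blocks
    readBlocks     = before ++ take 1 rest

main :: IO ()
main = do
  let tenLines = unlines (map show [1 .. 10 :: Int])
      longHead = "aaaaaaaaaa\nb\nc\n"
  print (visibleLines 8192 tenLines)  -- A: whole file fits, prints 10
  print (visibleLines 11 tenLines)    -- B: five lines plus "6", prints 6
  print (visibleLines 4 longHead)     -- C: break in mid-block, prints 2
  print (visibleLines 11 longHead)    -- C: break at block boundary, prints 1
```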
To test for these scenarios, you can fudge the buffer size and have fun:
process_file :: FilePath -> IO ()
process_file filename =
    do h <- openFile filename ReadMode
       hSetBuffering h (BlockBuffering (Just 20))
       c <- hGetContents h
       cs <- return $! lines c
       hClose h
       putStrLn $ show $ length cs
There are two conclusions you can draw:
For a task satisfied by a single pass, and the task traverses the whole
file unconditionally: let go of control. Use hGetContents and don't
bother to hClose yourself (the handle is closed automatically once the
whole file has been read).
For a task requiring several passes, and you want the whole file read
"here and now": seq won't cut it. Some people use "return $! length c"
for that. There are also other ways. Consider Data.ByteString.
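
Both conclusions might be sketched as follows. The function names and
the scratch file name are placeholders of mine. The first function is a
single pass with no hClose; the second uses the "return $! length c"
idiom to pull the whole file into memory before closing, after which c
can be traversed as many times as you like.

```haskell
import System.IO

-- Single pass over the whole file: no explicit hClose.
countLinesOnePass :: FilePath -> IO Int
countLinesOnePass filename = do
  h <- openFile filename ReadMode
  c <- hGetContents h
  -- $! forces the count right here, which consumes the file to its end;
  -- the handle is closed once the entire contents have been read.
  return $! length (lines c)

-- Several passes: force the whole string first, then close the handle.
countLinesAndChars :: FilePath -> IO (Int, Int)
countLinesAndChars filename = do
  h <- openFile filename ReadMode
  c <- hGetContents h
  n <- return $! length c   -- walks every character, so everything is read
  hClose h                  -- safe now: c is entirely in memory
  return (length (lines c), n)

main :: IO ()
main = do
  let path = "demo-lines.txt"   -- hypothetical scratch file
  writeFile path (unlines (map show [1 .. 1000 :: Int]))
  countLinesOnePass path >>= print
  countLinesAndChars path >>= print
```

The price of the second version is that the whole file sits in memory as
a String, which is where Data.ByteString becomes attractive.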
What about a task satisfied by a single pass but it does not necessarily
traverse the whole file? Automatic close won't kick in. You will hClose
yourself but where to put it is a long story.