[Haskell-cafe] Lazy IO and closing of file handles

Matthew Brecknell haskell at brecknell.org
Thu Mar 15 05:54:33 EDT 2007


Ketil Malde:
> Perhaps this is an esoteric way, but I think the nicest approach is to 
> parse into a strict structure.  If you fully evaluate each Email (or 
> whatever structure you parse into), there will be no unevaluated thunks 
> linking to the file, and it will be closed.

Not necessarily so, since you are making assumptions about the
timeliness of garbage collection. I was similarly sceptical of Claus'
suggestion:

Claus Reinke:
> in order to keep the overall structure, one could move readFile backwards
> and parseEmail forwards in the pipeline, until the two meet. then make sure
> that parseEmail completely constructs the internal representation of each
> email, thereby keeping no implicit references to the external representation.
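
(For concreteness, "completely constructs" means deep evaluation. For
a hypothetical Email type (not Pete's actual code), it might look like
the following sketch, in the same spirit as the force function used in
test2 below. This is illustrative only, and not part of the literate
program that follows.)

    data Email = Email { headers :: [(String, String)], body :: String }

    -- Walk the spine of every string in the structure. Once the file
    -- has been read to the end, nothing references the handle, and the
    -- garbage collector is free to finalise (close) it.
    forceEmail :: Email -> ()
    forceEmail (Email hs b) = forceStr (concat [k ++ v | (k, v) <- hs] ++ b)
      where
        forceStr (_:xs) = forceStr xs
        forceStr []     = ()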

So here's a test. I don't have any big maildirs handy, so this is based
on the simple exercise of printing the first line of each of a large
number of files. First, the preamble.

> import Control.Exception (bracket)
> import System.Environment
> import System.IO

> main = do
>   t:n:fs <- getArgs
>   ([test0,test1,test2,test3] !! read t) (take (read n) $ cycle fs)
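
(For example, compiled as "handletest", running "./handletest 1 5000
somefile" would apply test1 to 5000 repetitions of somefile; the
binary name here is just for illustration.)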

The following example corresponds to Pete's original program. As
expected, when called with a sufficiently large number of files, it
always results in file handle exhaustion without producing any output,
since mapM readFile opens every handle before the first line is
printed:

> test0 files = mapM readFile files >>= mapM_ (putStrLn.head.lines)

The next example corresponds (I think) to Claus' suggestion, in which
the readFile and putStrLn are performed at the same point in the
pipeline. I found that sometimes this runs without error, but other
times it fails with file handle exhaustion. This seems to depend on the
mood of the garbage collector, or at least on the external conditions
under which it operates. It also appears to fail more frequently for
small files. Without any knowledge of garbage collector internals, I'm
guessing that this is because readFile reads in 8K chunks, and because
collections are triggered by allocation. Files significantly smaller
than 8K allocate very little per handle opened, so garbage collection
cycles (which are what finalise and close the unreachable handles) are
likely to be much less frequent relative to the rate at which handles
are opened, and therefore there is greater likelihood of file handle
exhaustion between GC cycles.

> test1 files = mapM_ doStuff files where
>   doStuff f = readFile f >>= putStrLn.head.lines
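
One way to test that guess (a standalone sketch, not part of the
literate program here, and it needs an extra import) would be to force
a major collection after each file with System.Mem.performGC. If the
exhaustion disappears, that supports the idea that collections simply
don't happen often enough to finalise the unreachable handles:

    import System.Mem (performGC)

    testGC :: [FilePath] -> IO ()
    testGC files = mapM_ doStuff files
      where
        doStuff f = do
          readFile f >>= putStrLn . head . lines
          -- run a major collection, so that any handles that have
          -- become unreachable can be finalised (and closed) promptly
          performGC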

The third is similar to the second, except that it explicitly forces
the whole file to be read before moving on to the next. As expected,
this saves me from file handle exhaustion, but it is grossly
inefficient for large files, since each file is read to the end even
though only its first line is needed.

> test2 files = mapM_ doStuff files where
>   doStuff f = do
>     contents <- readFile f
>     putStrLn $ head $ lines contents
>     return $! force contents   -- drain the rest of the file
>   force (_:xs) = force xs      -- walk the spine to the end
>   force []     = ()
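
Note that return $! contents by itself would not help: ($!) evaluates
only to weak head normal form, that is, the first cons cell, which is
why the explicit traversal over the whole string is needed.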

In the fourth example, I explicitly close the file handle. This also
saves me from file handle exhaustion, but I must be careful to force
everything I need to be read before returning. Returning a lazy
computation would be no good, as discovered in [1]. In this case,
putStrLn does all the forcing I need.

> test3 files = mapM_ bracketStuff files where
>   -- bracket guarantees hClose runs once doStuff returns (or throws)
>   bracketStuff f = bracket (openFile f ReadMode) hClose doStuff
>   doStuff h = hGetContents h >>= putStrLn.head.lines
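
For contrast, here is a sketch of the pitfall from [1] (illustrative
only, not part of the program above, using the same imports): if
doStuff instead returned the lazy value to be consumed outside the
bracket, hClose would run before anything had been read, truncating
the stream. The caller would then typically see an empty result or a
"head: empty list" error.

    -- Broken variant: the thunk escapes the bracket unforced.
    badStuff :: Handle -> IO String
    badStuff h = do
        s <- hGetContents h
        return (head (lines s))  -- nothing is forced before hClose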

As Oleg points out in [2], all of the above have the problem that it is
impossible to tell the difference between a read error and end-of-file.
I had intended to write an example using explicitly sequenced I/O, but
Oleg has saved me the trouble with the post he made just now [3].

[1] http://www.haskell.org/pipermail/haskell-cafe/2007-March/023189.html
[2] http://www.haskell.org/pipermail/haskell-cafe/2007-March/023073.html
[3] http://www.haskell.org/pipermail/haskell-cafe/2007-March/023523.html


