[Haskell-cafe] problem with IO, strictness, and "let"

Fri Jul 13 19:29:12 EDT 2007

Albert,

Thanks for the very detailed reply!  That's the great thing about this mailing list.

I find your description of seq somewhat disturbing.  Is this behavior documented in the API?  I 
can't find it there.  It suggests that perhaps there should be a 
really-truly-absolutely-I-mean-right-now-seq function that evaluates the first argument strictly no 
matter what (not that this should be something that gets used very frequently).  Or are there 
reasons why this is not feasible?
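
For what it's worth, here is a rough sketch of what such a function might look like. One plausible reason it cannot be a single polymorphic function like seq is that "evaluate everything" depends on the shape of the type, so a type class seems to be needed. The names Forceable, forceAll and reallySeq are made up for illustration; the deepseq package (module Control.DeepSeq, with NFData, deepseq and force) offers a real version of the same idea.

```haskell
-- Hypothetical class, for illustration only: each type says how to
-- evaluate itself all the way down.
class Forceable a where
  forceAll :: a -> ()

instance Forceable Char where
  forceAll c = c `seq` ()

instance Forceable Int where
  forceAll n = n `seq` ()

instance Forceable a => Forceable [a] where
  forceAll []       = ()
  forceAll (x : xs) = forceAll x `seq` forceAll xs

-- Evaluate the first argument completely, then return the second.
reallySeq :: Forceable a => a -> b -> b
reallySeq a b = forceAll a `seq` b

main :: IO ()
main = do
  let ls = lines "one\ntwo\nthree\n"
  ls `reallySeq` putStrLn "all lines fully evaluated"
  print (length ls)
```

The price is a full traversal of the structure on every call, which is presumably why this is not what seq does by default.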

Sorry to belabor this.  Learning to think lazily is IMO one of the hardest aspects of learning Haskell.

Mike

Albert Y. C. Lai wrote:
> Brandon Michael Moore wrote:
>> Calling hClose after hGetContents is the root of the problem, but this is
>> a good example for understanding seq better too.
> 
> To further this end, I'll "take issue" :) with the final version that 
> has been tested to work, and show that it still won't work.
> 
> First, the program in question:
> 
> import System.IO
> import System.Environment
> 
> process_file :: FilePath -> IO ()
> process_file filename =
>     do h <- openFile filename ReadMode
>        c <- hGetContents h
>        cs <- return $! lines c
>        hClose h
>        putStrLn $ show $ length cs
> 
> It will give a wrong answer for a large enough file. The short 
> explanation is that seq, or its friend $!, does not evaluate its 
> argument (lines c) entirely; it only evaluates "so much". The jargon is 
> "weak head normal form", but concretely here are some examples: for Int, 
> it evaluates until you have an actual number, which is all good and 
> expected; for lists, it evaluates only until the first cons cell emerges 
> (let's say we know the list will be non-empty). It will not hunt down 
> the rest of the list. (Moreover, it will not even hunt down what's in 
> the cons cell, e.g., the details of the first item of the list. But this 
> is not too important for now.)
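
A small experiment that seems to illustrate the point about lists, using only base. The list xs is contrived for the demonstration: its first cons cell exists but its tail is undefined. The list ys has a complete spine but an undefined element.

```haskell
import Control.Exception (SomeException, evaluate, try)

main :: IO ()
main = do
  -- The first cons cell exists, but the tail is an error.
  let xs = 'a' : undefined :: String
  -- seq stops at the first cons cell, so the bad tail is never touched.
  xs `seq` putStrLn "seq succeeded: only the first cons cell was evaluated"
  -- length has to walk the whole spine, so it does hit the bad tail.
  r <- try (evaluate (length xs)) :: IO (Either SomeException Int)
  putStrLn (either (const "length failed: it needed the rest of the list") show r)
  -- The spine is complete, but the single element is an error.
  let ys = [undefined] :: [Int]
  -- Neither seq nor length looks inside the cons cell.
  ys `seq` print (length ys)
```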
> 
> It still happens to give the right answer for a small enough file, 
> thanks to buffering.
> 
> So here is a chronicle of execution, with confusing details - confusing 
> because two wrongs conspire to make a right, almost:
> 
> 0. open file, hGetContents. Remember that block buffering with a pretty 
> large buffer is the default.
> 
> 1. $! evaluates lines c for the first cons cell. To do that, latent code 
> (the jargon is "thunk") installed by hGetContents is invoked and it 
> reads something. It is in block buffering mode, so it reads a blockful. 
> The first cons cell will only emerge when the first line break is found, 
> so it reads blocks until a block contains a line break. But it does not 
> read more blocks.
> 
> Whatever has been read will be accessible to cs, though maybe not 
> immediately in the form of lists of strings. Part of it is already in 
> that form; the other part is in the form of buffer content plus a thunk 
> to convert the buffer to lists of strings just in time. That thunk 
> intermingles code from the lines function and hGetContents. Perhaps you 
> don't need to know that much. The bottom line is that cs has access to 
> one or more blocks' worth of stuff, which may or may not be the whole 
> file. Exactly how much: as many blocks as it takes to contain the first 
> line break.
> 
> 2. close file. Henceforth no further reading is possible. cs still has 
> access to whatever was read in the above step; it is already in 
> memory and can't be lost. But cs has no access to whatever is not in 
> memory; as far as cs is concerned, it does not exist.
> 
> 3. count the number of lines accessible to cs.
> 
> As examples, here are some scenarios:
> 
> A. The whole file fits into the buffer. You will get the correct count.
> 
> B. Five lines plus a little bit more fit into the buffer. The answer is 
> six.
> 
> C. The first line is very long, or the buffer is very small. The answer 
> is one or two, depending on whether the line break falls in the middle 
> or at the boundary of the buffer.
> 
> To test for these scenarios, you can fudge the buffer size and have fun:
> 
> process_file :: FilePath -> IO ()
> process_file filename =
>     do h <- openFile filename ReadMode
>        hSetBuffering h (BlockBuffering (Just 20))
>        c <- hGetContents h
>        cs <- return $! lines c
>        hClose h
>        putStrLn $ show $ length cs
> 
> There are two conclusions you can draw:
> 
> For a task satisfied by a single pass, and the task traverses the whole 
> file unconditionally: let go of control. Use hGetContents and don't 
> bother to hClose yourself (it will be closed just in time).
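
A minimal sketch of this first conclusion, assuming the task is still the line count. The function name countLines and the file name demo.txt are made up for the example.

```haskell
import System.IO

countLines :: FilePath -> IO Int
countLines filename = do
  h <- openFile filename ReadMode
  c <- hGetContents h
  -- length walks the whole list, which reads the file to the end;
  -- reaching end of file is what closes the handle, so no hClose here.
  return $! length (lines c)

main :: IO ()
main = do
  writeFile "demo.txt" "one\ntwo\nthree\n"  -- hypothetical test file
  n <- countLines "demo.txt"
  print n
```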
> 
> For a task requiring several passes, and you want the whole file read 
> "here and now": seq won't cut it. Some people use "return $! length c" 
> for that. There are also other ways. Consider Data.ByteString.
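
A sketch of the "return $! length c" idiom mentioned above, using only base. The name readWholeFile and the file demo.txt are made up for the example; the tiny buffer is there only to show that the result no longer depends on the buffer size.

```haskell
import System.IO

readWholeFile :: FilePath -> IO String
readWholeFile filename = do
  h <- openFile filename ReadMode
  hSetBuffering h (BlockBuffering (Just 20))  -- deliberately tiny buffer
  c <- hGetContents h
  -- length c visits every character, so the whole file is in memory
  -- before the handle is closed on the next line.
  _ <- return $! length c
  hClose h
  return c

main :: IO ()
main = do
  writeFile "demo.txt" (unlines (map show [1 .. 1000 :: Int]))  -- hypothetical test file
  c <- readWholeFile "demo.txt"
  print (length (lines c))  -- first pass over the contents
  print (length (words c))  -- second pass, after the handle is closed
```

This keeps the entire file in memory as a String, which is costly for big files; that is presumably part of why Data.ByteString is suggested as an alternative.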
> 
> What about a task satisfied by a single pass but it does not necessarily 
> traverse the whole file? Automatic close won't kick in. You will hClose 
> yourself but where to put it is a long story.
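
One possible arrangement for that last case, offered only as a sketch and certainly not the whole story: do the work inside withFile, and force exactly the part of the result you intend to keep before the handle is closed. The name firstLines and the file demo.txt are made up for the example.

```haskell
import Control.Exception (evaluate)
import System.IO

-- Return the first n lines of a file without reading the rest.
firstLines :: Int -> FilePath -> IO [String]
firstLines n filename =
  withFile filename ReadMode $ \h -> do
    c <- hGetContents h
    let wanted = take n (lines c)
    -- Force every character of the lines we keep while the handle is
    -- still open; withFile closes the handle when this block returns.
    _ <- evaluate (length (concat wanted))
    return wanted

main :: IO ()
main = do
  writeFile "demo.txt" (unlines (map show [1 .. 1000 :: Int]))  -- hypothetical test file
  ls <- firstLines 3 "demo.txt"
  print ls
```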