[Haskell-cafe] RE: readFile and closing a file

Mon Sep 22 12:39:21 EDT 2008

jwlato:
> > On Wed, 17 Sep 2008, Mitchell, Neil wrote:
> >
> >> I tend to use openFile, hGetContents, hClose - your initial
> >> readFile-like call should be openFile/hGetContents, which gives you a
> >> lazy stream, and on a parse error call hClose.
> >
> > I could use a function like
> >   withReadFile :: FilePath -> (Handle -> IO a) -> IO a
> >   withReadFile name action = bracket (openFile name ReadMode) hClose action
> >
> > Then, if 'action' fails, the file can be properly closed. However, there
> > is still a problem: Say, 'action' is a parser which produces a data
> > structure lazily. Then further processing of that data structure of type
> > 'a' may again stop before completing the whole structure, which would also
> > leave the file open. We have to force users to do all processing within
> > 'action' and to only return strict values. But how to do this?
> 
> I used rnf from Control.Parallel.Strategies when dealing with a
> similar problem.  Would it work in your case?
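
A minimal sketch of the withReadFile/rnf combination discussed above. It
assumes the deepseq package, which provides rnf and force in current GHC
distributions (the post refers to rnf from Control.Parallel.Strategies).
withReadFileStrict and countLines are hypothetical names used only for
illustration.

```haskell
import Control.DeepSeq (NFData, force)
import Control.Exception (bracket, evaluate)
import System.IO (Handle, IOMode (ReadMode), hClose, hGetContents, openFile)

-- | Open a file, run the action on its handle, and close the handle
-- afterwards, even if the action throws an exception.
withReadFile :: FilePath -> (Handle -> IO a) -> IO a
withReadFile name = bracket (openFile name ReadMode) hClose

-- | Hypothetical variant: the result of the action is evaluated to normal
-- form while the handle is still open, so no unevaluated part of the
-- result can refer to the closed handle.
withReadFileStrict :: NFData a => FilePath -> (Handle -> IO a) -> IO a
withReadFileStrict name action =
  withReadFile name $ \h -> action h >>= evaluate . force

-- | Example use: count the lines of a file. The lazy String produced by
-- hGetContents is consumed completely before hClose runs.
countLines :: FilePath -> IO Int
countLines path =
  withReadFileStrict path $ \h -> length . lines <$> hGetContents h
```

The NFData constraint only helps for result types whose instance really
evaluates everything; a result that contains functions or IO actions can
still carry deferred reads out of the bracket.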
> 
> To merge discussion from a related thread:
> 
> IMO, the question is how much should a language/library prevent the
> user from shooting himself in the foot?  The biggest problem with lazy
> IO, IMO, is that it presents several opportunities to do so.  The
> three biggest causes I've dealt with are handle leaks, insufficiently
> evaluated data structures, and problems with garbage collection as in
> the naive 'mean xs = sum xs / length xs' implementation.
> 
> There are some idioms that can help with the first two cases, namely
> the 'with*' paradigm and 'rnf', but the third problem requires that
> the programmer know how stuff works to avoid poor implementations.
> While that's not bad per se, in some cases I think it's far too easy
> for the unwitting, or even the slightly distracted, to get caught in
> traps.
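
To illustrate the third problem, here is a sketch of the naive mean next to
a single-pass version. The names meanNaive and Acc are made up for this
example, and fromIntegral is added because length returns an Int. The
single-pass version is one common way around the problem, not the only one.

```haskell
import Data.List (foldl')

-- | Naive version: xs is traversed by sum and then again by length, so
-- the whole list has to stay in memory between the two traversals.
meanNaive :: [Double] -> Double
meanNaive xs = sum xs / fromIntegral (length xs)

-- | Running sum and element count, with strict fields so that no chain of
-- unevaluated additions builds up during the fold.
data Acc = Acc !Double !Int

-- | Single pass: each element can be collected as soon as it has been
-- added to the accumulator.
mean :: [Double] -> Double
mean xs = total / fromIntegral count
  where
    Acc total count = foldl' step (Acc 0 0) xs
    step (Acc t c) x = Acc (t + x) (c + 1)
```

Both versions return NaN for an empty list; a real API would probably want
to handle that case explicitly.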
> 
> I'm facing a design decision related to this at the moment.  I can use
> something like lazy bytestrings, in which the chunkiness and laziness are
> reified into the data structure, or an Iterator-style fold for consuming data.
> The advantage of the former approach is that it's well understood by
> most users and has proven good performance, while on the downside I
> could see it easily leading to memory exhaustion.  I think the problem
> with lazy bytestrings, in particular, is that the foldChunks is so
> well hidden from most consumers that it's easy to hold references that
> prevent consumed chunks from being reclaimed by the GC.  When dealing
> with data in the hundreds of MBs, or GB range, this is a problem.
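
A small sketch of how a reference to a lazy bytestring can be held by
accident, and of a fold over the chunks that avoids it. It assumes the
bytestring package; newlineRatioLeaky, newlineRatio and Counts are
hypothetical names for illustration only.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as L
import Data.List (foldl')

-- | bs is traversed twice (count, then length), so every chunk stays
-- reachable until the second traversal has finished.
newlineRatioLeaky :: L.ByteString -> Double
newlineRatioLeaky bs =
  fromIntegral (L.count 10 bs) / fromIntegral (L.length bs)

-- | Newline count and total byte count, kept strict.
data Counts = Counts !Int !Int

-- | One pass over the chunk list: a chunk is no longer needed once it has
-- been folded into the counters.
newlineRatio :: L.ByteString -> Double
newlineRatio bs = fromIntegral newlines / fromIntegral total
  where
    Counts newlines total = foldl' step (Counts 0 0) (L.toChunks bs)
    step (Counts n t) chunk =
      Counts (n + B.count 10 chunk) (t + B.length chunk)
```

The second version only helps if the caller does not keep its own reference
to bs alive somewhere else.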
> 
> An Enumerator, on the other hand, makes the fold explicit, so users
> are required to think about the best way to consume data.  It's much
> harder to unintentionally hold references.  This is quite appealing.
> Based on my own tests so far performance is certainly competitive.
> Assuming a good implementation, handle leaks can also be prevented.
> On the downside, it's a very poor model if random access is required,
> and users aren't as familiar with it, in addition to some of the
> questions Don raises.
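
As a rough sketch of the explicit-fold style described above, assuming the
bytestring package: the function below reads a file in strict chunks and
hands each chunk to a step function, with bracket closing the handle.
foldFileChunks and fileSize are hypothetical names, and this is a plain left
fold, not a full Enumerator (it has no early termination, for example).

```haskell
import Control.Exception (bracket)
import qualified Data.ByteString as B
import System.IO (IOMode (ReadMode), hClose, openBinaryFile)

-- | Fold a step function over the file's contents, one strict chunk of at
-- most chunkSize bytes at a time. The handle is closed when the fold ends
-- or when an exception is thrown.
foldFileChunks
  :: Int                        -- ^ chunk size in bytes (should be positive)
  -> (acc -> B.ByteString -> acc)
  -> acc
  -> FilePath
  -> IO acc
foldFileChunks chunkSize step z path =
  bracket (openBinaryFile path ReadMode) hClose (loop z)
  where
    loop acc h = do
      chunk <- B.hGet h chunkSize
      if B.null chunk
        then return acc                -- an empty chunk signals end of file
        else do
          let acc' = step acc chunk
          acc' `seq` loop acc' h       -- force the accumulator (to WHNF) each step

-- | Example consumer: the number of bytes in a file.
fileSize :: FilePath -> IO Int
fileSize = foldFileChunks 32768 (\n chunk -> n + B.length chunk) 0
```

Since the consumer only ever sees one chunk at a time, it has to go out of
its way to retain earlier chunks, which is the property the quoted paragraph
is after.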

Yes, I'm certain we can reach the performance of, or outperform, lazy
(cache-sized chunk) bytestrings using enumerators on chunks, but the
model is somewhat unfamiliar. Structuring the API such that people can
write programs in this style will be the challenge.

-- Don