[Haskell-cafe] RE: readFile and closing a file
dons at galois.com
Mon Sep 22 12:39:21 EDT 2008
> > On Wed, 17 Sep 2008, Mitchell, Neil wrote:
> >> I tend to use openFile, hGetContents, hClose - your initial readFile
> >> like call should be openFile/hGetContents, which gives you a lazy
> >> stream, and on a parse error call hClose.
> > I could use a function like
> > withReadFile :: FilePath -> (Handle -> IO a) -> IO a
> > withReadFile name action = bracket openFile hClose ...
> > Then, if 'action' fails, the file can be properly closed. However, there
> > is still a problem: Say, 'action' is a parser which produces a data
> > structure lazily. Then further processing of that data structure of type
> > 'a' may again stop before completing the whole structure, which would also
> > leave the file open. We have to force users to do all processing within
> > 'action' and to only return strict values. But how to do this?
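
For reference, the elided definition above could be completed roughly as follows. This is only a sketch: the body of withReadFile is my reading of the "..." (it ends up equivalent to System.IO.withFile in ReadMode), and countChars is a hypothetical helper added to show a result being forced before bracket closes the handle.

```haskell
import Control.Exception (bracket, evaluate)
import System.IO (Handle, IOMode (ReadMode), hClose, hGetContents, openFile)

-- Open the file, run the action, and close the handle afterwards,
-- even if the action throws an exception.
withReadFile :: FilePath -> (Handle -> IO a) -> IO a
withReadFile name action = bracket (openFile name ReadMode) hClose action

-- Hypothetical helper: the lazy stream is consumed completely inside
-- the action, so nothing unevaluated escapes past hClose.
countChars :: FilePath -> IO Int
countChars path = withReadFile path $ \h -> do
  contents <- hGetContents h
  evaluate (length contents)
```

Returning contents itself from the action instead of length contents would hand the caller a stream whose handle is already closed, which is exactly the problem described above.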
> I used rnf from Control.Parallel.Strategies when dealing with a
> similar problem. Would it work in your case?
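
A sketch of what the rnf idea might look like. Assumptions: in current libraries rnf and force come from Control.DeepSeq (the deepseq package) rather than Control.Parallel.Strategies, and parseFileStrict is a hypothetical name, not an existing library function.

```haskell
import Control.DeepSeq (NFData, force)
import Control.Exception (evaluate)
import System.IO (IOMode (ReadMode), hGetContents, withFile)

-- Read a file lazily, apply a pure parser, and evaluate the parse
-- result to normal form before withFile closes the handle.
parseFileStrict :: NFData a => FilePath -> (String -> a) -> IO a
parseFileStrict path parse =
  withFile path ReadMode $ \h -> do
    contents <- hGetContents h
    evaluate (force (parse contents))
```

The NFData constraint is what lets the type say "only return strict values": a caller cannot smuggle out an unevaluated structure, at the price of holding the complete result in memory.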
> To merge discussion from a related thread:
> IMO, the question is how much should a language/library prevent the
> user from shooting himself in the foot? The biggest problem with lazy
> IO, IMO, is that it presents several opportunities to do so. The
> three biggest causes I've dealt with are handle leaks, insufficiently
> evaluated data structures, and problems with garbage collection as in
> the naive 'mean xs = sum xs / length xs' implementation.
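
To make the 'mean' case concrete, here is a small sketch contrasting the two-pass version with a single strict fold. The names meanNaive and Acc are mine, and fromIntegral is added because length returns an Int.

```haskell
import Data.List (foldl')

-- Two traversals: xs is shared between sum and length, so the whole
-- list stays reachable until the second traversal has finished.
meanNaive :: [Double] -> Double
meanNaive xs = sum xs / fromIntegral (length xs)

-- Strict fields keep both running totals evaluated at every step.
data Acc = Acc !Double !Int

-- One traversal: each element can be collected as soon as it has been
-- added to the accumulator. An empty list yields NaN (0 / 0).
mean :: [Double] -> Double
mean xs = total / fromIntegral count
  where
    Acc total count = foldl' step (Acc 0 0) xs
    step (Acc s n) x = Acc (s + x) (n + 1)
```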
> There are some idioms that can help with the first two cases, namely
> the 'with*' paradigm and 'rnf', but the third problem requires that
> the programmer know how stuff works to avoid poor implementations.
> While that's not bad per se, in some cases I think it's far too easy
> for the unwitting, or even the slightly distracted, to get caught in
>
> I'm facing a design decision ATM related to this. I can use something
> like lazy bytestrings, in which the chunkiness and laziness are reified
> into the data structure, or an Iterator-style fold for consuming data.
> The advantage of the former approach is that it's well understood by
> most users and has proven good performance, while on the downside I
> could see it easily leading to memory exhaustion. I think the problem
> with lazy bytestrings, in particular, is that the foldChunks is so
> well hidden from most consumers that it's easy to hold references that
> prevent consumed chunks from being reclaimed by the GC. When dealing
> with data in the hundreds of MBs, or GB range, this is a problem.
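
For readers who have not met the chunk-level fold, a minimal sketch using foldlChunks from Data.ByteString.Lazy. The helper names byteCount and chunkCount are hypothetical.

```haskell
import qualified Data.ByteString as S
import qualified Data.ByteString.Lazy as L

-- Add up the sizes of the strict chunks, visiting one chunk at a time.
byteCount :: L.ByteString -> Int
byteCount = L.foldlChunks (\n chunk -> n + S.length chunk) 0

-- Count how many strict chunks make up the lazy bytestring.
chunkCount :: L.ByteString -> Int
chunkCount = L.foldlChunks (\n _ -> n + 1) 0
```

A fold like this can run in constant space only if nothing else refers to the lazy bytestring. If the caller keeps a second reference (to run another pass later, say), every chunk already visited stays reachable, which is the retention problem described above.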
> An Enumerator, on the other hand, makes the fold explicit, so users
> are required to think about the best way to consume data. It's much
> harder to unintentionally hold references. This is quite appealing.
> Based on my own tests so far performance is certainly competitive.
> Assuming a good implementation, handle leaks can also be prevented.
> On the downside, it's a very poor model if random access is required,
> and users aren't as familiar with it, in addition to some of the
> questions Don raises.
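
A rough sketch of the kind of explicit fold being discussed, under my own assumptions: Step, foldFile and countNewlines are hypothetical names, not the API of any existing enumerator library, and the step function is kept pure for brevity.

```haskell
import qualified Data.ByteString as B
import System.IO (IOMode (ReadMode), withBinaryFile)

-- The consumer's verdict after each chunk: keep going or stop early.
data Step a = Continue a | Done a

-- Left fold over a file in strict chunks of at most chunkSize bytes
-- (chunkSize should be positive). The accumulator is forced on every
-- iteration, and withBinaryFile closes the handle on normal exit,
-- on early termination, and on exceptions.
foldFile :: Int -> (a -> B.ByteString -> Step a) -> a -> FilePath -> IO a
foldFile chunkSize step z path = withBinaryFile path ReadMode (go z)
  where
    go acc h = acc `seq` do
      chunk <- B.hGet h chunkSize
      if B.null chunk
        then return acc -- end of file
        else case step acc chunk of
          Continue acc' -> go acc' h
          Done acc' -> return acc'

-- Example consumer: count newline bytes (10) using 64 KiB chunks.
countNewlines :: FilePath -> IO Int
countNewlines = foldFile 65536 (\n chunk -> Continue (n + B.count 10 chunk)) 0
```

The consumer never sees the handle or the stream as a whole, only one chunk at a time, so there is nothing for it to retain by accident and no way for the file to stay open after the fold returns.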
Yes, I'm certain we can reach the performance of, or outperform, lazy
(cache-sized chunk) bytestrings using enumerators on chunks, but the
model is somewhat unfamiliar. Structuring the API such that people can
write programs in this style will be the challenge.