Yet another weakly defined bug report

Dean Herington heringto@cs.unc.edu
Fri, 14 Feb 2003 10:24:45 -0500


Yes, getting the right amount of strictness--and in the right places--can be
tricky.  For your program, it seems you should process each file completely
(reflecting its contents strictly in your histogram so the contents can be dropped
and the file descriptor closed) before moving on to the next file.  This would
reduce the number of file descriptors needed to one and limit memory requirements
to O(m), where m is the maximum size of a file.  In addition, you should process
each file incrementally, reducing the memory requirement to O(1) in the file
size (the histogram itself still grows, of course).  Specific comments follow.

"Ketil Z. Malde" wrote:

> Dean Herington <heringto@cs.unc.edu> writes:
>
> > "Ketil Z. Malde" wrote:
>
> >> -- | add data from a file to the histogram
> >> addFile :: FiniteMap String Int -> String -> IO (FiniteMap String Int)
> >> addFile fm name = do
> >>             x <- readFile name
> >>             return (addHist fm x)
>
> I changed this to read x strictly, but it turned out that wasn't quite
> enough.  See below.

Why read a file's contents strictly?  What's important is to tabulate its
contribution to the histogram strictly.  Perhaps

        return $! addHist fm x

might be enough to achieve that (*).
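
In context, that's just your addFile with one line changed:

        addFile :: FiniteMap String Int -> String -> IO (FiniteMap String Int)
        addFile fm name = do
                    x <- readFile name
                    return $! addHist fm x

Note that ($!) evaluates the map only to weak head normal form; in practice
that builds the tree, but it does not evaluate the counts stored in it (see
the next comment).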

> >> -- | add data from all files in a directory to the histogram
> >> addDir :: FiniteMap String Int -> String -> IO (FiniteMap String Int)
> >> addDir fm dir = do
> >>           dc <- getDirectoryContents dir
> >>           fs <- filterM doesFileExist (map ((dir++"/")++) dc)
> >>           foldM addFile fm fs
>
> >> addHist :: FiniteMap String Int -> String -> FiniteMap String Int
> >> addHist fm = foldl add1 fm . words
> >>    where add1 f w = case lookupFM f w of
> >>                                Just n -> addToFM f w (n+1)
> >>                                Nothing -> addToFM f w 1

You should do the counting strictly.  Note that matching n+1 against a plain
variable pattern, as in "case n+1 of n1 -> ...", does not force the sum
(variable patterns are irrefutable); use ($!) instead:

        Just n -> addToFM f w $! n+1

The current limit on how large a directory your program can handle may well be
due to these unevaluated (n+1) thunks.
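
For the record, here is a sketch of addHist with the strict counting above,
plus one further refinement of mine: a hand-rolled strict left fold, so that a
long file cannot build up a chain of intermediate-map thunks either:

        addHist :: FiniteMap String Int -> String -> FiniteMap String Int
        addHist fm = go fm . words
           where go f []     = f
                 go f (w:ws) = let f' = add1 f w in f' `seq` go f' ws
                 add1 f w = case lookupFM f w of
                                Just n  -> addToFM f w $! n+1
                                Nothing -> addToFM f w 1

Combined with the lazy readFile, this consumes each file's contents as they
are read, which is what gets you to O(1) memory in the file size.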

> > The string won't be evaluated until the finite map is demanded.  None of the
> > code you've shown so far makes that demand.
>
> [I'll leave my ramblings here, skip to the ==== for the solution]
>
> Right.  But if I do
>
>         f <- addFile emptyFM "foo"
>
> then no IO should be performed either, right?

Right, except that, as Simon M. mentioned, the file is opened right away, so
any exceptions from opening it are triggered at that point.

>  And when I later do
>
>         print (head (fmToList f))
>
> the whole FM needs to be computed, no?

Perhaps.  You're only demanding the head of the list.  Conceivably, the FM logic
might be able to determine the lowest key/element pair without evaluating the
entire map.

> If this is unsound (too lazy) in any way, I don't see it.  How much
> memory can I expect an FM to consume, anyway?  I would expect
> something like
>
>         (n * sizeof(branch node)) + n*sizeof(key+element)
>
> Which should be roughly equivalent to n*sizeof(key+element), which in
> my case (building a frequency count for words) should be roughly
> equivalent to the number of different words times their average
> length (times 8 bytes per character in a list).  Is this far off?
>
> Experiments show that doing x <- addDir emptyFM on a directory
> containing 13Mb of files, the process grows to about 400Mb.
> Surprisingly, asking for the head requires a lot more memory and takes
> considerable time.
>
> =====
>
> So, what's really happening?  Well, it seems the files were read
> strictly, but the FMs weren't constructed yet.  So the result was a
> lot of files residing in memory, stored as expansive [Char]s.
> Changing addFile to:
>
> > addFile :: FiniteMap String Int -> String -> IO (FiniteMap String Int)
> > addFile fm name = do
> >                 h <- openFile name ReadMode
> >                 x <- hGetContents h
> >                 let f = addHist fm x
> >                 hClose (f `seq` h) -- thanks to Peter Thiemann
> >                 return f

I find the above approach a bit risky, as you are closing the file after having
only shallowly demanded the result of addHist.  My earlier suggestion, return $!
addHist fm x, makes exactly the same shallow demand, but if that demand is
insufficient, it costs performance rather than correctness.  I would recommend
letting the runtime system close the file and just being careful to read to the
end of each file before moving on to the next.
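
If you want to guarantee that the demand is deep, one cheap way (a sketch;
with the strict counting above the extra pass is redundant) is to touch every
count before returning:

        addFile :: FiniteMap String Int -> String -> IO (FiniteMap String Int)
        addFile fm name = do
                    x <- readFile name
                    let f = addHist fm x
                    sum (eltsFM f) `seq` return f

Summing the elements forces every count, and with it the reading of the whole
file; nothing then retains the file's contents, and the runtime system can
close the descriptor on its own.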

> With this, and after fixing a bug in my overlong-word elimination
> code, I can at least plough through the 13Mb of files comfortably,
> but a directory with 50Mb is still out of reach.

When you get the strictness "right", you should suffer no such limit.

> What is the limit on open files, and why?  I think it'd be nice to
> just schedule a huge number of IO operations, and have them all be
> performed when required (i.e. when the FM is first accessed).
> Apparently, my addDir took the trouble to open the files, but not
> generate the FMs -- any idea why?
>
> Finally, the nice thing about Haskell is that the frequency counting
> code was written and tested in less than five minutes.  The downside
> is that getting the IO and strictness right, took around five
> hours. :-/

(*) I've had problems myself with the use of finite maps without sufficient
strictness.  It seems that perhaps the FiniteMap interface should provide some
explicit support for strictness.  Anybody have thoughts on that?
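
For concreteness, the kind of thing I have in mind, sketched here on top of
the existing interface (a real version would live inside the library):

        -- A strict analogue of addToFM_C: force the combined element
        -- before storing it, so no (n+1)-style thunks accumulate.
        addToFM_C' :: Ord key => (elt -> elt -> elt)
                   -> FiniteMap key elt -> key -> elt -> FiniteMap key elt
        addToFM_C' combine fm key elt =
            case lookupFM fm key of
                Just old -> addToFM fm key $! combine old elt
                Nothing  -> addToFM fm key elt

With such a combinator, the counting above becomes simply
addToFM_C' (+) f w 1.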

-- Dean