Yet another weakly defined bug report

Ketil Z. Malde ketil@ii.uib.no
14 Feb 2003 09:50:07 +0100


Dean Herington <heringto@cs.unc.edu> writes:

> "Ketil Z. Malde" wrote:

>> -- | add data from a file to the histogram
>> addFile :: FiniteMap String Int -> String -> IO (FiniteMap String Int)
>> addFile fm name = do
>>             x <- readFile name
>>             return (addHist fm x)

I changed this to read x strictly, but it turned out that wasn't quite
enough.  See below.
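
Roughly, the strict version looked like this (the details may have
differed, but the idea was to force the string with length before
returning):

        addFile :: FiniteMap String Int -> String -> IO (FiniteMap String Int)
        addFile fm name = do
                x <- readFile name
                -- length forces the whole file to be read right away,
                -- but the addHist thunk is still unevaluated
                length x `seq` return (addHist fm x)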

>> -- | add data from all files in a directory to the histogram
>> addDir :: FiniteMap String Int -> String -> IO (FiniteMap String Int)
>> addDir fm dir = do
>>           dc <- getDirectoryContents dir
>>           fs <- filterM doesFileExist (map ((dir++"/")++) dc)
>>           foldM addFile fm fs

>> addHist :: FiniteMap String Int -> String -> FiniteMap String Int
>> addHist fm = foldl add1 fm . words
>>    where add1 f w = case lookupFM f w of
>>                                Just n -> addToFM f w (n+1)
>>                                Nothing -> addToFM f w 1

> The string won't be evaluated until the finite map is demanded.  None of the
> code you've shown so far makes that demand.

[I'll leave my ramblings here, skip to the ==== for the solution]

Right.  But if I do

        f <- addFile emptyFM "foo"
        
then no IO should be performed either, right?  And when I later do

        print (head (fmToList f))

the whole FM needs to be computed, no?

If this is unsound (too lazy) in any way, I don't see it.  How much
memory can I expect an FM to consume, anyway?  Since a FiniteMap is a
balanced binary tree with one node per entry, I would expect
something like

        n * (sizeof(node) + sizeof(key+element))

and since the per-node overhead is only a few words, that should be
roughly equivalent to n*sizeof(key+element), which in my case
(building a frequency count for words) should be roughly the number
of different words times their average length (times 8 bytes per
character in a list).  Is this far off?
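
(For concreteness: with, say, 100000 distinct words of average length
8, that estimate gives

        100000 * 8 * 8 bytes = 6.4Mb

or so -- nothing near the 400Mb I observe below.)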

Experiments show that after doing x <- addDir emptyFM on a directory
containing 13Mb of files, the process grows to about 400Mb.
Surprisingly, asking for the head then requires a lot more memory and
takes considerable time.

=====

So, what's really happening?  Well, it seems the files were read
strictly, but the FMs weren't constructed yet.  The result was a lot
of whole files residing in memory, stored as space-hungry [Char]s.
Changing addFile to:

> addFile :: FiniteMap String Int -> String -> IO (FiniteMap String Int)
> addFile fm name = do
>           h <- openFile name ReadMode
>           x <- hGetContents h
>           let f = addHist fm x
>           hClose (f `seq` h) -- thanks to Peter Thiemann
>           return f
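
(The point of the seq is that f -- and hence the whole file -- has to
be evaluated before the handle is closed: hClose only needs h, but
evaluating its argument forces f first.)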


With this, and after fixing a bug in my overlong-word elimination
code, I can at least plough through the 13Mb of files comfortably,
but a directory with 50Mb is still out of reach.
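
A remaining suspect is that add1 stores (n+1) thunks in the map, so
each count is a chain of unevaluated additions hanging on to its
history.  A stricter variant (untried, just a sketch) would force
each count before inserting it:

        addHist' :: FiniteMap String Int -> String -> FiniteMap String Int
        addHist' fm = foldl add1 fm . words
            where
              add1 f w = n `seq` addToFM f w n
                where n = case lookupFM f w of
                            Just m  -> m + 1
                            Nothing -> 1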

What is the limit on open files, and why?  I think it'd be nice to be
able to just schedule a huge number of IO operations, and have them
all be performed when required (i.e. when the FM is first accessed).
Apparently, my addDir took the trouble to open the files, but not to
generate the FMs -- any idea why?
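
I suppose unsafeInterleaveIO is the mechanism for that sort of
deferred IO.  An untried sketch, postponing the whole addFile until
its result is demanded:

        import System.IO.Unsafe (unsafeInterleaveIO)

        addFileLater :: FiniteMap String Int -> String -> IO (FiniteMap String Int)
        addFileLater fm name = unsafeInterleaveIO (addFile fm name)

Since addFile closes its handle as soon as the map is built, the files
should then be opened and closed one at a time as the chain of maps
gets forced.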

Finally, the nice thing about Haskell is that the frequency-counting
code was written and tested in less than five minutes.  The downside
is that getting the IO and strictness right took around five
hours. :-/

-kzm
-- 
If I haven't seen further, it is by standing in the footprints of giants