Histogram-building code (was: Re: Yet another weakly defined bug report)

Ketil Z. Malde ketil@ii.uib.no
18 Feb 2003 10:56:46 +0100

Just a quick status report, and to note a couple of lessons learned:

Things work adequately, as far as I can tell.  I can now process heaps
of data without blowing up anything.  It appears to be faster than
spam-stat.el, at least, although I haven't measured.

I'm back to using "readFile" for file IO, and it works nicely, as long
as I make sure the whole file is processed.  I think this is a good way
of processing large amounts of data (where the processing reduces the
data size); reading the entire file into memory strictly is quickly
going to be too costly (expanded to linked lists of Unicode, ugh).
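
For the archives, a minimal sketch of the pattern I mean, not the actual
program.  It assumes Data.Map.Strict from the containers package as a
stand-in for FiniteMap, and the names wordHistogram and histogramFile
are made up for illustration.

```haskell
import Data.List (foldl')
import qualified Data.Map.Strict as Map

-- Count word occurrences with a strict left fold, so the accumulator
-- is evaluated at every step instead of piling up thunks.
-- (Data.Map.Strict is an assumption here; it stands in for FiniteMap.)
wordHistogram :: String -> Map.Map String Int
wordHistogram = foldl' bump Map.empty . words
  where
    bump m w = Map.insertWith (+) w 1 m

-- Hypothetical driver: readFile is lazy, so the contents are pulled in
-- only as the fold demands them.
histogramFile :: FilePath -> IO (Map.Map String Int)
histogramFile path = do
  contents <- readFile path
  let hist = wordHistogram contents
  -- Evaluating the result completes the fold, which in turn reads the
  -- file all the way to the end.
  hist `seq` return hist
```

The point of the final seq is the "make sure the whole file is
processed" part: nothing is left half-read once histogramFile returns.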

Don't trust FiniteMap to evaluate anything.  I have evidence that one of
the major space leaks was FM only evaluating the strings used as keys
to the point where they were proved unique.  (Is this right?)  Strictifying
the strings helped a lot.
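
Roughly what I mean by strictifying, as a sketch under the assumption
that the diagnosis above is right.  Again Data.Map.Strict stands in for
FiniteMap, and forceString and insertStrict are hypothetical names.

```haskell
import Data.List (foldl')
import qualified Data.Map.Strict as Map

-- Fully evaluate a String: the foldr walks the whole list and forces
-- every character, so no unevaluated tail is left hanging off the key.
forceString :: String -> String
forceString s = foldr seq () s `seq` s

-- Force the key completely before it goes into the map.  Key
-- comparison alone only looks as far as the first differing character.
insertStrict :: String -> Map.Map String Int -> Map.Map String Int
insertStrict w m =
  let k = forceString w
  in k `seq` Map.insertWith (+) k 1 m

-- Build a histogram from a list of tokens using the strict insert.
strictHistogram :: [String] -> Map.Map String Int
strictHistogram = foldl' (flip insertStrict) Map.empty
```

Something like strictHistogram (words contents) would be the use.  A
key produced lazily from the input can otherwise keep a reference to the
rest of the input alive through its unevaluated tail.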

One question though, about hFlush.  I print out the status by
repeatedly putStr'ing "blah blah \r".  With NoBuffering set, it works,
but when following each putStr with 'hFlush stdout', it doesn't (it
only outputs very sporadically).  I guess I'm misunderstanding the
function of hFlush; anybody care to elaborate?
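
To make the question concrete, this is the shape of what I'm doing, as
a cut-down sketch.  The names statusLine and reportProgress and the
message text are invented for illustration; only hPutStr and hFlush from
System.IO are real.  It shows the pattern, not an explanation of the
sporadic output.

```haskell
import System.IO (Handle, hFlush, hPutStr)

-- Build a status message ending in a carriage return, so the next
-- message overwrites it on the same terminal line.
statusLine :: Int -> Int -> String
statusLine done total =
  "processed " ++ show done ++ "/" ++ show total ++ "\r"

-- Write the status and then ask for buffered output to be sent out
-- right away, which is what I understand hFlush to be for.
reportProgress :: Handle -> Int -> Int -> IO ()
reportProgress h done total = do
  hPutStr h (statusLine done total)
  hFlush h
```

Called as, say, mapM_ (\n -> reportProgress stdout n 100) [1..100],
with stdout also imported from System.IO.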

And a final lesson: unlike cockroaches, computer bugs hide in the light as
well as in the darkness.  One bug in the very trivial token parsing code
caused a lot of words that should have been ignored to be included.

Thanks to everybody who helped out.

If I haven't seen further, it is by standing in the footprints of giants