[Haskell-cafe] Lazy vs correct IO [Was: A round of golf]

Fri Sep 19 02:51:59 EDT 2008

Lennart Augustsson wrote

> main = do
>    name:_ <- getArgs
>    file <- readFile name
>    print $ length $ lines file

Given the stance against top-level mutable variables, I have not
expected to see this Lazy IO code. After all, what could be more against
the spirit of Haskell than a `pure' function with observable side
effects. With Lazy IO, one indeed has to choose between correctness
and performance. The appearance of such code is especially strange
after the evidence of deadlocks with Lazy IO, presented on this list
less than a month ago. Let alone unpredictable resource usage and
reliance on finalizers to close files (forgetting that GHC does not
guarantee that finalizers will be run at all).

Is there an alternative?

-- Counting the lines in a file
import IterateeM

count_nl = liftI $ IE_cont (step 0)
 where
 step acc (Chunk str)  = liftI $ IE_cont (step $! acc + count str)
 step acc stream       = liftI $ IE_done acc stream
 count [] = 0
 count ('\n':str) = succ $! count str
 count (_:str) = count str

main = do
   name:_ <- getArgs
   IE_done counter _ <- unIM $ enum_file name >. enum_eof ==<< count_nl
   print counter

The function count_nl could have been in the library, but I'm a
minimalist. It is written in a declarative rather than imperative
style, and one easily sees what it does. The above code as well as the
IterateeM library is Haskell98. It does not use any unsafe Haskell
functions whatsoever.

time wc -l /usr/share/dict/words 
  235882 /usr/share/dict/words

real    0m0.024s
user    0m0.022s
sys     0m0.000s

time ~/Docs/papers/DEFUN08/Wc /usr/share/dict/words 
235882

real    0m0.141s
user    0m0.126s
sys     0m0.008s

To compare with lazy IO, the code using readFile gives

time ~/Docs/papers/DEFUN08/Wc /usr/share/dict/words 
235882

real    0m0.297s
user    0m0.262s
sys     0m0.023s

So, choosing correctness does not mean losing in performance; in fact,
one may even gain.

Can enumerators compose? Well, we already seen the example above
	(enum_file name >. enum_eof)
where the operation (>.) 
	e1 >. e2 = (==<<) e2 . e1
is a flipped composition if monadic bind were considered a flipped
application.

Here is a more interesting example: count words in all the files whose
names are given on the command line. There may be many files given,
thousands of them.

-- Count the stream. Again, could have been in the library
stream_count :: Monad m => IterateeGM el m Int
stream_count = liftI $ IE_cont (step 0)
 where
 step acc (Chunk [])  = liftI $ IE_cont (step acc)
 step acc (Chunk [_]) = liftI $ IE_cont (step $! succ acc)
 step acc (Chunk ls)  = liftI $ IE_cont (step $! acc + length ls)
 step acc stream      = liftI $ IE_done acc stream

main = do
   names <- getArgs
   let enumerators = foldr (\name -> (enum_file name >.)) enum_eof names
   IE_done (IE_done counter _) _ <- unIM $ enumerators ==<<
		                             enum_words stream_count
   print counter

We notice that the composition of enumerators corresponds to the
`concatenation' of their sources. Declaratively, the meaning of the
above code is:
	-- all the given files are concatenated
	-- the resulting stream of characters is converted to a stream
of words
	-- the stream of words is counted.

Operationally, the code does not open more than one file at a
time. More importantly, the code *never* reads more than 4096
characters at a time. A block of the file is read, split into words,
counted, and only then another chunk is read. After one file is done,
it is closed, and another file is processed. One can see that only one
file is being opened at a time by enabling traces. The processing is
fully incremental.

/usr/local/share/doc/ghc6> find . -name \*.html -print | time xargs ~/Docs/papers/DEFUN08/Wc
3043421
       16.99 real        15.83 user         0.71 sys

BTW, the program has counted words in 1169 files.

It is interesting to compare the above main function with the
corresponding lazy IO:

main'' = do
   names <- getArgs
   files <- mapM readFile names
   print $ length $ words (concat files)

The number of lines is comparable. The execution is not. If we try to
run the lazy IO code, we get:

/usr/local/share/doc/ghc6> find . -name \*.html -print | time xargs ~/Docs/papers/DEFUN08/Wc
Wc: ./libraries/Win32/Graphics-Win32-GDI-Path.html: 
   openFile: resource exhausted (Too many open files)