[Haskell-cafe] Lazy vs correct IO [Was: A round of golf]

Fri Sep 19 11:30:34 EDT 2008

I agree that lazy IO is a can with some worms in it.  But it's not that strange.
The readFile operation is in the IO monad, so it has an effect on the world.
This effect is not finished when readFile returns, and from the world
point of view
it's not entirely deterministic.

On Fri, Sep 19, 2008 at 7:51 AM,  <oleg at okmij.org> wrote:
>
> Lennart Augustsson wrote
>
>> main = do
>>    name:_ <- getArgs
>>    file <- readFile name
>>    print $ length $ lines file
>
> Given the stance against top-level mutable variables, I have not
> expected to see this Lazy IO code. After all, what could be more against
> the spirit of Haskell than a `pure' function with observable side
> effects. With Lazy IO, one indeed has to choose between correctness
> and performance. The appearance of such code is especially strange
> after the evidence of deadlocks with Lazy IO, presented on this list
> less than a month ago. Let alone unpredictable resource usage and
> reliance on finalizers to close files (forgetting that GHC does not
> guarantee that finalizers will be run at all).
>
> Is there an alternative?
>
> -- Counting the lines in a file
> import IterateeM
>
> count_nl = liftI $ IE_cont (step 0)
>  where
>  step acc (Chunk str)  = liftI $ IE_cont (step $! acc + count str)
>  step acc stream       = liftI $ IE_done acc stream
>  count [] = 0
>  count ('\n':str) = succ $! count str
>  count (_:str) = count str
>
> main = do
>   name:_ <- getArgs
>   IE_done counter _ <- unIM $ enum_file name >. enum_eof ==<< count_nl
>   print counter
>
>
> The function count_nl could have been in the library, but I'm a
> minimalist. It is written in a declarative rather than imperative
> style, and one easily sees what it does. The above code as well as the
> IterateeM library is Haskell98. It does not use any unsafe Haskell
> functions whatsoever.
>
> time wc -l /usr/share/dict/words
>  235882 /usr/share/dict/words
>
> real    0m0.024s
> user    0m0.022s
> sys     0m0.000s
>
> time ~/Docs/papers/DEFUN08/Wc /usr/share/dict/words
> 235882
>
> real    0m0.141s
> user    0m0.126s
> sys     0m0.008s
>
> To compare with lazy IO, the code using readFile gives
>
> time ~/Docs/papers/DEFUN08/Wc /usr/share/dict/words
> 235882
>
> real    0m0.297s
> user    0m0.262s
> sys     0m0.023s
>
> So, choosing correctness does not mean losing in performance; in fact,
> one may even gain.
>
> Can enumerators compose? Well, we already seen the example above
>        (enum_file name >. enum_eof)
> where the operation (>.)
>        e1 >. e2 = (==<<) e2 . e1
> is a flipped composition if monadic bind were considered a flipped
> application.
>
>
> Here is a more interesting example: count words in all the files whose
> names are given on the command line. There may be many files given,
> thousands of them.
>
> -- Count the stream. Again, could have been in the library
> stream_count :: Monad m => IterateeGM el m Int
> stream_count = liftI $ IE_cont (step 0)
>  where
>  step acc (Chunk [])  = liftI $ IE_cont (step acc)
>  step acc (Chunk [_]) = liftI $ IE_cont (step $! succ acc)
>  step acc (Chunk ls)  = liftI $ IE_cont (step $! acc + length ls)
>  step acc stream      = liftI $ IE_done acc stream
>
>
> main = do
>   names <- getArgs
>   let enumerators = foldr (\name -> (enum_file name >.)) enum_eof names
>   IE_done (IE_done counter _) _ <- unIM $ enumerators ==<<
>                                             enum_words stream_count
>   print counter
>
> We notice that the composition of enumerators corresponds to the
> `concatenation' of their sources. Declaratively, the meaning of the
> above code is:
>        -- all the given files are concatenated
>        -- the resulting stream of characters is converted to a stream
> of words
>        -- the stream of words is counted.
>
> Operationally, the code does not open more than one file at a
> time. More importantly, the code *never* reads more than 4096
> characters at a time. A block of the file is read, split into words,
> counted, and only then another chunk is read. After one file is done,
> it is closed, and another file is processed. One can see that only one
> file is being opened at a time by enabling traces. The processing is
> fully incremental.
>
>
> /usr/local/share/doc/ghc6> find . -name \*.html -print | time xargs ~/Docs/papers/DEFUN08/Wc
> 3043421
>       16.99 real        15.83 user         0.71 sys
>
> BTW, the program has counted words in 1169 files.
>
> It is interesting to compare the above main function with the
> corresponding lazy IO:
>
> main'' = do
>   names <- getArgs
>   files <- mapM readFile names
>   print $ length $ words (concat files)
>
> The number of lines is comparable. The execution is not. If we try to
> run the lazy IO code, we get:
>
> /usr/local/share/doc/ghc6> find . -name \*.html -print | time xargs ~/Docs/papers/DEFUN08/Wc
> Wc: ./libraries/Win32/Graphics-Win32-GDI-Path.html:
>   openFile: resource exhausted (Too many open files)
>
> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe at haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe
>