[Haskell-cafe] Lazy vs correct IO [Was: A round of golf]
oleg at okmij.org
oleg at okmij.org
Fri Sep 19 02:51:59 EDT 2008
Lennart Augustsson wrote
> main = do
> name:_ <- getArgs
> file <- readFile name
> print $ length $ lines file
Given the stance against top-level mutable variables, I have not
expected to see this Lazy IO code. After all, what could be more against
the spirit of Haskell than a `pure' function with observable side
effects. With Lazy IO, one indeed has to choose between correctness
and performance. The appearance of such code is especially strange
after the evidence of deadlocks with Lazy IO, presented on this list
less than a month ago. Let alone unpredictable resource usage and
reliance on finalizers to close files (forgetting that GHC does not
guarantee that finalizers will be run at all).
Is there an alternative?
-- Counting the lines in a file
import IterateeM
count_nl = liftI $ IE_cont (step 0)
where
step acc (Chunk str) = liftI $ IE_cont (step $! acc + count str)
step acc stream = liftI $ IE_done acc stream
count [] = 0
count ('\n':str) = succ $! count str
count (_:str) = count str
main = do
name:_ <- getArgs
IE_done counter _ <- unIM $ enum_file name >. enum_eof ==<< count_nl
print counter
The function count_nl could have been in the library, but I'm a
minimalist. It is written in a declarative rather than imperative
style, and one easily sees what it does. The above code as well as the
IterateeM library is Haskell98. It does not use any unsafe Haskell
functions whatsoever.
time wc -l /usr/share/dict/words
235882 /usr/share/dict/words
real 0m0.024s
user 0m0.022s
sys 0m0.000s
time ~/Docs/papers/DEFUN08/Wc /usr/share/dict/words
235882
real 0m0.141s
user 0m0.126s
sys 0m0.008s
To compare with lazy IO, the code using readFile gives
time ~/Docs/papers/DEFUN08/Wc /usr/share/dict/words
235882
real 0m0.297s
user 0m0.262s
sys 0m0.023s
So, choosing correctness does not mean losing in performance; in fact,
one may even gain.
Can enumerators compose? Well, we already seen the example above
(enum_file name >. enum_eof)
where the operation (>.)
e1 >. e2 = (==<<) e2 . e1
is a flipped composition if monadic bind were considered a flipped
application.
Here is a more interesting example: count words in all the files whose
names are given on the command line. There may be many files given,
thousands of them.
-- Count the stream. Again, could have been in the library
stream_count :: Monad m => IterateeGM el m Int
stream_count = liftI $ IE_cont (step 0)
where
step acc (Chunk []) = liftI $ IE_cont (step acc)
step acc (Chunk [_]) = liftI $ IE_cont (step $! succ acc)
step acc (Chunk ls) = liftI $ IE_cont (step $! acc + length ls)
step acc stream = liftI $ IE_done acc stream
main = do
names <- getArgs
let enumerators = foldr (\name -> (enum_file name >.)) enum_eof names
IE_done (IE_done counter _) _ <- unIM $ enumerators ==<<
enum_words stream_count
print counter
We notice that the composition of enumerators corresponds to the
`concatenation' of their sources. Declaratively, the meaning of the
above code is:
-- all the given files are concatenated
-- the resulting stream of characters is converted to a stream
of words
-- the stream of words is counted.
Operationally, the code does not open more than one file at a
time. More importantly, the code *never* reads more than 4096
characters at a time. A block of the file is read, split into words,
counted, and only then another chunk is read. After one file is done,
it is closed, and another file is processed. One can see that only one
file is being opened at a time by enabling traces. The processing is
fully incremental.
/usr/local/share/doc/ghc6> find . -name \*.html -print | time xargs ~/Docs/papers/DEFUN08/Wc
3043421
16.99 real 15.83 user 0.71 sys
BTW, the program has counted words in 1169 files.
It is interesting to compare the above main function with the
corresponding lazy IO:
main'' = do
names <- getArgs
files <- mapM readFile names
print $ length $ words (concat files)
The number of lines is comparable. The execution is not. If we try to
run the lazy IO code, we get:
/usr/local/share/doc/ghc6> find . -name \*.html -print | time xargs ~/Docs/papers/DEFUN08/Wc
Wc: ./libraries/Win32/Graphics-Win32-GDI-Path.html:
openFile: resource exhausted (Too many open files)
More information about the Haskell-Cafe
mailing list