[Haskell-cafe] Lazy IO and closing of file handles

Wed Mar 14 20:06:37 EDT 2007

Pete,

>     mapM fileContentsOfDirectory >>=
>     mapM_ print . threadEmails . map parseEmail . concat

By using the IO monad you've /scheduled/ your first 'print' to occur after
your last 'readFile', so every file is opened before the first file is read.

I've come across the same problem and would also be interested in a more
elegant solution.  The problem is that your program assumes a readonly
filesystem, but the compiler doesn't know that.  For all it knows, the
'print' after your first parse may be overwriting the contents of your
second file.

What we need is a library for a readonly filesystem.  That is, all the same
functions but pure.  I believe you could make this readonly library by
wrapping each readonly I/O function with 'unsafeInterleaveIO' or
'unsafePerformIO', but I don't really understand the consequences of using
'unsafe' calls, so I avoided it myself.

Until that's figured out, one solution is as Don suggested, when you read a
file, read the whole file at once so that the runtime can close the file
handle.  This lets you leave the rest of your code alone, but if your files
are too big, as they were for me, this will blow your heap!

My solution (code below) was to rearrange my I/O calls so that each print is
executed immediately after each readFile.  This takes away the modularity
you're looking for and assumes you can process one file at time, but without
a safe readonly filesystem library and files too big to be read all at once,
I think this is about the best we can do for now.  Hopefully others will
disagree!

main = do
   [dir] <- getArgs
   paths <- getFilePaths dir
   mapM_ (either print prettyPrint `oM` parse) paths

getFilePaths = return . flatten `oM` getFileTree

-- Given a path, this returns a tree
getFileTree :: FilePath -> IO (Tree FilePath)
getFileTree = unfoldTreeM childPaths
   where
      childPaths dir = do
         fs <- if' (getDirectoryContents dir) (return []) =<<
doesDirectoryExist dir
         return (dir, [dir ++ "/" ++ p | p <- fs, head p /= '.'])

infixl 8 `oM`
(a `oM` b) x = b x >>= a

if' t f b = if b then t else f

Thanks,
Greg

On 3/14/07, Pete Kazmier <pete-expires-20070513 at kazmier.com> wrote:
>
> When using readFile to process a large number of files, I am exceeding
> the resource limits for the maximum number of open file descriptors on
> my system.  How can I enhance my program to deal with this situation
> without making significant changes?
>
> My program takes one or more directories that contain email messages,
> stored one per file, and prints a list of all the email threads.  Here
> is a snippet of output:
>
>   New addition to the Kazmier family
>     Casey Kazmier
>
>   Memoization in Erlang?
>     Thomas Johnsson
>     Ulf Wiger \(AL/EAB\)
>
> As a newcomer to Haskell, I am intrigued by lazy evaluation and how it
> can influence one's designs.  With that said, I wrote the program as a
> sequence of list manipulations which seemed quite natural to do in
> Haskell starting with reading the contents of the each file.  Here is
> the algorithm at the high level:
>
>   1. Read contents of all files returning a list of Strings
>   2. Map over the list and parse each String as an Email
>   3. Sort the list of Emails
>   4. Group the list of Emails by Subject
>   5. Map over the grouped list to create a list of Threads
>   6. Finally, print the list of Threads
>
> It is my understanding that, as a result of lazy IO, the entire file
> does not need to be read into memory because parseEmail only inspects
> the topmost portion of the email (its headers), which is a key part of
> my design as some of the files can be quite large. Unfortunately, as
> soon as I run this program on a directory with more than 1024 files,
> GHC craps out on me due to resource limits.  It seems that the handles
> opened by readFile remain open.  Would this be common across all
> Haskell implementations?
>
> How do I go about fixing this without making a significant number of
> changes to my program?  Did I make a mistake in steps 1 and 2 above?
> Should I have read and parsed a single file at a time, and then move
> on to the next?
>
> I'd appreciate any other comments on the program as well.  I feel this
> is the best example of Haskell code that I have written.  Compared to
> the first version of this program I wrote a few months ago, this is a
> hundred times better.
>
> Here is the program:
>
> > module Main where
> >
> > import Control.Monad (filterM, liftM)
> > import Data.List
> > import Data.Maybe
> > import System.Directory
> > import System.Environment
> >
> > type From    = String
> > type Subject = String
> > data Email   = Email {from :: From, subject :: Subject} deriving Show
> > data Thread  = Thread [Email]
> >
> > instance Show Thread where
> >     show (Thread emails@(e:es)) = title ++ senders
> >         where
> >           title   = newline . bolder . subject $ e
> >           sender  = newline . indent . from
> >           senders = concatMap sender emails
> >           newline = (++ "\n")
> >           indent  = ("  " ++)
> >           bolder  = ("\27[0;32;40m" ++) . (++ "\27[0m")
> >
> > main =
> >     getArgs                      >>=
> >     mapM fileContentsOfDirectory >>=
> >     mapM_ print . threadEmails . map parseEmail . concat
> >
> > fileContentsOfDirectory :: FilePath -> IO [String]
> > fileContentsOfDirectory dir =
> >     setCurrentDirectory  dir >>
> >     getDirectoryContents dir >>=
> >     filterM doesFileExist    >>=  -- ignore directories
> >     mapM readFile
> >
> > parseEmail :: String -> Email
> > parseEmail text =
> >     Email (getHeader "From") (getHeader "Subject")
> >         where
> >           getHeader = fromMaybe "N/A" . flip lookup headers
> >           headers   = concatMap mkassoc . takeWhile (/="") $ lines text
> >           mkassoc s = case findIndex (==':') s of
> >                         Just n  -> [(take n s, drop (n+2) s)]
> >                         Nothing -> []
> >
> > threadEmails :: [Email] -> [Thread]
> > threadEmails =
> >     map Thread . groupBy (fuzzy (==)) . sortBy (fuzzy compare)
> >         where
> >           fuzzy fn e e' = stripReFwd (subject e) `fn` stripReFwd
> (subject e')
> >           stripReFwd    = stripSpaces . reverse . stripToColon . reverse
> >           stripSpaces   = dropWhile (==' ')
> >           stripToColon  = takeWhile (/=':')
>
>
>
>
> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe at haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.haskell.org/pipermail/haskell-cafe/attachments/20070314/efab0f02/attachment-0001.htm