[Haskell-beginners] Processing a list of files the Haskell way

Chaddaï Fouché chaddai.fouche at gmail.com
Sat Mar 10 13:48:22 CET 2012


On Sat, Mar 10, 2012 at 12:55 PM, Michael Schober <Micha-Schober at web.de> wrote:
> Hi everyone,
>
> I'm currently trying to solve a problem in which I have to process a long
> list of files, more specifically I want to compute MD5 checksums for all
> files.
>
> I have code which lists me all the files and holds it in the following data
> structure:
>
> data DirTree = FileNode FilePath | DirNode FilePath [DirTree]
>
> I tried the following:
>
> -- calculates MD5 sums for all files in a dirtree
> addChecksums :: DirTree -> IO [(DirTree,MD5Digest)]
> addChecksums dir = addChecksums' [dir]
>  where
>    addChecksums' :: [DirTree] -> IO [(DirTree,MD5Digest)]
>    addChecksums' [] = return []
>    addChecksums' (f@(FileNode fp):re) = do
>      bytes <- BL.readFile fp
>      rest <- addChecksums' re
>      return ((f,md5 bytes):rest)

You're not computing a file's md5 sum until you have done the same
for all the other files in the directory... And since you're being
lazy, you don't even compute it _at all_ until you ask for it later in
your program.

If readFile weren't lazy, you would need to keep the contents of all
those files in memory until addChecksums has completely finished
(which would be a big problem in itself), but since readFile is lazy,
those files aren't read either until you need their content. But
they're still opened, so you end up with a lot of open handles that
you never close, and open handles are a limited resource in any OS,
so...

What you need to do is compute the md5 sum as soon as you see the
file and before you do anything else, so:

>    addChecksums' (f@(FileNode fp):re) = do
>      bytes <- BL.readFile fp
>      let !md5sum = md5 bytes
>      rest <- addChecksums' re
>      return ((f,md5sum):rest)

The ! before md5sum indicates that this let-binding should be
computed immediately rather than deferred until needed, which is the
norm for let-bindings. Don't forget to add
{-# LANGUAGE BangPatterns #-} at the beginning of your file. Since
md5 reads the file to its end, the handle is automatically closed, so
you shouldn't have the same problem.
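
For reference, here is a minimal, self-contained sketch of the whole
function with the strict binding in place. A few things in it are my
own assumptions and not part of your code: the DirNode case (your
snippet only showed the FileNode one), the "deriving" clause, the main
that takes file names from the command line, and above all "checksum",
a toy stand-in for md5 so that the example only needs base and
bytestring. Swap md5 and MD5Digest back in for real use.

```haskell
{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString.Lazy as BL
import Data.Word (Word32)
import System.Environment (getArgs)

-- Same shape as the DirTree from the question; the deriving clause is
-- only added here so the demo can print and compare results.
data DirTree = FileNode FilePath | DirNode FilePath [DirTree]
  deriving (Show, Eq)

-- ASSUMPTION: toy stand-in for md5, NOT a real digest.
type Checksum = Word32

checksum :: BL.ByteString -> Checksum
checksum = BL.foldl' (\acc w -> acc * 31 + fromIntegral w) 7

-- Computes a checksum for every file in a dirtree, one file at a time.
addChecksums :: DirTree -> IO [(DirTree, Checksum)]
addChecksums dir = go [dir]
  where
    go :: [DirTree] -> IO [(DirTree, Checksum)]
    go [] = return []
    -- ASSUMPTION: directories are handled by queueing their children.
    go (DirNode _ children : rest) = go (children ++ rest)
    go (f@(FileNode fp) : rest) = do
      bytes <- BL.readFile fp
      -- Forced right here: the fold consumes the file to its end, so
      -- the lazy handle is closed before the next file is opened.
      let !sum' = checksum bytes
      others <- go rest
      return ((f, sum') : others)

-- Hypothetical driver: treats each command-line argument as a file.
main :: IO ()
main = do
  paths <- getArgs
  results <- addChecksums (DirNode "." (map FileNode paths))
  mapM_ print results
```

With the bang in place, only one file should be open at any given
time, whatever the size of the tree.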

Note that your solution isn't very "functional-like", but rather
imperative. On the other hand, making it more functional in this
particular case comes with its own brand of subtle difficulties.
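
To give an idea of what "more functional" could look like, here is one
possible sketch, not the only way to do it: a pure function flattens
the tree into a list of paths, and mapM runs a small per-file action
over it. The names "files" and "checksumFile" are hypothetical,
"checksum" is again a toy stand-in for md5 (so that only base and
bytestring are needed), and the result is keyed by FilePath rather
than by DirTree to keep things short. One of the subtle points is
still there: without the call to evaluate, the checksums would stay
unevaluated and the handles would pile up again.

```haskell
import Control.Exception (evaluate)
import qualified Data.ByteString.Lazy as BL
import Data.Word (Word32)

data DirTree = FileNode FilePath | DirNode FilePath [DirTree]

-- ASSUMPTION: toy stand-in for md5, NOT a real digest.
type Checksum = Word32

checksum :: BL.ByteString -> Checksum
checksum = BL.foldl' (\acc w -> acc * 31 + fromIntegral w) 7

-- Pure part: every file path in the tree, depth first.
files :: DirTree -> [FilePath]
files (FileNode fp)        = [fp]
files (DirNode _ children) = concatMap files children

-- Effectful part: read one file and force its checksum on the spot,
-- so the file is consumed before the action returns.
checksumFile :: FilePath -> IO (FilePath, Checksum)
checksumFile fp = do
  bytes <- BL.readFile fp
  s <- evaluate (checksum bytes)
  return (fp, s)

addChecksums :: DirTree -> IO [(FilePath, Checksum)]
addChecksums = mapM checksumFile . files
```

The traversal and the IO are now separate, but the strictness still
has to be spelled out by hand.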

-- 
Jedaï


