[Haskell-beginners] problem with System.Directory.Tree

Drew Haven drew.haven at gmail.com
Mon Jun 7 19:15:21 EDT 2010


I did something similar: I built up an md5sum of every file in a
directory so I could check whether two directories were identical (I
was cleaning up some server storage).  One difference is that I only
read the first 4096 bytes of each file, because if files are going to
differ they will likely differ in those bytes (and definitely would in
my case).  That is also the default page size, if I recall, so even
if you use hGet handle 512, the system still reads 4096 bytes into
memory anyway, so why not use them?

I think I had a similar problem to yours with open file handles until
I used `withFile` from System.IO.  This handy function takes care of
closing the file handle for me, so I don't end up with a ton of open
file handles.  My getFileHash function is as follows:

getFileHash :: FilePath -> IO (Maybe String)
getFileHash path =
   (do
       contents <- withFile path ReadMode (\h -> hGet h 4096)
       return . Just $! md5sum contents)
   `catch` (\e -> printFileError e
               >> return Nothing)

printFileError is just a function for printing out pretty errors
related to files.

You can see that it reads the first part of the file through withFile
and then md5sums it.  I have the $! to force evaluation so each sum is
computed as we go; otherwise it builds a huge tree of unevaluated sums
that are only computed when the result is displayed at the root.
There are other $! operators in the tree operations to collapse at that level,
and now the program runs in constant memory space.
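
A minimal, self-contained sketch of the same idea follows.  It is not
the original code: toyHash is a hypothetical stand-in (32-bit FNV-1a)
for md5sum so that only packages shipped with GHC are needed,
readPrefix and prefixHash are made-up names, and I/O errors are
silently turned into Nothing instead of going through printFileError.

```haskell
import Control.Exception (IOException, try)
import Data.Bits (xor)
import qualified Data.ByteString as B
import Data.Word (Word32)
import System.IO (IOMode (ReadMode), withBinaryFile)

-- Hypothetical stand-in for md5sum: a 32-bit FNV-1a hash over the bytes.
toyHash :: B.ByteString -> Word32
toyHash = B.foldl' step 2166136261
  where
    step h b = (h `xor` fromIntegral b) * 16777619

-- Read at most the first 4096 bytes.  B.hGet is strict, so the bytes are
-- already in memory when withBinaryFile closes the handle.
readPrefix :: FilePath -> IO B.ByteString
readPrefix path = withBinaryFile path ReadMode (\h -> B.hGet h 4096)

-- Hash the prefix; Nothing means the file could not be read.
prefixHash :: FilePath -> IO (Maybe Word32)
prefixHash path = do
  r <- try (readPrefix path) :: IO (Either IOException B.ByteString)
  case r of
    Left _ -> return Nothing
    -- $! forces the hash now instead of leaving a thunk behind.
    Right bytes -> return $! Just $! toyHash bytes
```

Because the read is strict and scoped by withBinaryFile, at most one
handle is open per call, no matter how many files the directory
traversal visits.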

--
Drew Haven
drew.haven at gmail.com



On Mon, Jun 7, 2010 at 5:06 AM, Anand Mitra <anand.mitra at gmail.com> wrote:
> Hello All,
>
> I want to build a program which will recursively scan a directory and
> build md5sum for all the files. The intent is to do something similar
> to unison but more specific to my requirements. I am having trouble in
> the initial part of building the md5sums.
>
> I did some digging around and found that "System.Directory.Tree" is a
> very close match for what I want to do. In fact after a little poking
> around I could do exactly what I wanted.
>
> ,----
> | import Monad
> | import System.Directory.Tree
> | import System.Directory
> | import Data.Digest.Pure.MD5
> | import qualified Data.ByteString.Lazy.Char8 as L
> |
> | calcMD5 =
> |     readDirectoryWith (\x-> liftM md5 (L.readFile x))
> `----
>
> This works perfectly for small directories. readDirectoryWith is
> already defined in the library and is exactly what we want.
>
> ,----
> | *Main> calcMD5 "/home/mitra/Desktop/"
> |
> | "/home/mitra" :/ Dir {name = "Desktop", contents = [File {name =
> | "060_LocalMirror_Workflow.t.10.2.62.9.log", file =
> | f687ad04bc64674134e55c9d2a06902a},File {name = "cmd_run", file =
> | 6f334f302b5c0d2028adeff81bf2a0d9},File {name = "cmd_run~",
> `----
>
> However, whenever I give it something more challenging it gets into
> trouble.
>
> ,----
> | *Main> calcMD5 "/home/mitra/laptop/"
> | *** Exception: /home/mitra/laptop/ell/calc-2.02f/calc.info-27:
> |    openFile: resource exhausted (Too many open files)
> | *Main> 29~
> `----
>
> If I understand what is happening it seems to be doing all the opens
> before consuming them via md5. This works fine for small directories
> but for any practical setup this could potentially be very large. I
> tried forcing the md5 evaluation in the hope that the file descriptor
> will be freed once the entire file is read. That did not help, either
> because I could not get it right or there is something more subtle I
> am missing.
>
> I also had a look at the code in module "System.Directory.Tree" and
> although it gave me some understanding of how it works I am no closer
> to a solution.
>
> regards
> --
> Anand Mitra
>
>
> _______________________________________________
> Beginners mailing list
> Beginners at haskell.org
> http://www.haskell.org/mailman/listinfo/beginners
>
>