[Haskell-beginners] Processing a list of files the Haskell way

Chaddaï Fouché chaddai.fouche at gmail.com
Tue Mar 13 19:47:19 CET 2012


On Tue, Mar 13, 2012 at 4:34 PM, Michael Schober <Micha-Schober at web.de> wrote:
> Hi again,
>
> I'm still trying to figure out my earlier problem regarding the directory
> tree and functions on the files.
>
> What I'm really trying to do sounds simple enough: find duplicate files for
> a complete directory, given a root. The original approach I had in mind was
> to compute every file's checksum, which I could feed into a map of
> (checksums,[filepaths]).

Is there a problem with the solution I gave you ?


> On 03/10/2012 01:48 PM, Chaddaï Fouché wrote:
>>
>> Note that you solution isn't very "functional-like", but rather
>> imperative. On the other hand, making it more functional in this
>> particular case come with its own brand of subtle difficulties.
>
>
> I now wonder how to achieve my goal properly in Haskell. Obviously I have
> some sequential parts of my program (reading directory trees is I/O and
> therefore monadic), but it seems to me that could also be some functional
> part as well (pairing files with checksums, pairing checksums with common
> filepaths, filtering only checksums with more than one filepath, maybe even
> sorting, etc.).
>
> How would you go about to do it? Which parts of the program should be
> monadic, which functional? Would you create a custom-build monad or would
> you use build-ins (which ones?)?
>
> I would really like to have some pointers how an experienced Haskell
> developer would go about such a thing, since I want to acquire a Haskell
> feel of programming rather than keep on programming in my old ways, now in
> Haskell. I would also be happy to be simply pinpointed to other applications
> which source code are exemplary, but not too huge to dig through in a
> sensible amount of time. (I'm currently trying to work through the Darcs
> source code, but that seems a bit too overkill.)
>

An "experienced" Haskell programmer would probably reuse some of the
"streaming" solutions on Hackage, combine it with md5 and a map :

> module Main where
>
> import Data.Conduit.Filesystem (traverse)
> import qualified Data.Conduit.List as CL
> import Data.Conduit
>
> import Data.Digest.Pure.MD5 (MD5Digest)
> import Crypto.Conduit (hashFile)
>
> import qualified Data.Map as M
> import qualified Filesystem.Path.CurrentOS as FP
> import System.Environment
>
> duplicates :: FilePath -> IO [[FilePath]]
> duplicates dir = runResourceT $ do
>   md5s <- traverse False (FP.decodeString dir) $$ CL.mapM process =$ CL.fold buildMap M.empty
>   return . M.elems . M.filter ((>1).length) $ md5s
>     where
>       process :: FP.FilePath -> IO (MD5Digest, FilePath)
>       process fp = do
>         let strFp = FP.encodeString fp
>         md5 <- hashFile strFp
>         return (md5,strFp)
>       buildMap m (md5,fp) = M.insertWith' (flip (++)) md5 [fp] m
>
> main = do
>   [dir] <- getArgs
>   print =<< duplicates dir

This may be a bit too much and there's a bit of "noise" introduced by
the two FilePath type (the standard one which is a synonym for String,
and the one used in Filesystem.Path and filesystem-conduit which was
introduced to better cope with some exotic filepath and encoding in
filesystems), on the other hand, this is not too inefficient (though
it may be improved by using another structure than Map, and maybe by
using nano-md5 rather than pureMD5) and the memory used is only
proportional to the number of files in the directory which should be
ok.

-- 
Jedaï



More information about the Beginners mailing list