apfelmus' interface degustation - System.FilePath.Find

Sun Jul 1 05:23:58 EDT 2007

Bryan O'Sullivan wrote:
>>  b) Making (FindClause a) a mere type synonym of (FileInfo -> a)
>>     has the benefit that the user can choose whether he wants to use
>>     monads or applicative functors via the respective instances
>>     or whether he does not.
> 
> That's where I had started out, as a matter of fact.

Yes. What I want to say is that you're still there, even if not
intended. Although FindClause is implemented as a state monad, the functions

 evalClause          :: FindClause a    -> (FileInfo -> a)
 (`liftM` fileInfo)  :: (FileInfo -> a) -> FindClause a

are isomorphisms. In other words, nothing is gained or lost compared to
(FileInfo -> a). Interestingly, these functions can be defined for any
state monad, but in general, their composition is not the identity. Here
it is because  put  is not available. Whether an abstract data type is
isomorphic to some other type depends on what functions are defined but
also on what functions are not defined.

>> or for general applicative functors as
>>
>>  (||) <$> ((== ".hs") <$> extension) <*> ((== ".lhs") <$> extension)
> 
> I don't find that very readable, I must confess.

Neither do I. It's probably due to (||) and (==) being infix operators,
it's quite readable for normal functions. I think it's indeed best to
have custom infix operators like (==?) and (||?).

Btw, liftOp works for any functor

  liftOp :: Functor f => (a -> b -> c) -> f a -> b -> f c
  liftOp = flip . (fmap .) . flip

>> Maybe unsafePerformIO is the best solution, because you may safely close
>> the file after having evaluated
>>
>>   predicate (unsafePerformIO $ hGetContents handle) (fileinfo)
>>
>> to weak head normal form, i.e. True or False. I think it fits perfectly.
> 
> In principle
> 
>   unsafeInterleaveIO $ readFile fileName
> 
> ought to be better, because it will not try to open the file unless the
> predicate actually inspects it, and opening files is expensive.  But it
> will also not close the file until a finalizer kicks in.  A tidier
> approach might be:
> 
>     maybeH <- newIORef Nothing
>     contents <- unsafeInterleaveIO $ do
>         h <- openFile fileName ReadMode
>         writeIORef maybeH (Just h)
>         hGetContents h
>     let result = predicate contents
>     result `seq` readIORef maybeH >>= maybe (return ()) hClose
> 
> That's a little cumbersome, but very appealing from the perspective of a
> user of the library.  Unfortunately, it looks to me like it's not very
> safe; see below.

Yes, this code looks best. And I think it's safe if the predicate may
only return unary constructors like True or False. I mean, evaluating
this to WHNF is like evaluating it to NF and every unused part of the
file can safely be discarded, since there are no unevaluated thunks
depending on the file contents anymore (and which are not garbage).

>> Using
>> System.FilePath.Find.fold gives you both file status and file path but
>> the ought-to-be equivalent approach of using foldl' on the list returned
>> by find only gives the file path but no file status. So, the suggestion
>> is to make find return a list of FileInfo
> 
> Let me pass an idea by you.  There's a problem with augmenting FileInfo
> to potentially cause IO to occur as above (both with your original
> suggestion and my changes), no?  We lose the ability to control when a
> file might be opened, and when it ought to be closed.

I agree, augmenting FileInfo with actual file contents is probably not
advisable, although it would be very functional in style. Better ditch
the idea for now.

However, we probably could safely salvage two lazy reading operations:
 1) reading the FileStatus record of a file
 2) reading file contents if to be used in a True/False predicate
The common pattern is that the information extracted is small enough to
be deepSeq'ed, so that the file can be closed early once the lazy read
is triggered.

> If that were the case, the fold interface would have the same problem,
> if it decided for some reason to sock away FileInfo values in an
> accumulator and work on them after the fold had completed.

(The rank-2 trick can prevent FileInfos to leak out into the result of
the fold

 fold :: (forall s . a -> FileInfo s -> a) -> FilePath -> IO a

but that doesn't help very much as long as one would want  a  to be able
to depend on the file contents via one of

 contents     :: forall s . FileInfo s -> String
 withContents :: forall s . FileInfo s -> (String -> a) -> a

so this is not an option.)

Concerning fold, I think that it is the very core of the directory
traversal algorithm: traversing a tree is a catamorphism. In other
words, every other traversal can be formulated with a fold. For
instance, we currently almost have

  find r p dir = fold r (flip $ (:) . infoPath) [] dir

except that  find  returns a lazy list whereas  fold  traverses the
directory tree before returning the list.

But fold is currently not the general catamorphism, which is

  fold :: (FileInfo -> [a] -> a) -> (FileInfo -> a) -> FilePath -> IO a

(I have been tricked by the name 'Foldable', this class does not give
the catamorphism I had in mind.) This fold has to be lazy to be useful
since only this allows to skip hole subdirectories. With the
catamorphism, we can express  find  and the  old fold:

  toList r p = fold branch leaf
    where
    branch dir xs = if r dir  then concat xs else []
    leaf   file   = if p file then [file] else []

  find r p = map infoPath . toList r p

  oldfold r f x root =
    listSeq `fmap` foldl' f x `fmap` toList r (const True) root
    where
    listSeq xs = length xs `seq` xs

(Btw, does the old fold only fold the value of type  a  over regualar
files or also over directories?)

Being able to return FileInfos captures reading operation 1) but misses
the opportunity 2). However, 2) can simply be supplied as an extra function

  contentsSatisfies :: (String -> Bool) -> FilePath -> Bool
  contentsSatisfies f file = unsafePerformIO $ do
     h <- openFile file ReadMode
     b <- f `fmap` hGetContents h
     b `seq` hClose h
     return b

to be used at the users discretion.

>> Of course, this leads to the question whether  find  should be factored
>> even more into generate & prune
>>
>>   find r p dir = map filepath . filter p . directoryWalk r dir
>>
>> with the intention that System.FilePath.Find only exports directoryWalk.
> 
> That's a nice idea, but subject to the same problems as the fold
> interface would be if we added file contents to the interface.
>
> Your other observations regarding making a directory tree abstract, and
> instances of some of our favourite typeclasses, are very appealing.

As soon as we supply a (lazy) catamorphism like above, the user can make
a concrete directory tree

 data File = Directory FileInfo [File] | File FileInfo

 (fold Directory File) :: FilePath -> IO File

and the only way to keep the type abstract is to only supply weaker
stuff than fold. Which is probably too weak then.

Regards,
apfelmus