[Haskell-cafe] operating on a hundred files at once

Duncan Coutts duncan.coutts at worc.ox.ac.uk
Mon Apr 9 23:16:50 EDT 2007


On Mon, 2007-04-09 at 21:24 -0400, Jefferson Heard wrote:
> It is indeed!  Is that to be found in Control.Monad, I take it?

Yes. Other common derivatives in that module include:

mapM f as = sequence (map f as)
mapM_ f as = sequence_ (map f as)

forM_ = flip mapM_
forM = flip mapM

however, personally, I think it's better style to use sequence/sequence_
directly since it shows more nicely how the IO is only going on at the
top level whereas forM_ starts to feel very imperative.

A nice style, imho, is to use sequence and list comprehensions, eg,
reformulating your original program:

main = do
  files <- getArgs
  sequence_
    [ print . map (mean . map read . split ',') . lines =<< readFile file
    | file <- files ]
  sequence_
    [ print . map (variance . map read . split ',') . lines =<< readFile file
    | file <- files ]

Note, that like in your original we read each file twice, once for the
mean and once for the variance. In your program you share the action to
do the reading, but not the result of the action, so you do execute
those file reading actions twice.

Let's try fixing that:

main = do
  files <- getArgs
  items <- sequence
    [ return . map (map read . split ',') . lines =<< readFile file
    | file <- files ]
  sequence_
    [ print (map mean item)
    | item <- items ]
  sequence_
    [ print (map variance item)
    | item <- items ]

The next problem to note is that we read all the files before
calculating the info we want from each file. So opening 100 files might
be ok but 100,000 might not, so lets refactor again:

main = do
  files <- getArgs
  mvs <- sequence
    [ return . calc . map (map read . split ',') . lines =<< readFile file
    | file <- files ]
  let (ms, vs) = unzip mvs
  mapM_ print ms
  mapM_ print vs

calc items = (map mean items, map variance items)

However this doesn't quite cut it either since nothing forces the mean
and variance to be calculated until the very end where they get printed.
This means the file will need to be held open until then too. We need to
force the calculation of the mean and variance which will consume the
file contents and allow the file to be closed.

We could do this by printing the mean and variance out earlier (ie as
soon as we've read the file) but that's changing the order of the lines
in the output of your program (you print all the means followed by all
the variance lines) so instead lets add a bit of strictness:

import Control.Exception (evaluate)

main = do
  files <- getArgs
  mvs <- sequence
    [ evaluate . calc . map (map read . split ',') . lines =<< readFile file
    | file <- files ]
  let (ms, vs) = unzip mvs
  mapM_ print ms
  mapM_ print vs

calc items = m `seq` v `seq` (m, v)
  where m = map mean items
        v = map variance items

We've changed two bits. We make sure that calc calculates the two values
before returning the pair and we force the evaluation of calc itself
using evaluate. Both are necessary. The evaluate action forces a value
as an IO action, so we can be sure about when the forcing happens in
relation to other IO actions.

Duncan



More information about the Haskell-Cafe mailing list