[Haskell-cafe] Re: optimization help

Thu Oct 12 12:43:35 EDT 2006

Hello,

> When using addDate in foldM like below, you certainly don't want to
> search the cols for the string "Date" again and again everytime addDate
> is called. The index of the "Date" field is a number determined when
> parsing the header. That and only that number has to be plugged in here.
>
Good catch.

> The main thing in the code that makes me feel very very ill is the fact
> that the code is quite "impure" (many many dos). The next line promptly
> bites back:
>
> >   res <- foldM addDate M.empty $ take nRows rows
>
> Did you notice this appeal to addDate makes its callee getCol live in
> the IO-Monad? From the use of M.lookup in getColId, I think you intended
> to have getCol :: _ -> Maybe _, do you? M.lookup recently got the more
> general type
>    M.lookup :: Monad m => _ -> m a,
> so it happily lives in IO.
>
I was aware of this and counted on the lookup causing the program to
stop if the column didn't exist.

> The following line does unnecessary work: myRead splits a row to get
> access to the date, but now you join it without having changed any
> field. It would be wiser to split for the date but to keep an intact
> copy of the line so that you can pass it here without join. This will
> reduce memory footprint.
>
Another good observation which I missed.

> Your current solutions reads the input and "calculates" all output files
> before writing them to disk in a final step. This means that the
> contents of the output files has to be kept in memory. Thus you need
> least a constant * 100MB of memory. I don't know how ByteString
> interacts with garbage collection, but it may well be that by keeping
> the first line (you "cols") in memory, the entire input file contents is
> also kept which means an additional constant * 100 MB. It is likely that
> both can be shared if one resolves the code quirks mentioned above.
>
I intentionally chose this design to minimize the amount of file
access which seems to be quite slow (see below).

After fixing the two slips you pointed out, my code works as expected,
processing 100MB in about 1 minute using around 550MB of heap. Here is
the good version (where B is Data.ByteString.Lazy.Char8 and M is
Data.Map):

myRead file = do
  v <- B.readFile file
  let (cols' : rows) = B.lines v
      cols = foldl' (\mp (k,v) -> M.insert k v mp) M.empty $ zip
(B.split ',' cols') [0 ..]
  return (cols, rows)

dates file nRows = do
    (cols, rows) <- myRead file
    dateIx <- M.lookup (B.pack "\"Date\"") cols
    let addDate mp row = M.insert date (row:old) mp where
            date = (B.split ',' row)!!dateIx
            old = M.findWithDefault [] date mp
        res = foldl addDate M.empty $ take nRows rows
    mapM_ writeDate $ M.toList res
  where
    fmt = B.unpack . B.map (\x -> if x == '-' then '_' else x) .
B.takeWhile (/= ' ')
    writeDate (date,rows) = B.writeFile (dataDir++fmt date) (B.unlines rows)

> A better solution would be to begin output before the the whole input is
> read, thus making things more lazy. This can be done the following way:
> from the input, construct a lazy list of (date,line) pairs. Then, let
> foldM thread a map from dates to corresponding output file pointers
> through the list and, at the same time, use the file pointers to output
> the line in question via appendFile. This way, every line consumed is
> immediately dispatched to its corresponding output file and things
> should only require memory for the different dates, besides buffering.
>
I tried this approach previously and it seems to be unacceptably slow.
I thought the slowness was just due to the fact that file operations
are slow, but I'll include my code here (cleaned up to take some of
your previous comments into account) just in case there is something
subtle I'm missing which is slowing down the code (B, M, and myRead
are as above):

dates' file nRows = do
  (cols, rows) <- myRead file
  dateIx <- M.lookup cols $ B.pack "\"Date\""
  let writeDate row = B.appendFile (dataDir++fmt date) row where
          date = (B.split ',' row)!!dateIx
          fmt = B.unpack . B.map (\x -> if x == '-' then '_' else x) .
B.takeWhile (/= ' ')
  oldFiles <- getDirectoryContents dataDir
  mapM_ (\f -> catch (removeFile $ dataDir++f) $ const $ return ()) oldFiles
  mapM_ writeDate $ take nRows rows

This code takes over 20 minutes to process 100MB on my machine.

> In a setting without IO, the task corresponds to the "Optimization
> Problem" discussed at length in September on this list. The problem here
> is that writeFile currently cannot be interleaved lazily, this has to be
> simulated with appendFile. We can read files lazily but we cannot output
> them lazily.
> Can this be remedied? Can there be a version of writeFile which is, in a
> sense, dual to getContents?
>
Wouldn't this require blocking IO?

thanks for your help,
  Jeff