[Haskell-cafe] Re: optimization help

Thu Oct 12 08:42:14 EDT 2006

>  I have been trying to do some CSV-style processing. My code works
> fine for small input (up to 10MB), but performs poorly for moderate to
> large input (it can't seem to finish 100MB of input with 700MB heap
> space). I have gone through several optimization passes with profiler
> help, and now I am hoping someone else can point out some other
> approaches to improving the code's performance (both space and time).
> 
>  The code breaks a large file into smaller files all of whose entries
> have the same date.

First of all, for this problem and 100 MB input, you have to think
carefully about what you do. I'll point out three quirks in your code
and afterwards discuss how a better solution looks like.

> module Main where
> 
> import Debug.Trace
> import Control.Monad
> import Data.List
> import qualified Data.ByteString.Lazy.Char8 as B
> import qualified Data.Map as M
> import System.Environment (getArgs)
> 
> dataDir = "dataH/"
> 
> myRead file = do
> v <- B.readFile file
> let (cols' : rows) = map (B.split ',') $ B.lines v
> let cols = foldl' (\mp (k,v) -> M.insert k v mp) M.empty (zip cols' [0 ..])
> return (cols, rows)
> 
> getColId cols col = M.lookup col cols
>
> getCol cols col row = do
> i <- getColId cols col
> return $! row!!i
>
> 
> dates file nRows = do
>   (cols, rows) <- myRead file
>   let addDate mp row | mp `seq` row `seq` False = undefined
>                                | otherwise = do

When using addDate in foldM like below, you certainly don't want to
search the cols for the string "Date" again and again everytime addDate
is called. The index of the "Date" field is a number determined when
parsing the header. That and only that number has to be plugged in here.
Thus the next line should read
            let date = row !! datefieldindex
instead of
>           date <- getCol cols (B.pack "\"Date\"") row

>           let old = M.findWithDefault [] date mp
>           return $ M.insert date (row:old) mp

The main thing in the code that makes me feel very very ill is the fact
that the code is quite "impure" (many many dos). The next line promptly
bites back:

>   res <- foldM addDate M.empty $ take nRows rows

Did you notice this appeal to addDate makes its callee getCol live in
the IO-Monad? From the use of M.lookup in getColId, I think you intended
to have getCol :: _ -> Maybe _, do you? M.lookup recently got the more
general type
   M.lookup :: Monad m => _ -> m a,
so it happily lives in IO.
I strongly suggest that you restructure your code and restrict IO to one
place only:

   main = do
     ..
     input <- B.readFile file
     let outs = busywork input
     mapM_ [writeFile name contents | (name,contents) <- outs]

where busywork does the work and is purely functional.

>   mapM_ writeDate $ M.toList res
> where
>   fmt = B.unpack . B.map (\x -> if x == '-' then '_' else x) .
> B.takeWhile (/= ' ')
>   writeDate (date,rows) =
>       B.writeFile (dataDir++fmt date)

The following line does unnecessary work: myRead splits a row to get
access to the date, but now you join it without having changed any
field. It would be wiser to split for the date but to keep an intact
copy of the line so that you can pass it here without join. This will
reduce memory footprint.

>                   (B.unlines $ map (B.join (B.pack ",")) rows)

>
> main = do
>   args <- getArgs
>   case args of
>     ["dates",file,nRows] -> dates file (read nRows)

To summarize, the code is not very clean and several things slipped in,
just as one would expect from an imperative style. The key is to
separate concerns, which means here: IO will just do very dumb in and
output, fetching the index of the "Date" from the header is handled
separately, grouping the lines by date is to be separated from the
to-be-output-contents of the lines.

Now, we'll think about how to solve the task in reasonable time and space.

Your current solutions reads the input and "calculates" all output files
before writing them to disk in a final step. This means that the
contents of the output files has to be kept in memory. Thus you need
least a constant * 100MB of memory. I don't know how ByteString
interacts with garbage collection, but it may well be that by keeping
the first line (you "cols") in memory, the entire input file contents is
also kept which means an additional constant * 100 MB. It is likely that
both can be shared if one resolves the code quirks mentioned above.

A better solution would be to begin output before the the whole input is
read, thus making things more lazy. This can be done the following way:
from the input, construct a lazy list of (date,line) pairs. Then, let
foldM thread a map from dates to corresponding output file pointers
through the list and, at the same time, use the file pointers to output
the line in question via appendFile. This way, every line consumed is
immediately dispatched to its corresponding output file and things
should only require memory for the different dates, besides buffering.

In a setting without IO, the task corresponds to the "Optimization
Problem" discussed at length in September on this list. The problem here
is that writeFile currently cannot be interleaved lazily, this has to be
simulated with appendFile. We can read files lazily but we cannot output
them lazily.
Can this be remedied? Can there be a version of writeFile which is, in a
sense, dual to getContents?

Regards,
apfelmus