[Haskell-cafe] Re: optimization help

apfelmus at quantentunnel.de
Thu Oct 12 15:33:12 EDT 2006


>> A better solution would be to begin output before the whole input is
>> read, thus making things more lazy. This can be done the following way:
>> from the input, construct a lazy list of (date,line) pairs. Then, let
>> foldM thread a map from dates to corresponding output file pointers
>> through the list and, at the same time, use the file pointers to output
>> the line in question via appendFile. This way, every line consumed is
>> immediately dispatched to its corresponding output file and things
>> should only require memory for the different dates, besides buffering.
>>
> I tried this approach previously and it seems to be unacceptably slow.
> I thought the slowness was just due to the fact that file operations
> are slow, but I'll include my code here (cleaned up to take some of
> your previous comments into account) just in case there is something
> subtle I'm missing which is slowing down the code (B, M, and myRead
> are as above):
> 
> dates' file nRows = do
>  (cols, rows) <- myRead file
>  dateIx <- M.lookup cols $ B.pack "\"Date\""
>  let writeDate row = B.appendFile (dataDir++fmt date) row where
>          date = (B.split ',' row)!!dateIx
>          fmt = B.unpack . B.map (\x -> if x == '-' then '_' else x) .
>                B.takeWhile (/= ' ')
>  oldFiles <- getDirectoryContents dataDir
>  mapM_ (\f -> catch (removeFile $ dataDir++f) $ const $ return ()) oldFiles
>  mapM_ writeDate $ take nRows rows
> 
> This code takes over 20 minutes to process 100MB on my machine.

No wonder, as this opens and closes the file on every row. The operating
system will be kept quite busy that way! In some sense, you are
outsourcing the row-collecting M.Map to the OS... Of course, you want to
open the files once and dispatch the rows to the different open handles.

Here is a version (untested) which either does the read-all-then-write
approach (group'n'write) or keeps all output files open simultaneously
(group'n'write2). Note also that there is no need for an M.Map to find
the "Date" column in the CSV header (it even hurts performance, though
the effect is completely negligible).


import Control.Monad (foldM)
import Data.List
import qualified Data.ByteString.Char8 as B
import qualified Data.Map as M
import System.Environment (getArgs)
import System.IO

-- B is strict ByteStrings here; use Data.ByteString.Lazy.Char8 if that's
-- what you had. dataDir is the output directory, as in your code above.

main = do
  args <- getArgs
  case args of
    ["dates",file,nRows] -> dates file (read nRows)

dates file nRows =
    B.readFile file >>=
        group'n'write . sugarWithDates . take nRows . B.lines

sugarWithDates (header:rows) =
    map (\r -> ((B.split ',' r) !! dateIx, r)) rows
    where
    Just dateIx = Data.List.lookup (B.pack "\"Date\"") $
        zip (B.split ',' header) [0..]

formatDate    = B.unpack .
    B.map (\x -> if x == '-' then '_' else x) . B.takeWhile (/= ' ')
date2filename = (dataDir ++) . formatDate

-- read everything, collect the rows for each date in a M.Map, then write
group'n'write = mapM_ writeDate . M.toList . foldl' addDate M.empty
    where
    addDate mp (date,row) = M.insertWith (++) date [row] mp
    writeDate (date,rows) =
        -- the rows were prepended, so restore the original order
        B.writeFile (date2filename date) $ B.unlines (reverse rows)

-- alternative: open each output file once and write every row as it arrives
group'n'write2 xs =
    foldM addDate M.empty xs >>= mapM_ hClose . M.elems
    where
    addDate mp (date,row) = do
        (fp,mp') <- case M.lookup date mp of
            Just fp -> return (fp,mp)
            _       -> do
                fp <- openFile (date2filename date) WriteMode
                return (fp, M.insert date fp mp)
        B.hPutStrLn fp row
        return mp'
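
If you want to try the second variant, it drops in for the first one,
since both take the same list of (date,row) pairs (the name dates2 below
is only for illustration):

dates2 file nRows =
    B.readFile file >>=
        group'n'write2 . sugarWithDates . take nRows . B.lines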



The thing that bugs me is that one cannot separate
    group'n'write2 = write2 . group
where (group) is a pure function that does the grouping and write2 does
all the IO. I think some kind of lazy writeFile could allow this.
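
Just to make that concrete, the pure half is easy to write down on its
own (a minimal sketch, reusing the names from above; this is essentially
what group'n'write already does internally):

-- pure grouping: collect the rows belonging to each date
group :: [(B.ByteString, B.ByteString)] -> M.Map B.ByteString [B.ByteString]
group = foldl' addDate M.empty
    where addDate mp (date,row) = M.insertWith (++) date [row] mp

-- all the IO: write out one file per date
write2 :: M.Map B.ByteString [B.ByteString] -> IO ()
write2 = mapM_ write . M.toList
    where write (date,rows) =
            B.writeFile (date2filename date) $ B.unlines (reverse rows)

But write2 cannot start writing before group has consumed the whole
input, whereas group'n'write2 interleaves the writing with the grouping;
that is where a lazy writeFile would have to come in.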


> thanks for your help,
No problem. :)

Regards,
apfelmus


