[Haskell-cafe] optimization help

jeff p mutjida at gmail.com
Wed Oct 11 20:20:09 EDT 2006


  I have been trying to do some CSV-style processing. My code works
fine for small input (up to 10MB), but performs poorly for moderate to
large input (it does not seem to finish on 100MB of input, even with
700MB of heap space). I have gone through several optimization passes
with the profiler's help, and now I am hoping someone else can point out
other approaches to improving the code's performance (both space and time).

  The code splits a large file into smaller files, one per date, so that
all entries in each output file have the same date.
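
  To make the intended behaviour concrete, here is a minimal sketch of
the grouping step on in-memory rows. It is not the program below:
groupByColumn is a hypothetical name, rows are plain String lists rather
than lazy ByteStrings, and every row is assumed to have the indexed column.

```haskell
import qualified Data.Map as M

-- Hypothetical helper (not part of the original program):
-- group rows by the field found at column index i.
-- Assumes every row has at least i+1 fields; (!!) fails otherwise.
groupByColumn :: Int -> [[String]] -> M.Map String [[String]]
groupByColumn i = foldr insertRow M.empty
  where
    -- insertWith calls (++) with the new value first, so together with
    -- foldr the rows under each key keep their input order.
    insertRow row = M.insertWith (++) (row !! i) [row]
```

  Each key of the resulting map corresponds to one output file, and the
list under it to that file's rows. Note that the sketch says nothing about
heap usage; it only pins down what the output should contain.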



module Main where

import Debug.Trace
import Control.Monad
import Data.List
import qualified Data.ByteString.Lazy.Char8 as B
import qualified Data.Map as M
import System.Environment (getArgs)

dataDir = "dataH/"

myRead file = do
 v <- B.readFile file
 let (cols' : rows) = map (B.split ',') $ B.lines v
 let cols = foldl' (\mp (k,v) -> M.insert k v mp) M.empty (zip cols' [0 ..])
 return (cols, rows)

getColId cols col = M.lookup col cols

getCol cols col row = do
 i <- getColId cols col
 return $! row!!i

dates file nRows = do
   (cols, rows) <- myRead file
   let addDate mp row | mp `seq` row `seq` False = undefined
                                | otherwise = do
           date <- getCol cols (B.pack "\"Date\"") row
           let old = M.findWithDefault [] date mp
           return $ M.insert date (row:old) mp
   res <- foldM addDate M.empty $ take nRows rows
   mapM_ writeDate $ M.toList res
 where
   fmt = B.unpack . B.map (\x -> if x == '-' then '_' else x) . B.takeWhile (/= ' ')
   writeDate (date,rows) =
       B.writeFile (dataDir++fmt date)
                   (B.unlines $ map (B.join (B.pack ",")) rows)

main = do
   args <- getArgs
   case args of
     ["dates",file,nRows] -> dates file (read nRows)
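
  For reference, the file name derived from each date can be checked in
isolation. The sketch below repeats the definition of fmt from the program
above under a hypothetical top-level name, dateToFileName. The sample date
values in the usage note are assumptions; the actual data format is not
shown here.

```haskell
import qualified Data.ByteString.Lazy.Char8 as B

-- Same body as `fmt` above, under a hypothetical top-level name:
-- keep everything before the first space, then replace '-' with '_'.
dateToFileName :: B.ByteString -> String
dateToFileName =
  B.unpack . B.map (\x -> if x == '-' then '_' else x) . B.takeWhile (/= ' ')
```

  Under these assumptions, dateToFileName (B.pack "2006-10-11 20:20:09")
gives "2006_10_11". If the date field is quoted in the same way as the
"Date" header, the leading quote character is kept in the file name.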
