[Haskell-cafe] How can I improve the pipes's performance with a huge file?

Fri Nov 14 09:43:15 UTC 2014

Dear cafe,

I have two files. I want to zip them line by line into pairs, and then count how many times each pair occurs.
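
To make the goal concrete, here is a minimal sketch on made-up sample data. The values and the name `countPairs` are hypothetical and not taken from the real files. Line i of the first file is paired with line i of the second, and each distinct pair is counted.

```haskell
import qualified Data.Map as M

-- Pair up corresponding lines of the two inputs and count how often
-- each pair occurs. Fine for tiny inputs; not tuned for 40M rows.
countPairs :: [String] -> [String] -> M.Map (String, String) Int
countPairs xs ys = M.fromListWith (+) [ (pair, 1) | pair <- zip xs ys ]

main :: IO ()
main = do
  -- Made-up sample data standing in for the two input files.
  let clicks = ["0", "1", "0", "0"]
      hours  = ["10", "10", "10", "11"]
  -- prints [(("0","10"),2),(("0","11"),1),(("1","10"),1)]
  print (M.toList (countPairs clicks hours))
```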

Each file has more than 40M rows. I used pipes to write the code below.

When I test with 8768000 rows of input, it takes 30 seconds.
When I test with 18768000 rows of input, it takes 74 seconds.

But when I test with the whole file (40M rows), it runs for more than 20 minutes without finishing.
It uses more than 9 GB of memory, and the disk is busy the whole time.

The result will have fewer than 10k rows, so I have no idea why the memory usage is so large.

I have used visual-prof (http://hackage.haskell.org/package/visual-prof) to profile and improve the performance with the small file,
but I don't know how to deal with the "hang" situation.

Can anyone give me some help? Thanks.

===================================
import System.IO
import System.Environment
import Pipes
import qualified Pipes.Prelude as P
import qualified Data.Map as DM

-- Counts per (click, hour) pair, starting empty.
emptyMap :: DM.Map (String, String) Int
emptyMap = DM.empty

-- Open both input files and the output file, count the first num rows,
-- then close all handles.
keyCount :: Int -> IO ()
keyCount num = do
    readHandle1 <- openFile "dataByColumn/click" ReadMode
    readHandle2 <- openFile "dataByColumn/hour" ReadMode
    writeHandle <- openFile "output" AppendMode
    rCount num readHandle1 readHandle2 writeHandle
    hClose writeHandle
    hClose readHandle1
    hClose readHandle2

-- One output line per pair: both keys (via show, so quoted) and the count.
mapToString :: DM.Map (String, String) Int -> String
mapToString m = unlines $ map eachItem itemList
  where
    itemList = DM.toList m
    eachItem ((x, y), i) = show x ++ "," ++ show y ++ "," ++ show i

-- Zip the two files line by line, turn each pair into a singleton map,
-- and merge the singleton maps with unionWith (+).
rCount :: Int -> Handle -> Handle -> Handle -> IO ()
rCount num readHandle1 readHandle2 writeHandle = do
    rt <- P.fold (DM.unionWith (+)) emptyMap id $
        P.zipWith (\x y -> DM.singleton (x, y) 1) (P.fromHandle readHandle1) (P.fromHandle readHandle2) >-> P.take num
    hPutStr writeHandle $ mapToString rt

main :: IO ()
main = do
    s <- getArgs
    let num = (read . head) s
    keyCount num
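
A possible direction, offered as an untested guess rather than a measured fix: the fold above builds a one-element map per row and merges it with `unionWith (+)` from the lazy `Data.Map`, so the counts may pile up as unevaluated `(+)` chains. The sketch below counts with `insertWith` from `Data.Map.Strict`, which forces each count as it is updated. The names `Counts`, `step` and `countStrict` are hypothetical, and the list-based driver is only there to keep the example free of extra dependencies.

```haskell
import Data.List (foldl')
import qualified Data.Map.Strict as MS

type Counts = MS.Map (String, String) Int

-- One fold step: bump the count for a pair. Data.Map.Strict evaluates
-- the stored value to WHNF, so no chain of (+) thunks builds up.
step :: Counts -> (String, String) -> Counts
step m pair = MS.insertWith (+) pair 1 m

-- Strict left fold over a list of pairs.
countStrict :: [(String, String)] -> Counts
countStrict = foldl' step MS.empty

-- With pipes, the same step could presumably replace the unionWith fold
-- (an assumption, not measured on the 40M-row input):
--   P.fold step MS.empty id (P.zip producer1 producer2 >-> P.take num)

main :: IO ()
main =
  -- Made-up sample data; prints [(("0","10"),2),(("1","10"),1)]
  print (MS.toList (countStrict (zip ["0", "1", "0"] ["10", "10", "10"])))
```

Whether this removes the 9 GB memory growth is not verified here. Running the program with `+RTS -s` on the mid-sized inputs would show whether the maximum residency stops growing with the row count.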