[Haskell-cafe] How can I improve the pipes's performance with a huge file?
zhangjun.julian
zhangjun.julian at gmail.com
Fri Nov 14 09:43:15 UTC 2014
Dear cafe
I have 2 file, I want zip the 2 file as couple, and then count each couple's repeat times?
The file had more than 40M rows, I use pipe to write code as blow.
When I test with 8768000 rows input, it take 30 secs
When I test with 18768000 rows input, it take 74 secs
But when I test with whole file (40M rows), it take more than 20 minutes and not finished yet.
It take more than 9G memorys, and the disk is also busy all time.
The result will less than 10k rows, so I had no idea why the memory is so huge.
I had use the “http://hackage.haskell.org/package/visual-prof” to profile and improve the performance with the small file
But I don’t know how to deal with the “hang” situation.
Anyone can give me some help, Thanks.
===================================
import System.IO
import System.Environment
import Pipes
import qualified Pipes.Prelude as P
import qualified Data.Map as DM
import Data.List
emptyMap = DM.empty::(DM.Map (String,String) Int)
keyCount num = do
readHandle1 <- openFile "dataByColumn/click" ReadMode
readHandle2 <- openFile "dataByColumn/hour" ReadMode
writeHadle <- openFile "output" AppendMode
rCount num readHandle1 readHandle2 writeHadle
hClose writeHadle
hClose readHandle1
hClose readHandle2
mapToString::DM.Map (String,String) Int-> String
mapToString m = unlines $ map eachItem itemList
where
itemList = DM.toList m
eachItem ((x,y),i) = show x ++ "," ++ show y ++ "," ++ show i
--rCount::Int -> [String] -> Handle->Handle -> IO()
rCount num readHandle1 readHandle2 writeHadle = do
rt <- P.fold (\x y -> DM.unionWith (+) x y) emptyMap id $ P.zipWith (\x y -> DM.singleton (x,y) 1) (P.fromHandle readHandle1) (P.fromHandle readHandle2) >-> P.take num
hPutStr writeHadle $ mapToString rt
main = do
s<- getArgs
let num = (read . head) s
keyCount num
More information about the Haskell-Cafe
mailing list