[Haskell-cafe] Speedy parsing

Re, Joseph (IT) Joseph.Re at MorganStanley.com
Thu Jul 19 18:58:05 EDT 2007


I was hoping someone could direct me to material on how I might go about
optimizing a small parsing program I wrote.  It's used for calculating
differences between two files that store a hash of hashes (each line is
a hash, with "key=value" pairs separated by commas), and the bottleneck
seems to be getting the data into memory (in a usable structure) quickly,
as I'm working with extremely large files (250,000+ line, 100MB+ files
are normal, some are considerably larger, and I have to process a large
number of them often).
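
To give a rough idea of the format (the field names other than LogOffset
and Type are just made up here for illustration), a line looks something
like:

Type=Trade,LogOffset=1042,User=jre,Qty=100,Px=25.17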
 
Extracting just the parsing elements of my current Python
implementation:
FILE: compare.py
> import sys
> def parse(lines):
>    print "Parsing",len(lines),"lines..."
>    failed = 0
>    hashes = {}
>    for line in lines:
>       try:
>          hash = {}
>          parts = line.split(",")
>          for part in parts:
>             k,v = part.split("=")
>             hash[k] = v
>          hashes[ hash["LogOffset"] ] = hash
>       except:
>          if line != "":
>             failed += 1
>    if failed > 0:
>       print "[ERROR] Failed to parse:",failed,"lines"
>    print "...Parsing resulted in",len(hashes),"unique hashes"
>    return hashes
> 
> def normalize(lines):
>    lineset = set()
>    for line in lines:
>       if "Type=Heartbeat" == line[0:14]: pass
>       elif line == "": pass
>       else: lineset.add( line ) #use set to get only uniques
>    return lineset
> 
> def main():
>    hashes = parse( normalize ( open(sys.argv[1]).readlines() ))
> 
> if __name__ == '__main__': main()

$ time python compare.py 38807lineFile
Removed 52 bad lines
Parsing 38807 lines...
...Parsing resulted in 38807 unique hashes

real    0m3.807s
user    0m3.330s
sys     0m0.470s

$ time python compare.py 255919lineFile
Removed 0 bad lines
Parsing 255919 lines...
...Parsing resulted in 255868 unique hashes

real    0m30.889s
user    0m23.970s
sys     0m2.900s

*note: profiling shows over 7.1 million calls to the split() function,
to give you a good idea of the number of pairs the file contains.

Once you factor in the much larger file sizes, actually performing the
analysis, and running it on a few dozen files, my tests started to
become quite time consuming (not to mention it takes 1GB of memory for
just the 250K line files, although execution time is still much more
important at the moment, thanks to RAM being cheap).

Thus, I figured I'd rewrite it in C, but first I wanted to give
Haskell a shot, if only to see how it compared to Python (hoping maybe I
could convince my boss to let me use it more often if the results were
good).  The first thing I wanted to check was whether Parsec was a viable
option.  Without even putting the data into lists/maps, I found it too
slow.

FILE: compare_parsec.hs
> {-# OPTIONS_GHC -O2 #-}
> module Main where
> import System.Environment (getArgs)
> import Text.ParserCombinators.Parsec
>
> csv = do x <- record `sepEndBy` many1 (oneOf "\n")
>          eof
>          return x
> record = field `sepBy` char ','
> field = many (noneOf ",\n")  -- a field is anything up to a ',' or newline
>
> main = do
>    ~[filename] <- getArgs
>    putStrLn "Parsing log..."
>    res <- parseFromFile csv filename
>    case res of
>       Left err -> print err
>       Right xs -> putStrLn "...Success"

$ time ./compare_parsec 38807lineFile
Parsing log...
...Success

real    0m13.809s
user    0m11.560s
sys     0m2.180s

$ time ./compare_parsec 255919lineFile
Parsing log...
...Success

real    1m28.983s
user    1m8.480s
sys     0m9.530s

This, sadly, is significantly worse than the Python code above.  Perhaps
someone here can offer advice on more efficient use of Parsec?
Unfortunately I don't have profiling libraries for Parsec available on
this machine, nor have I had any luck finding material on the web.

After this, I tried doing my own parsing, since the format is strict and
regular.  I tried pushing it into lists (fast, but not very usable) and
maps (much easier for the analysis stage, and what the Python code does,
but much worse speed-wise).

FILE: compare_lists.hs
> {-# OPTIONS_GHC -O2 -fglasgow-exts #-}
> module Main where
> import System.Environment (getArgs)
> type Field = (String,String)
> type Record = [Field]
> type Log = [Record]
>
> main = do
>    ~[filename1] <- getArgs
>    file1 <- readFile filename1
>    putStrLn "Parsing file1..."
>    let data1 = parseLog file1
>    print data1
>    putStrLn "...Done"
>
> -- parse file
> parseLog :: String -> Log
> parseLog log = foldr f [] (lines log)
>    where f "" a = a
>          f "\n" a = a
>          f x a = (parseRecord x):a
> -- parse record
> parseRecord :: String -> Record
> parseRecord record = foldr (\x a -> (parseField x):a) [] (split ',' record)
> -- parse field
> -- no error detection/handling now
> parseField :: String -> Field
> parseField s = (takeWhile isntCharEq s, tail $ dropWhile isntCharEq s)
>
> isntCharEq :: Char -> Bool
> isntCharEq '=' = False
> isntCharEq _ = True
>
> split :: Eq a => a -> [a] -> [[a]]
> split delim = foldr f [[]]
>    where
>       f x rest@(r:rs)
>         | x == delim = [] : rest
>         | otherwise = (x:r) : rs

I wasn't sure of the best way to force evaluation on this, so I opted to
print it and redirect stdout to /dev/null (and noted how long a program
that just read and printed the same file to /dev/null took as well).  Any
better suggestions are welcome.
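
For example, a small summary that touches every field would also force the
whole structure without the cost of printing it; something like this
(forceLog is just a made-up helper, and the rest reuses parseLog and the
types from compare_lists.hs above, replacing its main) is the kind of
thing I mean:

> -- rough sketch only: demand every field, then print a tiny summary
> forceLog :: Log -> Int
> forceLog = sum . map (sum . map (\(k,v) -> length k + length v))
>
> main = do
>    ~[filename] <- getArgs
>    file <- readFile filename
>    print (forceLog (parseLog file))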

$ time ./compare_lists 38807lineFile > /dev/null
real    0m4.958s (0m0.445s for the putStrLn $ readFile program)
user    0m4.830s
sys     0m0.110s

$ time ./compare_lists 255919lineFile > /dev/null
real    0m31.047s (0m2.620s for the putStrLn $ readFile program)
user    0m30.310s
sys     0m0.690s

These speeds are roughly the same as Python's, but this time the data is
in an easily usable form by comparison, so finally I implemented the
useful, Map-based version:

FILE: compare_maps.hs
> {-# OPTIONS_GHC -O2 -fglasgow-exts #-}
> module Main where
> import qualified Data.Map as Map
> import System.Environment (getArgs)
> import Control.Monad
>
> type Field = (String,String)
> type Record = Map.Map String String
> type Log = Map.Map String Record
>
> main = do
>    ~[filename] <- getArgs
>    d <- liftM parseLog $ readFile filename
>    putStr $ strMaps d
> -- parse file
> parseLog :: String -> Log
> parseLog logstr = foldr f Map.empty (lines logstr)
>    where f "" a = a
>          f "\n" a = a
>          f x a = let (k,v) = parseRecord1 x
>                  in Map.insert k v a
> -- parse record line
> parseRecord1 :: String -> (String,Record)
> parseRecord1 recordstr = let r = parseRecord recordstr
>                          in (idRecord r, r)
>
> -- extract the unique id used for comparison among
> -- line hashes in other files
> idRecord :: Record -> String
> idRecord r = case (Map.lookup "LogID" r) of
>                Nothing -> "ERROR"
>                Just id -> id
>
> parseRecord :: String -> Record
> parseRecord recordstr = foldr f Map.empty (split ',' recordstr)
>    where f x a = case parseField x of
>                   Nothing -> a
>                   Just (k,v) -> Map.insert k v a
> -- parse k=v fields
> parseField :: String -> Maybe Field
> parseField "" = Nothing
> parseField s = Just (takeWhile isntCharEq s, tail $ dropWhile isntCharEq s)
>
> isntCharEq :: Char -> Bool
> isntCharEq '=' = False
> isntCharEq _ = True
> -- map functions
> strMaps = Map.fold (\x a -> (strMap x) ++ "\n" ++ a) ""
> strMap = unlines . map (\(k,v) -> k ++ ":" ++ v) . Map.toAscList
> -- list functions
> split :: Eq a => a -> [a] -> [[a]]
> split delim = foldr f [[]]
>    where
>       f x rest@(r:rs)
>         | x == delim = [] : rest
>         | otherwise = (x:r) : rs

$ time ./compare_maps 38807lineFile > /dev/null
real    0m24.207s (0m0.445s for the putStrLn $ readFile program)
user    0m22.430s
sys     0m1.680s

$ time ./compare_maps 255919lineFile > /dev/null
real    5m24.927s (0m2.620s for the putStrLn $ readFile program)
user    5m5.240s
sys     0m12.770s
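
(For context, the analysis stage that all of this feeds, which I've left
out of the code above, boils down to little more than map differences over
these Logs; the sketch below is only to illustrate the idea, it is not the
real code.)

> -- illustration only, not the real analysis code: records whose key
> -- shows up in one file's Log but not in the other's
> diffLogs :: Log -> Log -> (Log, Log)
> diffLogs a b = (Map.difference a b, Map.difference b a)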

Until this point I had only been throwing on -O2.  I believe -fvia-C is
included with -O currently, and I'm not well versed enough to know which
other flags might help, so I left it at that.
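
For reference, each program was built with something along the lines of:

$ ghc --make -O2 compare_maps.hs -o compare_maps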

Next I tried checking out what the garbage collector was doing (I assume
the footprint is smaller than Python's, but it would still be quite
large).  Flipping on -Sstderr I saw that only about a third of the time
was spent doing real work (67.9% of it went to GC).
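
That -Sstderr output came from passing the flag through to the runtime,
something like:

$ ./compare_maps 38807lineFile +RTS -Sstderr -RTS > /dev/null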

For the 38K file:
  2,794,465,788 bytes allocated in the heap
  1,956,668,360 bytes copied during GC (scavenged)
     21,884,948 bytes copied during GC (not scavenged)
    185,541,764 bytes maximum residency (16 sample(s))
           5329 collections in generation 0 (  9.03s)
             16 collections in generation 1 (  5.99s)
            523 Mb total memory in use
  MUT   time    7.10s  (  7.80s elapsed)
  GC    time   15.02s  ( 16.68s elapsed)
  Total time   22.12s  ( 24.48s elapsed)
  %GC time      67.9%  (68.1% elapsed)

For the 250K file:
 17,509,820,952 bytes allocated in the heap
 12,270,870,276 bytes copied during GC (scavenged)
    136,831,020 bytes copied during GC (not scavenged)
  1,307,822,788 bytes maximum residency (21 sample(s))
          33390 collections in generation 0 (121.65s)
             21 collections in generation 1 ( 42.37s)
           3244 Mb total memory in use
  MUT   time  146.25s  (151.51s elapsed)
  GC    time  164.02s  (172.73s elapsed)
  Total time  310.27s  (324.24s elapsed)
  %GC time      52.9%  (53.3% elapsed)

Not being particularly knowledgeable in this area, I tried some options I
thought were reasonable after reading the GC section of the GHC users
guide's runtime-control page
(http://www.haskell.org/ghc/docs/latest/html/users_guide/runtime-control.html#rts-options-gc).
The following runs are all with the small (38K) file.
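
Each option below was passed the same way as above, e.g. +RTS -H512m -RTS
or +RTS -A1m -RTS.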

Running with a larger heap (-H) didn't seem to help:
256m 35s
512m 37s
768m 37s
1g   29s

Playing with the allocation area size (-A) didn't help either:
128k 29s
256k 25s (default)
512k 23s
768k 25s
1m   23s
2m   32s
5m   32s
10m  32s

The 250K file does show some improvement from the little bit I've played
around with:

defaults    -> 5m24s, 33.3K collections, 53% GC time
-H1g        -> 4m53s, 16.8K collections, 56% GC time
-H3g -A5m   -> 4m16s,  3500 collections, 52% GC time
-H4g -A50m  -> 4m26s,   350 collections, 57% GC time

Even the best combinations (-H3g -A5m and -H4g -A50m) still spend over
half their time in GC, so I'll try to run some more tests with different
options, but this still isn't even remotely close to Python's speed, so I
assume improving the code would prove more fruitful.

At this point I'm out of ideas, so I was hoping someone could identify
something stupid I've done (I'm still a novice at FP in general, let alone
high-performance FP) or direct me to a guide, website, paper, library, or
some other form of help.

For reference, the tests above were done on an 8-processor (possibly 4
real but hyperthreaded?) 2.8GHz Xeon workstation with 16GB of memory,
running an SMP-enabled 2.4.21-47 Linux kernel and GHC 6.6.1.

Thanks

-- Joseph Re