[Haskell-cafe] Strange memory consumption problems in something
that should be tail-recursive
Jefferson Heard
jeff at renci.org
Tue Feb 13 15:27:48 EST 2007
Hi, I am running the following code against a 210 MB file in an attempt to
determine whether I should use alex or whether, since my needs are very
performance oriented, I should write a lexer of my own. I thought that
everything I'd written here was tail-recursive, but after compiling this with
GHC 2.4.6, and running it, I eat up 2GB of RAM in less than a second. So
far, I have tried token and character oriented Parsec parsers and alex and
alex is winning by a factor of 2. I would like to be able to tokenize the
entirety of a 1TB collection in less than 36 hours on my current machine,
which is where alex has gotten me so far. Thanks in advance!
-- Jeff
---
module Main
where
import qualified FileReader
import qualified Data.Set as Set
punct = foldl (flip Set.insert) Set.empty "<,>.?/:;\"'{[}]|\\_-+=)
(*&^%$##@!~`"
stripTagOrComment [] = []
stripTagOrComment ('>':rest) = rest
stripTagOrCOmment (c:rest) = stripTagOrComment rest
pass1 :: String -> String -> String
pass1 left [] = left
pass1 left ('<':right) = pass1 left (stripTagOrComment right)
pass1 left (' ':right) = pass1 left right
pass1 left (c:right)
| Set.member c punct = pass1 (' ':c:' ':left) right
| otherwise = pass1 (c:left) right
pass2 :: [String] -> String -> Char -> String -> [String]
pass2 left word ' ' [] = word:left
pass2 left word c [] = (c:word):left
pass2 left word ' ' (' ':right) = pass2 left word ' ' right
pass2 left word ' ' (c:right) = pass2 (word:left) "" c right
pass2 left word l (c:right) = pass2 left (l:word) c right
tokenize = (pass2 [] "" ' ') . (pass1 [])
main = do
file <- do FileReader.trecReadFile "trecfile"
print (tokenize (head (tail file)))
-- print (length (map (runParser tokenizeDoc [] "") file))
More information about the Haskell-Cafe
mailing list