[Haskell-cafe] Strange memory consumption problems in something that should be tail-recursive

Tue Feb 13 15:27:48 EST 2007

Hi, I am running the following code against a 210 MB file in an attempt to 
determine whether I should use alex or whether, since my needs are very 
performance oriented, I should write a lexer of my own.  I thought that 
everything I'd written here was tail-recursive, but after compiling this with 
GHC 2.4.6, and running it, I eat up 2GB of RAM in less than a second.  So 
far, I have tried token and character oriented Parsec parsers and alex and 
alex is winning by a factor of 2.  I would like to be able to tokenize the 
entirety of a 1TB collection in less than 36 hours on my current machine, 
which is where alex has gotten me so far.  Thanks in advance!

 -- Jeff

---

module Main 
    where

import qualified FileReader
import qualified Data.Set as Set

punct = foldl (flip Set.insert) Set.empty "<,>.?/:;\"'{[}]|\\_-+=)
(*&^%$##@!~`"

stripTagOrComment [] = []
stripTagOrComment ('>':rest) = rest
stripTagOrCOmment (c:rest) = stripTagOrComment rest

pass1 :: String -> String -> String
pass1 left [] = left
pass1 left ('<':right) = pass1 left (stripTagOrComment right)
pass1 left (' ':right) = pass1 left right
pass1 left (c:right) 
    | Set.member c punct = pass1 (' ':c:' ':left) right
    | otherwise          = pass1 (c:left) right

pass2 :: [String] -> String -> Char -> String -> [String]
pass2 left word ' ' [] = word:left
pass2 left word c [] = (c:word):left
pass2 left word ' ' (' ':right) = pass2 left word ' ' right
pass2 left word ' ' (c:right) = pass2 (word:left) "" c right
pass2 left word l (c:right) = pass2 left (l:word) c right

tokenize = (pass2 [] "" ' ') . (pass1 [])

main = do
  file <- do FileReader.trecReadFile "trecfile"
  print (tokenize (head (tail file))) 

--  print (length (map (runParser tokenizeDoc [] "") file))