[Haskell-cafe] Strange memory consumption problems in something that should be tail-recursive

Kirsten Chevalier catamorphism at gmail.com
Tue Feb 13 15:36:10 EST 2007


[redirecting to ghc-users since this is a GHC question]

On 2/13/07, Jefferson Heard <jeff at renci.org> wrote:
> Hi, I am running the following code against a 210 MB file in an attempt to
> determine whether I should use alex or whether, since my needs are very
> performance oriented, I should write a lexer of my own.  I thought that
> everything I'd written here was tail-recursive, but after compiling this with
> GHC 2.4.6, and running it, I eat up 2GB of RAM in less than a second.  So
> far, I have tried token and character oriented Parsec parsers and alex and
> alex is winning by a factor of 2.  I would like to be able to tokenize the
> entirety of a 1TB collection in less than 36 hours on my current machine,
> which is where alex has gotten me so far.  Thanks in advance!
>
>  -- Jeff
>
> ---
>
> module Main
>     where
>
>
> import qualified FileReader
> import qualified Data.Set as Set
>
> punct = foldl (flip Set.insert) Set.empty "<,>.?/:;\"'{[}]|\\_-+=)
> (*&^%$##@!~`"
>
> stripTagOrComment [] = []
> stripTagOrComment ('>':rest) = rest
> stripTagOrCOmment (c:rest) = stripTagOrComment rest
>
> pass1 :: String -> String -> String
> pass1 left [] = left
> pass1 left ('<':right) = pass1 left (stripTagOrComment right)
> pass1 left (' ':right) = pass1 left right
> pass1 left (c:right)
>     | Set.member c punct = pass1 (' ':c:' ':left) right
>     | otherwise          = pass1 (c:left) right
>
>
> pass2 :: [String] -> String -> Char -> String -> [String]
> pass2 left word ' ' [] = word:left
> pass2 left word c [] = (c:word):left
> pass2 left word ' ' (' ':right) = pass2 left word ' ' right
> pass2 left word ' ' (c:right) = pass2 (word:left) "" c right
> pass2 left word l (c:right) = pass2 left (l:word) c right
>
> tokenize = (pass2 [] "" ' ') . (pass1 [])
>
> main = do
>   file <- do FileReader.trecReadFile "trecfile"
>   print (tokenize (head (tail file)))
>
>
> --  print (length (map (runParser tokenizeDoc [] "") file))

Have you tried profiling? (see section 5 of the GHC manual.) What's
your GHC command line? Tail-recursion in Haskell doesn't always work
the way you'd expect, but without profiling it's pretty hard to tell
what the problem is.

Cheers,
Kirsten

-- 
Kirsten Chevalier* chevalier at alum.wellesley.edu *Often in error, never in doubt
"Relax. I'm weird, not violent."--Brad Boesen, _Disturbed_


More information about the Glasgow-haskell-users mailing list