[Haskell-cafe] Space leak - help needed

Krzysztof Kościuszkiewicz k.kosciuszkiewicz+haskell at gmail.com
Sun Mar 2 21:23:19 EST 2008


Dear Haskellers,

Another story from an (almost) happy Haskell user that finds himself
overwhelmed by laziness/space leaks.

I'm trying to parse a large file (>600MB) with a single S-expression
like structure. With the help of ByteStrings I'm down to 4min processing
time in constant space. However, when I try to wrap the parse results
in a data structure, the heap blows up - even though I never actually
inspect the structure being built! This bugs me, so I come here looking
for answers.

Parser follows:

> module Main where
> 
> import qualified Data.ByteString.Lazy.Char8 as B
> import Text.ParserCombinators.Parsec
> import Text.ParserCombinators.Parsec.Pos
> import System.Environment
> import System.Exit
> import qualified Data.Map as M
> import Lexer
> 
> type XdlParser a = GenParser Token XdlState a
> 
> -- Parser state
> type XdlState = Counts
>     
> type Counts = M.Map Count Integer
> 
> data Count = ListCount | SymbolCount
>     deriving (Eq, Ord, Show)
> 
> emptyXdlState = M.empty
> 
> incCount :: Count -> (Counts -> Counts)
> incCount c = M.insertWith' (+) c 1
> 
> -- handling tokens
> myToken      :: (Token -> Maybe a) -> XdlParser a
> myToken test  = token showTok posTok testTok
>     where
>         showTok = show
>         posTok  = const (initialPos "")
>         testTok = test
> 
> -- Syntax of expressions
> data Exp = Sym !B.ByteString | List ![Exp]
>     deriving (Eq, Show)
> 
> expr =  list <|> symbol
> 
> rparen = myToken $ \t -> case t of
>             RParen  -> Just ()
>             other   -> Nothing
> 
> lparen = myToken $ \t -> case t of
>             LParen  -> Just ()
>             other   -> Nothing
> 
> name = myToken $ \t -> case t of
>             Name n -> Just n
>             other  -> Nothing
> 
> list = do
>     updateState $ incCount ListCount
>     lparen
>     xs <- many1 expr
>     rparen
>     return ()
> --  return $! (List xs)
> 
> symbol = do
>     updateState $ incCount SymbolCount
>     name >> return ()
> --  Sym `fmap` name
> 
> -- Top level parser
> top :: XdlParser XdlState
> top =  do
>     l <- many1 list
>     eof
>     getState
> 
> main = do
>     args <- getArgs
>     case args of
>         [fname] -> do
>             text <- B.readFile fname
>             let result = runParser top emptyXdlState fname (tokenize text)
>             putStrLn $ either show show result
>         _ -> putStrLn "usage: parse filename" >> exitFailure

And the Lexer:

> module Lexer (Token(..), tokenize) where
> 
> import qualified Data.ByteString.Lazy.Char8 as B
> import Control.Monad
> import Data.Char
> import Data.List
> import System.Environment
> import System.Exit
> 
> data Token = LParen
>            | RParen
>            | Name B.ByteString
>     deriving (Ord, Eq, Show)
> 
> type Input = B.ByteString
> 
> -- Processor returns Nothing if it can't process the Input
> type Processor = Input -> Maybe ([Token], Input)
> 
> -- Tokenize ends the list when all processors return Nothing
> tokenize :: Input -> [Token]
> tokenize  = concat . unfoldr processAll
>     where
>         processors    = [doSpaces, doComment, doParens, doName]
>         processAll   :: Processor
>         processAll bs = if B.null bs 
>                             then Nothing
>                             else foldr mminus Nothing $ map ($ bs) processors
>         mminus a@(Just _) _ = a
>         mminus Nothing    b = b
> 
> doSpaces    :: Processor
> doSpaces bs =
>         if B.null sp
>             then Nothing
>             else Just ([], nsp)
>     where
>         (sp, nsp) = B.span isSpace bs
> 
> doComment    :: Processor
> doComment bs =
>     if B.pack "# " `B.isPrefixOf` bs
>         then Just ([], B.dropWhile (/= '\n') bs)
>         else Nothing
> 
> doParens     :: Processor
> doParens bs  = case B.head bs of
>     '(' -> Just ([LParen], B.tail bs)
>     ')' -> Just ([RParen], B.tail bs)
>     _   -> Nothing
> 
> doName       :: Processor
> doName  bs   =
>         if B.null nsp
>             then Nothing
>             else Just ([Name nsp], sp)
>     where
>         (nsp, sp) = B.span (not . isRest) bs
>         isRest c = isSpace c || c == ')' || c == '('

Regards,
-- 
Krzysztof Kościuszkiewicz
Skype: dr.vee,  Gadu: 111851,  Jabber: kokr at jabberpl.org
"Simplicity is the ultimate sophistication" -- Leonardo da Vinci


More information about the Haskell-Cafe mailing list