[Haskell-cafe] Difference between Lazy ByteStrings and Strings in
alex
Stefan O'Rear
stefanor at cox.net
Tue Feb 13 22:54:56 EST 2007
On Tue, Feb 13, 2007 at 10:43:11PM -0500, Jefferson Heard wrote:
> I am running GHC 2.6 now, and am using -O3 as my optimization parameter. I'm
I think you will get much better performance with GHC 6.6. The optimizer has been
improved a *lot* in the last 10 years.
(I hope that was a typo!!)
> Non-lazy version
>
> {
> module Main
> where
>
> import qualified FileReader
>
> }
>
> %wrapper "basic"
>
> $letter = [a-zA-Z]
> $digit = 0-9
> $alphanum = [a-zA-Z0-9]
> $punct = [\! \@ \# \$ \% \^ \& \* \( \) \_ \- \+ \= \{ \[ \} \] \\ \| \; \: \'
> \" \, \. \? \/ \` \~]
> $dec = \.
> $posneg = [\- \+]
>
> @date1 = jan($punct|uary)?\ $digit{1,2}(\,\ $digit{2,4})?
> | feb($punct|ruary)?\ $digit{1,2}(\,\ $digit{2,4})?
> | mar($punct|ch)?\ $digit{1,2}(\,\ $digit{2,4})?
> | apr($punct|il)?\ $digit{1,2}(\,\ $digit{2,4})?
> | may?\ $digit{1,2}(\,\ $digit{2,4})?
> | jun($punct|e)?\ $digit{1,2}(\,\ $digit{2,4})?
> | jul($punct|y)?\ $digit{1,2}(\,\ $digit{2,4})?
> | aug($punct|ust)?\ $digit{1,2}(\,\ $digit{2,4})?
> | sep($punct|tember)?\ $digit{1,2}(\,\ $digit{2,4})?
> | sept($punct)?\ $digit{1,2}(\,\ $digit{2,4})?
> | oct($punct|ober)?\ $digit{1,2}(\,\ $digit{2,4})?
> | nov($punct|ember)?\ $digit{1,2}(\,\ $digit{2,4})?
> | dec($punct|ember)?\ $digit{1,2}(\,\ $digit{2,4})?
>
> @date2 = $digit{1,2} $punct $digit{1,2} $punct $digit{2,4}
>
> @time = $digit{1,2} \: $digit{2} (am|pm)?
>
> @word = $alphanum+
>
> @number = $posneg? $digit+
> | $posneg? $digit+ $dec $digit+
> | $posneg? $digit+ (\,$digit{3})+
> | $posneg? $digit? (\,$digit{3})+ $dec $digit+
>
> $white = [\t\r\n\v\f\ ]
>
> @doc = \< DOC \>
> @tag = \< $alphanum+ \>
> | \<\/ $alphanum+ \>
>
> tokens :-
> @doc { \s -> "" }
> @tag ;
> $white+ ;
> @time { \s -> s }
> @number { \s -> s }
> @word { \s -> s }
> $punct ;
> . ;
>
> {
>
> printCount c [] = print c
> printCount c (l:ls) = if l == "" then printCount (c+1) ls else printCount c ls
>
> main = do
> file <- readFile "trecfile1"
> printCount 0 (alexScanTokens file)
>
> }
FTR, regular strings are lazy - too lazy, which is where the performance problems come from.
> -- ------------------------------------------------------------------------------------------------------------
> Version depending on ByteString.Lazy -- note that the grammar is the same, so
> it has been omitted
> -- ------------------------------------------------------------------------------------------------------------
>
> ... grammar ...
>
> {
> type AlexInput = (Char, -- previous char
> B.ByteString) -- current input string
>
> takebytes :: Int -> B.ByteString -> String
> takebytes (0) _ = ""
> takebytes n s = c : takebytes (n-1) cs
> where c = B.index s 0
> cs = B.drop 1 s
>
> alexGetChar :: AlexInput -> Maybe (Char,AlexInput)
> alexGetChar (_, bytestring)
> | bytestring == B.empty = Nothing
> | otherwise = Just (c , (c,cs))
> where c = B.index bytestring 0
> cs = B.drop 1 bytestring
Hm, you might do better with more specialized functions.
> alexGetChar :: AlexInput -> Maybe (Char,AlexInput)
> alexGetChar (_, bytestring)
> | B.null bytestring = Nothing
> | otherwise = Just (c , (c,cs))
> where c = B.head bytestring
> cs = B.tail bytestring
or even:
> alexGetChar :: AlexInput -> Maybe (Char,AlexInput)
> alexGetChar (_, bytestring)
> | B.null bytestring = Nothing
> | otherwise = Just (c , (c,cs))
> where c = B.unsafeHead bytestring
> cs = B.unsafeTail bytestring
> alexInputPrevChar :: AlexInput -> Char
> alexInputPrevChar (c,_) = c
If you are certian this isn't the first character, you might do better using B.unsafeIndex (-1).
> alexScanTokens :: B.ByteString -> [String]
> alexScanTokens str = go ('\n',str)
> where go inp@(_,str) =
> case alexScan inp 0 of
> AlexToken inp' len act -> act (takebytes len str) : go inp'
> AlexSkip inp' len -> go inp'
> AlexEOF -> []
> AlexError _ -> error "lexical error"
>
>
>
>
> printCount :: Int -> [String] -> IO ()
> printCount c [] = print c
> printCount c (l:ls) = if l == "" then printCount (c+1) ls else printCount c ls
>
> main = do
> file <- B.readFile "trecfile1"
> printCount 0 (alexScanTokens file)
>
> }
More information about the Haskell-Cafe
mailing list