[Haskell-cafe] Difference between Lazy ByteStrings and Strings in alex

Tue Feb 13 22:54:56 EST 2007

On Tue, Feb 13, 2007 at 10:43:11PM -0500, Jefferson Heard wrote:
> I am running GHC 2.6 now, and am using -O3 as my optimization parameter.  I'm 

I think you will get much better performance with GHC 6.6.  The optimizer has been
improved a *lot* in the last 10 years.

(I hope that was a typo!!)

> Non-lazy version
> 
> {
> module Main
>     where
> 
> import qualified FileReader
> 
> }
> 
> %wrapper "basic"
> 
> $letter = [a-zA-Z]
> $digit = 0-9
> $alphanum = [a-zA-Z0-9]
> $punct = [\! \@ \# \$ \% \^ \& \* \( \) \_ \- \+ \= \{ \[ \} \] \\ \| \; \: \' 
> \" \, \. \? \/ \` \~]
> $dec = \.
> $posneg = [\- \+]
> 
> @date1 = jan($punct|uary)?\ $digit{1,2}(\,\ $digit{2,4})?
>        | feb($punct|ruary)?\ $digit{1,2}(\,\ $digit{2,4})?
>        | mar($punct|ch)?\ $digit{1,2}(\,\ $digit{2,4})?
>        | apr($punct|il)?\ $digit{1,2}(\,\ $digit{2,4})?
>        | may?\ $digit{1,2}(\,\ $digit{2,4})?
>        | jun($punct|e)?\ $digit{1,2}(\,\ $digit{2,4})?
>        | jul($punct|y)?\ $digit{1,2}(\,\ $digit{2,4})?
>        | aug($punct|ust)?\ $digit{1,2}(\,\ $digit{2,4})?
>        | sep($punct|tember)?\ $digit{1,2}(\,\ $digit{2,4})?
>        | sept($punct)?\ $digit{1,2}(\,\ $digit{2,4})?
>        | oct($punct|ober)?\ $digit{1,2}(\,\ $digit{2,4})?
>        | nov($punct|ember)?\ $digit{1,2}(\,\ $digit{2,4})?
>        | dec($punct|ember)?\ $digit{1,2}(\,\ $digit{2,4})?
> 
> @date2 = $digit{1,2} $punct $digit{1,2} $punct $digit{2,4}
> 
> @time = $digit{1,2} \: $digit{2} (am|pm)?
> 
> @word = $alphanum+
> 
> @number = $posneg? $digit+ 
>         | $posneg? $digit+ $dec $digit+
>         | $posneg? $digit+ (\,$digit{3})+
>         | $posneg? $digit? (\,$digit{3})+ $dec $digit+
> 
> $white = [\t\r\n\v\f\ ]
> 
> @doc = \< DOC \>
> @tag = \< $alphanum+ \>
>      | \<\/ $alphanum+ \>
> 
> tokens :- 
>   @doc    { \s -> "" }
>   @tag    ;
>   $white+ ; 
>   @time   { \s -> s }
>   @number { \s -> s } 
>   @word   { \s -> s }
>   $punct  ; 
>   .       ;
> 
> {
> 
> printCount c [] = print c
> printCount c (l:ls) = if l == "" then printCount (c+1) ls else printCount c ls
> 
> main = do
>     file <- readFile "trecfile1" 
>     printCount 0 (alexScanTokens file) 
>  
> }

FTR, regular strings are lazy - too lazy, which is where the performance problems come from.

> -- ------------------------------------------------------------------------------------------------------------
> Version depending on ByteString.Lazy -- note that the grammar is the same, so 
> it has been omitted
> -- ------------------------------------------------------------------------------------------------------------
> 
> ... grammar ...
> 
> {
> type AlexInput = (Char,     -- previous char
>                   B.ByteString)   -- current input string
> 
> takebytes :: Int -> B.ByteString -> String
> takebytes (0) _ =  ""
> takebytes n s = c : takebytes (n-1) cs
>     where c = B.index s 0
>           cs = B.drop 1 s
>
> alexGetChar :: AlexInput -> Maybe (Char,AlexInput)
> alexGetChar (_, bytestring) 
>     | bytestring == B.empty = Nothing
>     | otherwise             = Just (c , (c,cs))
>     where c = B.index bytestring 0
>           cs = B.drop 1 bytestring

Hm, you might do better with more specialized functions.

> alexGetChar :: AlexInput -> Maybe (Char,AlexInput)
> alexGetChar (_, bytestring) 
>     | B.null bytestring = Nothing
>     | otherwise         = Just (c , (c,cs))
>     where c = B.head bytestring
>           cs = B.tail bytestring

or even:

> alexGetChar :: AlexInput -> Maybe (Char,AlexInput)
> alexGetChar (_, bytestring) 
>     | B.null bytestring = Nothing
>     | otherwise         = Just (c , (c,cs))
>     where c = B.unsafeHead bytestring
>           cs = B.unsafeTail bytestring

> alexInputPrevChar :: AlexInput -> Char
> alexInputPrevChar (c,_) = c

If you are certian this isn't the first character, you might do better using B.unsafeIndex (-1).

> alexScanTokens :: B.ByteString -> [String]
> alexScanTokens str = go ('\n',str)
>   where go inp@(_,str) =
>           case alexScan inp 0 of
>                 AlexToken inp' len act -> act (takebytes len str) : go inp'
>                 AlexSkip  inp' len     -> go inp'                
>                 AlexEOF -> []
>                 AlexError _ -> error "lexical error"
>                 
> 
> 
> 
> printCount :: Int -> [String] -> IO ()
> printCount c [] = print c
> printCount c (l:ls) = if l == "" then printCount (c+1) ls else printCount c ls
> 
> main = do
>     file <- B.readFile "trecfile1" 
>     printCount 0 (alexScanTokens file) 
>  
> }