[Haskell-cafe] Difference between Lazy ByteStrings and Strings in
alex
Jefferson Heard
jeff at renci.org
Tue Feb 13 22:43:11 EST 2007
It was suggested that I might derive some performance benefit from using lazy
bytestrings in my tokenizer instead of regular strings. Here's the code that
I've tried. Note that I've hacked the "basic" wrapper code in the Lazy
version, so the code should be all but the same. The only thing I had to do
out of the ordinary was write my own 'take' function instead of using the
substring function provided by Data.Lazy.ByteString.Char8. The take function
I used was derived from the one GHC uses in GHC.List and produces about the
same code.
The non-lazy version runs in 38 seconds on a 211MB file versus the lazy
versions 41 seconds. That of course doesn't seem like that much, and in the
non-lazy case, I have to break the input up into multiple files, whereas I
don't have to in the lazy version -- this does not take any extra time. The
seconds do add up to a couple of hours for me, though once I'm done, and so
I'd like to understand why, when the consensus was that Data.ByteString.Lazy
might give me better performance in the end, it doesn't do so here.
I am running GHC 2.6 now, and am using -O3 as my optimization parameter. I'm
profiling the code now, but was wondering if there was any insight...
-- Jeff
Non-lazy version
{
module Main
where
import qualified FileReader
}
%wrapper "basic"
$letter = [a-zA-Z]
$digit = 0-9
$alphanum = [a-zA-Z0-9]
$punct = [\! \@ \# \$ \% \^ \& \* \( \) \_ \- \+ \= \{ \[ \} \] \\ \| \; \: \'
\" \, \. \? \/ \` \~]
$dec = \.
$posneg = [\- \+]
@date1 = jan($punct|uary)?\ $digit{1,2}(\,\ $digit{2,4})?
| feb($punct|ruary)?\ $digit{1,2}(\,\ $digit{2,4})?
| mar($punct|ch)?\ $digit{1,2}(\,\ $digit{2,4})?
| apr($punct|il)?\ $digit{1,2}(\,\ $digit{2,4})?
| may?\ $digit{1,2}(\,\ $digit{2,4})?
| jun($punct|e)?\ $digit{1,2}(\,\ $digit{2,4})?
| jul($punct|y)?\ $digit{1,2}(\,\ $digit{2,4})?
| aug($punct|ust)?\ $digit{1,2}(\,\ $digit{2,4})?
| sep($punct|tember)?\ $digit{1,2}(\,\ $digit{2,4})?
| sept($punct)?\ $digit{1,2}(\,\ $digit{2,4})?
| oct($punct|ober)?\ $digit{1,2}(\,\ $digit{2,4})?
| nov($punct|ember)?\ $digit{1,2}(\,\ $digit{2,4})?
| dec($punct|ember)?\ $digit{1,2}(\,\ $digit{2,4})?
@date2 = $digit{1,2} $punct $digit{1,2} $punct $digit{2,4}
@time = $digit{1,2} \: $digit{2} (am|pm)?
@word = $alphanum+
@number = $posneg? $digit+
| $posneg? $digit+ $dec $digit+
| $posneg? $digit+ (\,$digit{3})+
| $posneg? $digit? (\,$digit{3})+ $dec $digit+
$white = [\t\r\n\v\f\ ]
@doc = \< DOC \>
@tag = \< $alphanum+ \>
| \<\/ $alphanum+ \>
tokens :-
@doc { \s -> "" }
@tag ;
$white+ ;
@time { \s -> s }
@number { \s -> s }
@word { \s -> s }
$punct ;
. ;
{
printCount c [] = print c
printCount c (l:ls) = if l == "" then printCount (c+1) ls else printCount c ls
main = do
file <- readFile "trecfile1"
printCount 0 (alexScanTokens file)
}
-- ------------------------------------------------------------------------------------------------------------
Version depending on ByteString.Lazy -- note that the grammar is the same, so
it has been omitted
-- ------------------------------------------------------------------------------------------------------------
... grammar ...
{
type AlexInput = (Char, -- previous char
B.ByteString) -- current input string
takebytes :: Int -> B.ByteString -> String
takebytes (0) _ = ""
takebytes n s = c : takebytes (n-1) cs
where c = B.index s 0
cs = B.drop 1 s
alexGetChar :: AlexInput -> Maybe (Char,AlexInput)
alexGetChar (_, bytestring)
| bytestring == B.empty = Nothing
| otherwise = Just (c , (c,cs))
where c = B.index bytestring 0
cs = B.drop 1 bytestring
alexInputPrevChar :: AlexInput -> Char
alexInputPrevChar (c,_) = c
alexScanTokens :: B.ByteString -> [String]
alexScanTokens str = go ('\n',str)
where go inp@(_,str) =
case alexScan inp 0 of
AlexToken inp' len act -> act (takebytes len str) : go inp'
AlexSkip inp' len -> go inp'
AlexEOF -> []
AlexError _ -> error "lexical error"
printCount :: Int -> [String] -> IO ()
printCount c [] = print c
printCount c (l:ls) = if l == "" then printCount (c+1) ls else printCount c ls
main = do
file <- B.readFile "trecfile1"
printCount 0 (alexScanTokens file)
}
More information about the Haskell-Cafe
mailing list