[Haskell-cafe] Re: Fastest regex package?
ChrisK
haskell at list.mightyreason.com
Thu Feb 5 13:59:12 EST 2009
Eugene Kirpichov wrote:
>> All in all, my question remains: what is the fastest way to do this
>> kind of parsing on a lazy bytestring?
>>
Your example regular expression works the same in both Posix and Perl-ish
semantics. Do you know the difference? Posix libraries look for the longest
match of all possible matches. Perl-ish is left-bias and looks at the left
branch first and only looks at the right branch when the left fails and it has
to backtrack.
So the "++" operator is a hack to try and control the backtracking of Perl
regular expressions. Such a things has no meaning in Posix where the
implementation details are totally different.
You might try this variant of the example pattern:
/foo.xml.*fooid=([0-9]+)[^0-9].*barid=([0-9]+)
The [^0-9] can be used when you know that there is at least one junk character
before the barid, which I suspect will always occur in a URL.
I expect regex-posix to be slower than regex-pcre. I have not used the new
pcre-light. I wrote regex-tdfa — it is pure haskell and not a C library
wrapper. There are patterns where regex-pcre will backtrack and take
exponentially more time than regex-tdfa's automaton (which is not yet ideal and
may get faster).
So what is the lazy bytestring with its multiple buffers doing for you when
using PCRE, PCRE-light, or regex-posix? Absolutely nothing. To run against
these C libraries the target text is converted to a single buffer, i.e. a
CStringLen in Haskell. Thus it is morally converted into a strict bytestring.
This may involve copying the logfile into a NEW strict bytestring EVERY TIME you
run a match. Please Please Please convert to a strict bytestring and then run
regex-pcre or pcre-light (or any of the others).
regex-tdfa does not convert it into a strict bytestring, but is otherwise much
slower than pcre for your simple pattern.
As for regex-pcre's interface....you should use the API in regex-base to get a
pure interface. The RegexLike functions are the pure interface for this, and
the RegexContext class offers a slew of instances with useful variants. But if
you have been getting to the "low level IO API" in regex-pcre then you probably
do not need or want the RegexContext transformations.
And BoyerMoore (which I think I helped optimize): this may be faster because it
does not copy your whole Lazy bytestring into a Strict ByteString for each
search. But you may wish to test it with a Strict ByteString as input anyway.
--
Chris
More information about the Haskell-Cafe
mailing list