[Haskell-cafe] Re: Fastest regex package?

Sat Feb 7 03:44:38 EST 2009

2009/2/5 ChrisK <haskell at list.mightyreason.com>:
> Eugene Kirpichov wrote:
>>>
>>> All in all, my question remains: what is the fastest way to do this
>>> kind of parsing on a lazy bytestring?
>>>
>
> Your example regular expression works the same in both Posix and Perl-ish
> semantics.   Do you know the difference?  Posix libraries look for the
> longest match of all possible matches.  Perl-ish is left-bias and looks at
> the left branch first and only looks at the right branch when the left fails
> and it has to backtrack.

Thanks, I didn't know that about posix.

>
> So the "++" operator is a hack to try and control the backtracking of Perl
> regular expressions.  Such a things has no meaning in Posix where the
> implementation details are totally different.
>

That's precisely what I put it in for :)

> You might try this variant of the example pattern:
>  /foo.xml.*fooid=([0-9]+)[^0-9].*barid=([0-9]+)
>
> The [^0-9] can be used when you know that there is at least one junk
> character before the barid, which I suspect will always occur in a URL.
>
> I expect regex-posix to be slower than regex-pcre. I have not used the new
> pcre-light.  I wrote regex-tdfa — it is pure haskell and not a C library
> wrapper.  There are patterns where regex-pcre will backtrack and take
> exponentially more time than regex-tdfa's automaton (which is not yet ideal
> and may get faster).

You're right :)

>
> So what is the lazy bytestring with its multiple buffers doing for you when
> using PCRE, PCRE-light, or regex-posix? Absolutely nothing.  To run against
> these C libraries the target text is converted to a single buffer, i.e. a
> CStringLen in Haskell.  Thus it is morally converted into a strict
> bytestring. This may involve copying the logfile into a NEW strict
> bytestring EVERY TIME you run a match.  Please Please Please convert to a
> strict bytestring and then run regex-pcre or pcre-light (or any of the
> others).
>

I'm running the regex on lines of the logfile, not on the whole
logfile itself; and I'm running it only once per line, so that doesn't
matter much.
Actually, that's what I did in the initial version of the program
(strictify lines before matching), but then I experimentally got rid
of the extra conversion function and performance didn't change at all
(anyways, someone was converting that string exactly once, be it me or
the package), so I left it that way.

> regex-tdfa does not convert it into a strict bytestring, but is otherwise
> much slower than pcre for your simple pattern.
>
> As for regex-pcre's interface....you should use the API in regex-base to get
> a pure interface.   The RegexLike functions are the pure interface for this,
> and the RegexContext class offers a slew of instances with useful variants.

Thanks, I didn't know that.

>  But if you have been getting to the "low level IO API" in regex-pcre then
> you probably do not need or want the RegexContext transformations.

Is that because of their performance?

>
> And BoyerMoore (which I think I helped optimize): this may be faster because
> it does not copy your whole Lazy bytestring into a Strict ByteString for
> each search.  But you may wish to test it with a Strict ByteString as input
> anyway.
>
> --
> Chris
>