[Haskell-cafe] Re: Fastest regex package?

Thu Feb 5 10:54:00 EST 2009

I was wrong about BoyerMoore: it works about 2x faster than PCRE for
this example, so seems like it does the trick (although I suspect that
one could do even faster, so the question remains).

2009/2/5 Eugene Kirpichov <ekirpichov at gmail.com>:
> Hello.
>
> What is the fastest regex package of the multitude of packages present
> on Hackage (namely, PCRE, PCRE-light, DFA, TDFA and many more)?
>
> My benchmark (parsing a huge logfile with a regex like "GET
> /foo.xml.*fooid=([0-9]++).*barid=([0-9]++)") shows that plain PCRE is
> the fastest one (I tried PCRE, PCRE-light and TDFA; DFA can't do
> capturing groups at all, TDFA was abysmally slow (about 20x slower
> than PCRE), and it doesn't support ++), but maybe have I missed any
> blazing-fast package?
>
> Even PCRE seems somewhat too slow to me - matching this regex eats 50%
> of the program time; considering that the only things it does is
> ungzipping the log and searching for the regex, this seems like far
> too much, the regex is not that complex to my mind. I tried to replace
> this regex by Data.ByteString.Search.KnuthMorrisPratt and BoyerMoore
> from stringsearch, but they work even twice slower.
>
> I also noticed that PCRE does about 2x fewer allocations than
> PCRE-Light, but is not much faster.
> However, PCRE, strangely, has a non-pure interface, I had to use it
> with unsafePerformIO.
>
> It's a pity I can't put these findings on haskellwiki to
> http://www.haskell.org/haskellwiki/Regular_expressions since I don't
> have an account and new accounts can't be created right now. Could
> anyone put them there?
>
> All in all, my question remains: what is the fastest way to do this
> kind of parsing on a lazy bytestring?
>