ANN: regex-* updates and regex-tdfa 0.20 first release

Mon Jan 22 12:17:44 EST 2007

Greetings.  I have several bits of news about regex-* packages.

  First: I have fixed a small bug in one instance of RegexContext, defined in
Text.Regex.Contex in the regex-base class (included in GHC).  The new version of
regex-base is 0.80 and is available at
http://darcs.haskell.org/packages/regex-unstable/regex-base/.  The instance
fixed was
> instance (RegexLike a b) => RegexContext a b (MatchResult b) where
and I doubt this affects anyone at the moment.

  Second: There is also a new version of regex-tre at
http://darcs.haskell.org/packages/regex-unstable/regex-tre/ (version
0.81) that exposes the 'compRightAssoc' option when making a Regex
(see Text.Regex.TRE.Wrap for all the options).

  Third: I wish to note that that the libtre backend (up to the
current version 0.7.5) that regex-tre uses is buggy and that regex-tre
/ libtre should only be used where testing shows that no bugs are
being trigged.  (Why go to that much trouble?  Because the libtre
backend is very very fast).

  Fourth: The fabulous new all Haskell POSIX compatible backend!

  This is a first announcement of regex-tdfa, my "tagged" DFA regular
expression backend.  It is an implementation of regular expressions in
Haskell; it does not hand off the data to a library written in c.  The
ultimate goal of regex-tdfa is to replace regex-posix as the default
engine that come with GHC.  Why go to the trouble of replacing
regex-posix?  Because regex-posix is very very slow (and not quite
POSIX, as seen at the end of this message).

  The new regex-tdfa does not use the slow backtracking method like my
pure Haskell backend regex-posix.  It uses a deterministic finite
automata, similar in spirit to one I derived from CTKlight for use in
the pure Haskell regex-dfa backend (which is LGPL -- the rest are
BSD-3 license).  But regex-dfa is fast at a cost of not returning any
parenthesized subexpression captures (i.e only \0 and not \1 \2 \3
...).

  The regex-tre backend (BSD3) that uses the libtre c library (LGPL)
employs a special "tagged" dfa technique to get subexpression captures
while still using an efficient DFA.  The regex-posix engine that comes
with GHC is impossibly slow but regex-tre is very very fast and
(claims to also be) a POSIX engine. I was thus inspired to create
regex-tdfa which aims to create a pure Haskell variant of this
tagged-DFA algorithm under a simple BSD3 license.

  The 0.20 version of regex-tdfa that I am announcing is available at
http://darcs.haskell.org/packages/regex-unstable/regex-tdfa/ and
requires the updated regex-base (version 0.80 or greater) and is
setup to compile for GHC 6.6.  These could be back-ported to GHC 6.4
by adding the fps package to GHC 6.4 to get Data.ByteString and
removing references to Data.Sequence in the new regex-base and
regex-tdfa packages.

  As of now, my first untuned version of the regex-tdfa backend
performs about 2-3 times slower than regex-tre and the libtre c
library.  This also means regex-tdfa performs 3+ times faster than the
regex-parsec engine.  So the goal of a fast replacement for
regex-posix is close to hand.

  The bad news: libtre has two problems, the first is that it is
buggy.  I have found patterns that the libtre engine does not handle
correctly: the wrong match for the whole expression and/or
subexpressions is returned (or it will fail to match when it should
have succeeded).  The second problem is that it does not always follow
the POSIX standard policy in determining which possible parenthesized
subexpession captures to return when there are several possibilities.

  My regex-tdfa engine does not have the bugs that I have found in
libtre.  But since it uses the same design concepts as libtre it had
the same problem with not following the POSIX standard policy for
capturing subexpressions.  To fix this I have exteneded the tagging
technique to keep enough information to follow the POSIX spec.  So of
all the regex-* backends, the regex-tdfa is the only one that passes
all the tests that I have for POSIX compatibility.

  End note: The regex-posix engine, despite its name, does not follow the POSIX
standard policy for capturing subexpressions.  Example 1 of 2:
"ab" =~ "((a?)((ab)?))(b?)"
This should capture the sub ranges (0,2)(0,2)(0,0)(0,2)(0,2)(2,2)
But it does capture the sub ranges (0,2)(0,1)(0,1)(1,1)(-1,-1)(1,2)
The first group is (0,2) means the whole match was \0=="ab"
The second group (0,2) or (0,1) means the first group "((a?)((ab)?))" should
match \1=="ab" but it returns \1=="a".  This first group should have been chosen
for maximum length and the failure to do so is a policy violation.

Example 2 of 2:
"x" =~ "(.?)*"
This should capture: (0,1)(0,1)
But it does capture: (0,1)(1,1)
The first group of (0,1) means the regex did match the whole \0=="x".
But the regex-posix engine matched an extra iteration of the "(.?)" under the
"*" operator against the empty string after the "x" and returned the empty
capture of (1,1).  This highly non-useful behavior violates POSIX policy.

My regex-parsec engine follows a non-POSIX policy, in which it
maximizes the length of just the captured subexpressions instead of
all the subpatterns of the regular expression.  This is useful and was
easy to write and is easy to explain and agrees with the policy of
other libraries such as Boost.Regex for C++ ( as in the third FAQ at
http://www.boost.org/libs/regex/doc/faq.html )

But to replace regex-posix I need a fast POSIX policy engine, which is
what regex-tnfa has now achieved.

I will submit regex-tdfa to replace regex-posix once I have a stable
version that has been further cleaned up and tuned.  Anyone who wishes
a useful performance time/spacce tuning challenge is free to help.  If
you have any evil or stressful examples of regular expressions, then I
welcome adding them to my tests.

-- 
Chris Kuklewicz