ANN: regex-tdfa 0.56

Sat Jan 27 18:18:20 EST 2007

Greetings,

  My new regular expression backend "regex-tdfa" is passing all the tests I have
been throwing at it, so it is time to share the good news.

First up: Where is it? Version 0.56 is in my "stable" location at
> darcs get --partial darcs.haskell.org:/home/darcs/packages/regex-tdfa
The version that will get updated and broken more often is "unstable" at
> darcs get --partial darcs.haskell.org:/home/darcs/packages/regex-unstable/regex-tdfa

It needs a slightly updated regex-base package (version >= 0.80) at
> darcs get --partial darcs.haskell.org:/home/darcs/packages/regex-unstable/regex-base

It has only been tested (on Max OS X) with GHC 6.6 and depends on base >= 2.0
for Data.ByteString, Data.ByteString.Lazy, and Data.Sequence.

** What does regex-tdfa do?

  It runs POSIX regular expression searches correctly.  No other regular
expression backend for Haskell does this correctly.  The regular expression can
be chosen to be from a String, ByteString, ByteString.Lazy, or (Data.Seq Char).
 Independent of that choice the searched text can also be any of those types.

** Properties and benefits of regex-tdfa:

  It follows a known standard: POSIX. (If you find that it disagrees with the
standard then I will fix it.)

  It is written in Haskell, so there is no C library to compile and install.
Though it is currently only written for GHC 6.6, it could be ported to any other
compiler with only minor changes.  (I expect that if you drop the polymorphic
MPTC interface of regex-base it could be made Haskell98).

  It works with Unicode since it can be used on [Char] (and it works if there
are '\NUL' characters around).

  It is lazy in the regular expression: it only builds the DFA if you use it,
and only builds the state it needs, which are memoized in a lazy Trie.

  It is lazy in the searched text, and the pure function that searches the text
only evaluates the text until the next match is done and does not hold onto the
initial text.  Thus you could efficiently search a [Char] up to the size of the
Position type, which is an Int for now.

  It is free, the regex-tdfa library is under a BSD-3 license.

** Does it work?

  Yes.

  The regex-posix backend that come with GHC does not run all the searches
correctly: it is buggy (example at end of announcement).  And the regex-posix
backend is also impossibly slow -- the slowest backend by far -- orders of
magnitude slower.  The regex-tre backend that wraps the TRE library has exposed
this very fast library to the harsh gaze of QuickCheck.  The lesson from
QuickCheck is to never use regex-tre because the TRE library (version 0.7.4) is
buggy.

  For regex-tdfa backend I took/stole ideas from Ville Laurikari's Master thesis
which describes the algorithm that TRE uses.  And I made a pure Haskell
implementation that now passes all the tests at
http://www.research.att.com/~gsf/testregex/testregex.html and follows the notes
at http://www.research.att.com/~gsf/testregex/re-interpretation.html
and passes those tests as well.  And it "passes" QuickCheck tests which actually
just compare the answers that regex-posix, regex-tre, and regex-tdfa since I do
not have a bug free reference implementation to compare against.

** Why the name TDFA?

  Tagged Deterministic Finite Automata.  The Tagged part is from Ville
Laurikari's thesis work.

** Speed:

  If you do not care only about the match as a whole and not about subexpression
matching then you can set the captureGroups ExecOption to False:

makeMyRegex = makeRegex defaultCompOpt myExecOpt
  where myExecOpt = defaultExecOpt { captureGroups = False }

And redefine (=~) and (=~~) to use makeMyRegex instead of makeRegex:

(=~) :: (RegexMaker Regex CompOption ExecOption source,RegexContext Regex
source1 target) => source1 -> source -> target
(=~) s r = let re = makeMyRegex r :: Regex in match re s

(=~~) :: (RegexMaker Regex CompOption ExecOption source,RegexContext Regex
source1 target,Monad m)=> source1 -> source -> m target
(=~~) s r = let re = makeMyRegex r :: Regex in match re s

The benefit to this is speed, as it then proceeds to run all matches of
"b(..?c)?d" against a large test file *faster* than the the TRE C library
backend.  My compliments to those who worked on ghc and -O2.  It even runs
nearly as fast as regex-dfa in this mode.  Thus regex-dfa is (mostly) obsolete.

If you use the default and capture the subgroups then the speed is reduced by a
factor of about 4.  This is because of the very simple data structures that keep
track of the run time state.  I intend to improve this aspect of the design now
that I have a working algorithm.

I have not benchmarked the speed of compiling the regular expression from the
String. This compiling is done with very lazy data structures, so the DFA size
is really only as big as you actually explore while running the match.

** Examples of bugs in regex-posix and regex-tre:

An example for regex-posix:

> *Main> "ab" Posix.=~ "((a)|(b)){2,}" :: MatchArray
> array (0,3) [(0,(0,2)),(1,(1,1)),(2,(0,1)),(3,(1,1))

The array holds (group #, (match offset, match length))

The 0th group is the whole match "ab". The 2nd group is (a) and it matched the
first "a" and the third group is (b) and matched the "b".  But the second group
is a child of the first group ((a)|(b)) which matched "b".  The standard
requires that captured groups are nested inside their parents, and here "a" is
not not nested inside "b".

So the regex-posix C backend that comes with GHC is buggy.

The correct answer is, where the second group matched nothing:
> *Main> "ab" TDFA.=~ "((a)|(b)){2,}" :: MatchArray
> array (0,3) [(0,(0,2)),(1,(1,1)),(2,(-1,0)),(3,(1,1))]

An example for regex-tre:

"yyyyyy" TRE.=~ "(yyy|(x?)){2,4}"

array (0,2) [(0,("yyyyyy",(0,6))),(1,("yyy",(3,3))),(2,("yyy",(3,3)))],"")

Here TRE reports that the second group (x?) which should only match "" or "x"
somehow matched "yyy".  This is even more wrong than the regex-posix bug above.

The correct answer, where the second group does not match:
array (0,2) [(0,("yyyyyy",(0,6))),(1,("yyy",(3,3))),(2,("",(-1,0)))],"")

-- 
Chris Kuklewicz