[Haskell-cafe] Annoucing alternative to Text.Regex

Tue Mar 7 17:26:37 EST 2006

This message is cross-posted to haskell-cafe at haskell.org and libraries at haskell.org

Hi,

  Working on the shootout (specifically
http://shootout.alioth.debian.org/gp4/benchmark.php?test=regexdna&lang=all ), it
was impossible to use Text.Regex since it was too slow (it would reach the
timeout limit, such as an hour, where TCL takes 3.21 seconds ).

  So we had to work around this by implementing our own regex engine.  I have
tooled up an all-parsec version that I think can be a nearly drop-in-alternative
to Text.Regex.  I will describe it a bit and then ask for advice.

It takes the string form of the regular expression and uses Parsec to create a
(data Pattern) tree structure.  This recognizes all of "man re_format" including
back references, missing only the [[:>]] or [[<:]] extension (word boundaries).
 It also accepts new modifiers '?' and '+' to be used like "a*?" and "a*+" for
lazy and possessive matching.

The Pattern form is amenable to simplification, and some easy simplification are
performed.

Then the Pattern is transformed in a Parsec parser that will match the input
(from the beginning only) and return all the captured sub-expressions.  It also
supports (get|set|update)UserState functions.  It should return the longest
match when all the pattern elements are greedy, i.e. no lazy or possessive
modifiers are used.  With lazy or possessive modifiers the longest alternative
is used, but this may not be the longest overall match.  (Parsec keeps track of
line and column position, but not the number of characters, which had to be added.)

Note that the pattern "(.)Q\1" can be used to match strings like "aQa", and
"(a|b\1)+" matches all of "ababba".

The Text.Regex compatibility code embeds the generated Parsec parser into more
complicated parsers and runs them to emulate the different functions in the
Text.Regex interface.

It is possible to extend all of this in several ways:
  1) Turn the Pattern into simpler parsec, e.g. if the expression captures are
not needed
  2) When possible, turn the Pattern into a faster DFA parser instead of Parsec
  3) Create parsers that with different semantics than "longest match", e.g.
left-biased
  4) Creates parsers that return all matches
  5) Employ smart optimizations for special forms of Pattern
  6) With an earlier version, I made a proof of principle String -> ExpQ
transformer that created Parsec at compile time using Template Haskell.  This
could be updated.

Not done: testing.  (If you have clever things to throw at it, pass them along).
 Not done: benchmarking (though it should usually beat Text.Regex).

And now I want to ask for advice.  If there is interest I could post the code
somewhere, but I don't own a stable place to put it at the moment.  Is there a
canonical place on the net to put small Haskell modules (28 kB uncompressed) ?
Is there some darcs-repository I could insert it into ?

Also, is there a library style guide somewhere with best practices?