[Haskell-cafe] Re: ANN: weighted-regexp-0.1.0.0
Sebastian Fischer
sebf at informatik.uni-kiel.de
Wed Jul 28 08:39:30 EDT 2010
> Have you thought about supporting anchors like (^) and ($) ?
We went the opposite route, made full matching the default, and
implemented partial matching by pre- and appending .*
As there are both a fullMatch and a partialMatch function, I don't see
an immediate need for anchors, although I admit that they have the
advantage that you can specify *in the regexp* whether you want full
or partial matching.
> You also wrote
>> I also want to add substring matching,
>> i.e., the possibility to find out against which strings parenthesized
>> parts of the regexp were matched.
>
> Getting a good specification for the substring rules is non-trivial.
> When writing regex-tdfa (on hackage) I consulted Glenn Fowler's
> description at
>
>> http://www2.research.att.com/~gsf/testregex/re-interpretation.html
>
> which are well-defined rules for POSIX substring capture.
Thanks for the pointer. I'll consult it when going for submatch
extraction.
> In the spirit of your DFA I am guessing the "mark" type for substring
> capture would actually be an annotated copy of the whole regular
> expression tree.
I am not sure whether it is a good idea to implement submatch capture
only by means of a different semiring. I would not hesitate to extend
the matching algorithm itself, if that leads to a clean implementation.
> But regex-tdfa is quite old and over-complicated and while it is
> linear
> in complexity it is not especially fast. I am currently rewriting the
> algorithm in OCaml (to learn OCaml) and I am trying to avoid the
> /a{2,5}/ blowup that regex-tdfa has.
Did you experience problems with such blowup? The re2 library also
implements bounded repetitions by blowing them up, so, at least, it
doesn't seem to be a big problem for Google.
Cheers,
Sebastian
--
Underestimating the novelty of the future is a time-honored tradition.
(D.G.)
More information about the Haskell-Cafe
mailing list