Policy change for regex libraries
Chris Kuklewicz
haskell at list.mightyreason.com
Mon Jan 12 18:16:15 EST 2009
mail at justinbogner.com wrote:
> Policy 2 seems the most intuitive to me, I can't imagine a situation
> where you would want the empty match at the end of a non-empty match. On
> the other hand, I don't think I've ever used a regular expression that
> matched an empty string, so it's not particularly important to me.
The authors of sed are in agreement with your intuition. But I think policy 2
as a recursive definition it is unusual. And I see policy 2 as very hard to
summarize.
The single-match is always easy to summarize: the leftmost longest match.
Policy 1 for multiple matches can be summarized as:
> All leftmost longest non-overlapping matches, where overlapping means sharing
> the same matching characters.
Policy 2 for multiple matches can be summarized as:
> All leftmost longest non-overlapping matches, where overlapping means sharing
> the same matching characters, excluding zero-length matches that coincide with
> the end of a non-zero-length match.
Policy 0 has been:
> All leftmost longest non-overlapping matches, where overlapping means sharing
> the same matching characters, stopping with the first zero-length match.
Policy 5 would be even simpler in this language:
> The longest match at each position where a match occurs.
Policy 5 can be filtered to get policy 1.
Policy 1 can be filtered to get policy 0 or to get policy 2.
The best practices for a regexp search with multiple matches is to not use a
regexp that can match zero-length bits of text (though it may be handy for
finding newlines there are simpler ways).
So let me also put for "Policy 6"
> All leftmost longest non-overlapping matches, where overlapping means sharing
> the same matching characters, excluding all zero-length matches.
Note that the only zero-length POSIX extended regular expression matches are ()
which always matches or ^ or $ or ^$ (which is the same as $^). But the policy
I choose here will likely be applied to other regular expression backends that
can do fancier testing.
Thanks for the comments!
Chris
More information about the Libraries
mailing list