Policy change for regex libraries

Chris Kuklewicz haskell at list.mightyreason.com
Mon Jan 12 18:16:15 EST 2009


mail at justinbogner.com wrote:
 > Policy 2 seems the most intuitive to me, I can't imagine a situation
 > where you would want the empty match at the end of a non-empty match. On
 > the other hand, I don't think I've ever used a regular expression that
 > matched an empty string, so it's not particularly important to me.

The authors of sed are in agreement with your intuition.  But I think policy 2 
as a recursive definition it is unusual.  And I see policy 2 as very hard to 
summarize.

The single-match is always easy to summarize: the leftmost longest match.

Policy 1 for multiple matches can be summarized as:

 > All leftmost longest non-overlapping matches, where overlapping means sharing
 > the same matching characters.

Policy 2 for multiple matches can be summarized as:

 > All leftmost longest non-overlapping matches, where overlapping means sharing
 > the same matching characters, excluding zero-length matches that coincide with
 > the end of a non-zero-length match.

Policy 0 has been:

 > All leftmost longest non-overlapping matches, where overlapping means sharing
 > the same matching characters, stopping with the first zero-length match.

Policy 5 would be even simpler in this language:

 > The longest match at each position where a match occurs.

Policy 5 can be filtered to get policy 1.
Policy 1 can be filtered to get policy 0 or to get policy 2.

The best practices for a regexp search with multiple matches is to not use a 
regexp that can match zero-length bits of text (though it may be handy for 
finding newlines there are simpler ways).

So let me also put for "Policy 6"

 > All leftmost longest non-overlapping matches, where overlapping means sharing
 > the same matching characters, excluding all zero-length matches.

Note that the only zero-length POSIX extended regular expression matches are () 
which always matches or ^ or $ or ^$ (which is the same as $^).  But the policy 
I choose here will likely be applied to other regular expression backends that 
can do fancier testing.

Thanks for the comments!
   Chris


More information about the Libraries mailing list