Policy change for regex libraries
Chris Kuklewicz
haskell at list.mightyreason.com
Mon Jan 12 17:06:31 EST 2009
Before I go and add or change anything in the Haskell regex-posix library, I
wanted to get some feedback. regex-posix provides Text.Regex.Posix, and is
built on the regex-base package.
regex-posix is also used as the backend for Text.Regex (the regex-compat package
does this). I do not intend to change the behavior of the old Text.Regex API.
The main issue is the behavior when returning a list of all matches of a Regex
against a target text. I no longer think the current behavior is the right
choice when it comes to zero-length matches.
The current behavior is to return non-overlapping matches with the caveat that
after the first zero-length match the search is ended. Note that the
zero-length match may be occur at the end position of a previous non-zero-length
match.
Notably, no one has complained about this policy. But I no longer like it. So
here are a few of my ideas of what to change it to:
0) No change, not worth the effort.
1) return the zero-length match, skip forward 1 character, and continue
searching. If the consumer wishes the old policy they can truncate the list.
This could also be filtered to resemble option 2 below.
2) Mimic "sed". It seems "sed" has a policy where a zero-length match is
forbidden to occur at the end position of a non-zero-length match. "sed" does
not stop with the first zero-length match.
3) implement additional execution options, so the user can choose a policy. The
default policy choice left with the current behavior.
4) implement additional execution options, so the user can choose a policy. The
default policy choice set to he behavior in (1).
5) Return valid matches starting from all positions, including overlapping
matches. This I really do not like and one can run the search starting one
character after the start of the last match to get this information.
Matching "0123" and replacing all matches with themselves wrapped in angle
brackets. The policies of 0, 1, and 2 above lead to (computed partly by hand):
regex of "[0123]?"
0): "<0><1><2><3><>"
1): "<0><1><2><3><>"
2): "<0><1><2><3>"
regex of "[012]?"
0): "<0><1><2>3<>"
1): "<0><1><2>3<>"
2): "<0><1><2>3<>"
regex of "[013]?"
0): "<0><1><>23"
1): "<0><1><>2<3><>"
2): "<0><1>2<3>"
regex of "[023]?"
0): "<0><>123"
1): "<0><>1<2><3><>"
2): "<0>1<2><3>"
regex of "[123]?"
0): "<>0123"
1): "<>0<1><2><3><>"
2): "<>0<1><2><3>"
regex of "[03]?"
0): "<0><>123"
1): "<0><>1<>2<3><>"
2): "<0>1<>2<3>"
regex of "[03]?"
0): "<0><>123"
1): "<0><>1<>2<3><>"
2): "<0>1<>2<3>"
regex of "[12]?"
0): "<>0123"
1): "<>0<1><2><>3<>"
2): "<>0<1><2>3<>"
I am leaning to simply changing it from policy 0 to policy 1.
Are there any objections?
Perhaps I should set a deadline? Now where is that library policy...
--
Chris
More information about the Libraries
mailing list