[Haskell-cafe] Re: ANN: weighted-regexp-0.1.0.0
Chris Kuklewicz
haskell at list.mightyreason.com
Wed Jul 28 18:07:48 EDT 2010
> Maybe I underestimated the utility of ^ and $. The definition seems
> intricate. I thought about adding a combinator for matching newline but
> now think that would lead to wrong start and end positions. For example
> the start position of the matching substring for ^a in "a\na" should
> be 2 not 1, right? Or is it 0 although there is no newline at the
> beginning?
The first "a" would match with indexes (0,1) and the second "a" would
match with indexes (1,2).
>
> Is there a page with examples that show how ^ and $ should behave exactly?
>
Without REG_NEWLINE the meanings are:
. matches any single character (though note that handling of a zero byte
is impossible for C style strings for a different reason).
^ is an assertion that instead of being AlwaysTrue (eps)or AlwaysFalse
(noMatch) is true before any characters have been accepted and false
afterward.
$ is an assertion that is true only when there are no more characters to
match and false before this.
With REG_NEWLINE the meanings are:
. matches any single character EXCEPT '\n' newline (ASCII 10, I think).
^ is true before any characters have been matched and true right after a
newline has been matched, else false.
^ is true when there are no more characters to match and true if the
next character to match is a newline, else false.
Let 'a' and 'b' and 'c' be some complicated regular expressions that
cannot accept a newline with REG_NEWLINE enabled:
^$ finds blank lines, the indexes between newlines or between a newline
and the start or end of the text.
^a$ requires 'a' to exactly fill a line and the captured string has no
newlines.
A more complicated use, perhaps as part of a crazy parser:
"(a(\n)?)(^|b)(c|$)" has 'a' much some text and perhaps the newline. If
the newline was there then the ^ matches and b might be skipped,
otherwise b must be used. The match ends with '(c|$)' is thus either
starting the new line or trailing b. And (c|$) can avoid matching 'c'
if the next character is a newline.
Note that the regular expression "(^|[aA])" has a non-trivial
"can_accept_empty" property: it can sometimes accept empty.
And if you are recording parenthetical captures then "(^)?" is subtle.
When ^ is true the (^) succeeds like () and when it is false it does
not. This inserts a test into the pattern that can be checked later.
And "((^$)|(^)|($))" is worse: it does not always succeed and which
sub-pattern gets captured depends on the presence of one or two newlines.
In "((^)|(^$))" it is impossible for (^$) to be used since the first (^)
will always be favored by the POSIX rules.
Similarly "(()|(^))" will never use (^).
A small chunk of regex-tdfa sifts through the possible ways to accept 0
characters for each node in the parse-tree and keeps an ordered list of
sets of assertions to check, and cleans outs those that are logically
excluded.
Slightly more useful anchors are added in Perl/PCRE:
> ANCHORS AND SIMPLE ASSERTIONS
> \b word boundary
> \B not a word boundary
> ^ start of subject
> also after internal newline in multiline mode
> \A start of subject
> $ end of subject
> also before newline at end of subject
> also before internal newline in multiline mode
> \Z end of subject
> also before newline at end of subject
> \z end of subject
> \G first matching position in subject
I added \b \B as above, and added \` \' to be like \A and \Z above, and
added \< and \> to be beginning and end of word assertions.
With enough assertions and negated assertions one could level up to
using a binary decision diagram to express when a sub-pattern can accept
0 characters.
Ville's libtre gets this wrong:
> Searched text: "searchme"
> Regex pattern: "((s^)|(s)|(^)|($)|(^.))*"
> Expected output: "(0,1)(0,1)(-1,-1)(0,1)(-1,-1)(-1,-1)(-1,-1)"
> Actual result : "(0,1)(0,1)(-1,-1)(0,1)(1,1)(1,1)(-1,-1)"
And sometimes very wrong:
> Searched text: "searchme"
> Regex pattern: "s(^|())e"
> Expected output: "(0,2)(1,1)(1,1)"
> Actual result : "NOMATCH"
Cheers,
Dr. Chris Kuklewicz
More information about the Haskell-Cafe
mailing list