Regular Patterns (RE: [Haskell] regular expression syntax)
Niklas Broberg
n_broberg at hotmail.com
Fri Feb 27 14:04:46 EST 2004
>Subject: [Haskell] regular expression syntax - perl ain't got nothin on
>haskell
Agreed, Perl certainly ain't got nothing on Haskell, but we could go even
further than just imitating (although better than the original) Perl
functionality. =)
Hence:
As a spin-off of another project [1], we found ourselves in need of a nice
way of expressing regular expression matching in Haskell, and not only for
strings. We came up with the idea of HaRP: Haskell Regular Patterns. We have
implemented this as a pre-processor to ordinary Haskell and it works quite
well. However it is something we are still working on, and if it hadn't been
for this discussion on the list we would probably have waited a few weeks
more before presenting it, but here goes:
Simple pattern matching on concrete, fully specified lists can be done in
Haskell as so:
foo [Foo, Bar, Baz] = ...
We propose an extension of this to regular pattern matching, example:
foo [/ Foo Bar* Baz /] = ...
The intuition of the above is that we get a match for any list that starts
with a Foo, ends with a Baz, and has zero or more Bar in between.
Regular patterns that can be used are:
* - match zero or more
+ - match one or more
? - match zero or one
a | b - match a or b
(/ a b /) - match the sequence of a, then b (this is also implicit in the
top level [/ ... /]).
Applying the regular expressions on patterns gives some extra nice features.
One is that the regular patterns are "type safe", i.e. they are not encoded
in strings. Another is that identifiers can be named and bound inside
regular patterns, examples:
foo [/ _* a /] = ... => a is bound to the last element of the list
foo [/ a@(/ _ _ /) _* /] = ... => a is bound to the list containing the
first two elements
foo [/ (/ a _ /)* /] = ... => a is bound to the list of the first, third,
fifth etc elements of a list of even length
Note that since we introduce bindings of identifiers inside what we call
numerating patterns, i.e. patterns that imply an uncertain number of
matches, identifiers are automatically bound to lists inside such patterns.
We also introduce an explicit list-binding operator, @:, for accumulating
items in lists as so:
foo [/ (_ a@:(/ _ _ /))* /] = ... => a is bound to a list of lists (exactly
what the elements will be is left as an exercise to the reader ;)
The types of sub-patterns are as follows (a :: a, b :: b):
a* => [a]
a+ => [a]
a? => Maybe a
(/ ... /) => [e], where e is the type of the elements in the list matched,
regardless of sub-patterns
( a | b ) => Either a b
As more complete example using all the presented features:
foo [/ _ a at 1 b c at 3* 4+ d at 5? e@(/ f@:6 g /)* h@( 8 | (/ 9 i /) ) /] =
(a,b,c,d,e,f,g,h,i)
Assuming all the numerical literals are of type Int, foo will have the
following type:
foo :: [Int] -> (Int, Int, [Int], Maybe Int, [[Int]], [Int], [Int], Either
Int [Int], [Int])
Examples of applying foo to some lists:
?> foo [0,1,2,3,4,5,6,7,8]
(1, 2, [3], Just 5, [[6,7]], [6], [7], Left 8, [])
?> foo [0,1,2,3,3,3,4,6,0,6,1,6,2,9,10]
(1,2,[3,3,3], Nothing, [[6,0],[6,1],[6,2]], [6,6,6], [0,1,2], Right [9,10],
[10])
Discussion of each variable in detail:
a :: Int - a binds to a single element at top level (top level meaning it is
bound outside any numerating pattern).
b :: Int - b binds to a single element at top level.
c :: [Int] - c is bound to a zero-or-many pattern, and it will contain all
the matches of the sub-pattern, in this case all matches of 3.
d :: Maybe Int - d is bound to a zero-or-one pattern, and it will be Nothing
in case of zero matches, and Just the match to the sub-pattern in case of a
match, in this case 5.
e :: [[Int]] - e is bound to a zero-or-more pattern, and will thus contain a
list of all the matches of the sub-pattern. In this case the sub-pattern is
a sequence, which has a list type, so the type of e is a list of lists.
f :: [Int] - f is bound using the list-binding operator @:, so its type will
always be a list of the type of the sub-pattern, regardless of the context
it appears in. It will contain all matches of the sub-pattern (Note that a
normal bind using @ would have been illegal here). At top level (and in
ordinary pattern matching), the pattern foo is equivalent to foo at _, but
inside numerating patterns the pattern foo is equivalent to foo@:_. (see
discussion below)
g :: [Int] - g is equivalent to g@:_ as mentioned above, so the same will
hold for g as for f.
h :: Either Int [Int] - h is bound to a choice pattern (or union pattern if
you prefer), so it will be bound to the match of one of the two
sub-patterns, annotated with Left or Right. In this case the left
sub-pattern matches a single element of type Int, whereas the right
sub-pattern matches a sequence of type [Int].
i :: [Int] - Since the choice pattern is numerating (each of the
sub-patterns are matched zero or one times), i is equivalent to i@:_.
For completeness, another example to show how sequences work:
bar :: [Int] -> [Int]
bar [/ 0 a@(/ 1* 2 (3|4) (/ 5 6 /) 7? /) /] = a
In this case a will have the type [Int], since a sequence will always have
the type [e] where e is the type of the elements of the list to match. So in
this example,
?> bar [0,2,3,5,6]
[2,3,5,6]
?> bar [0,1,1,1,2,4,5,6,7]
[1,1,1,2,4,5,6,7]
Hopefully that's enough examples, it should be fairly clear how it all
works. =)
Regarding @ vs @:, it would be fully possible to implement this just using @
and change its behavior depending on the context it appears in, much like we
do with identifiers bound without using the explicit @ operator. We feel
that doing so could lead to (even more) confusion regarding how variables
are bound, and have therefore chosen to introduce the extra @: operator to
make this differing behavior explicit. That identifiers bound without a @ or
@: have differing semantics depending on context is unfortunate but
unavoidable, and we feel that the added confusion is minor in this case.
As we said at the beginning we have implemented this as a preprocessor to
ordinary Haskell, but it still needs some more alpha-testing before we
release it... we need to catch the big fish first.
There are a number of open issues still, one is the choice of delimeter
token. We have chosen to delimit patterns with just a whitespace, but using
a comma might be better. The motivation for the current choice is that [/ a,
b*, c /] lends the feeling that each pattern is an element of sorts, when in
fact each pattern may represent any number of elements. Any input on this?
Another open issue is greedy vs. non-greedy matching of * and +. The current
implementation is greedy, but some voices were raised earlier on this list
that non-greedy matching would be better as default. We envision a system
where you can choose behavior by adding "flags" to the pattern, i.e. [g/
.... /] might indicate a greedy match etc.
Strings are a special syntactic case of a list, and we are planning a
special case of regular patterns for it, for instance [s/ "Hello " a* /]
would be equal to [/ 'H' 'e' 'l' 'l' 'o' ' ' a* /], but this is not yet
implemented.
Of course, unfortunately, there is also a downside.
Our preprocessor is implemented by translating regular patterns into code in
a parser monad. Due to the way that matching is done, matching will always
be "strict" in the sense that if you apply the function
foo [/ a* /] = a
to an infinite list, the call will loop forever. At this point we see no
reasonable way around this, there may be some special cases we can catch but
we see no overall solution to the problem.
Another drawback, at least in our implementation, is efficiency. Using an
external C engine for matching regular expression would far outperform our
system, no matter how efficient we make our backend. Thus this is not
intended to replace any existing functionality, rather we envision it as a
very high-level interface to matching regular expressions in the spirit of
String being [Char], where Text.Regex can be compared to using PackedString
for critical programs.
Perhaps by integrating regular patterns directly into a compiler like ghc we
could use ugly low-level behind-the-scenes tricks for doing the actual
matching in an external engine, thus greatly increasing efficiency. Any
thoughts? =)
We will supply the preprocessor for public scrutiny some time next week, we
just need to do some more alpha-testing first... Comment away!
regards,
Niklas Broberg, d00nibro[at]dtek.chalmers.se
Andreas Farre, d00farre[at]dtek.chalmers.se
Chalmers University of Technology
[1] Feel free to also visit the project that lead to this spin-off, Haskell
Server Pages
http://www.dtek.chalmers.se/~d00nibro/hsp/
_________________________________________________________________
Take off on a romantic weekend or a family adventure to these great U.S.
locations. http://special.msn.com/local/hotdestinations.armx
More information about the Haskell
mailing list