Regular Patterns (RE: [Haskell] regular expression syntax)

Fri Feb 27 14:04:46 EST 2004

>Subject: [Haskell] regular expression syntax - perl ain't got nothin on 
>haskell

Agreed, Perl certainly ain't got nothing on Haskell, but we could go even 
further than just imitating (although better than the original) Perl 
functionality. =)
Hence:

As a spin-off of another project [1], we found ourselves in need of a nice 
way of expressing regular expression matching in Haskell, and not only for 
strings. We came up with the idea of HaRP: Haskell Regular Patterns. We have 
implemented this as a pre-processor to ordinary Haskell and it works quite 
well. However it is something we are still working on, and if it hadn't been 
for this discussion on the list we would probably have waited a few weeks 
more before presenting it, but here goes:

Simple pattern matching on concrete, fully specified lists can be done in 
Haskell as so:
foo [Foo, Bar, Baz] = ...

We propose an extension of this to regular pattern matching, example:
foo [/ Foo Bar* Baz /] = ...

The intuition of the above is that we get a match for any list that starts 
with a Foo, ends with a Baz, and has zero or more Bar in between.

Regular patterns that can be used are:
* - match zero or more
+ - match one or more
? - match zero or one
a | b - match a  or b
(/ a b /) - match the sequence of a, then b (this is also implicit in the 
top level [/ ... /]).

Applying the regular expressions on patterns gives some extra nice features. 
One is that the regular patterns are "type safe", i.e. they are not encoded 
in strings. Another is that identifiers can be named and bound inside 
regular patterns, examples:
foo [/ _* a /] = ... => a is bound to the last element of the list
foo [/ a@(/ _ _ /) _* /] = ... => a is bound to the list containing the 
first two elements
foo [/ (/ a _ /)* /] = ... => a is bound to the list of the first, third, 
fifth etc elements of a list of even length

Note that since we introduce bindings of identifiers inside what we call 
numerating patterns, i.e. patterns that imply an uncertain number of 
matches, identifiers are automatically bound to lists inside such patterns. 
We also introduce an explicit list-binding operator, @:, for accumulating 
items in lists as so:
foo [/ (_ a@:(/ _ _ /))* /] = ... => a is bound to a list of lists (exactly 
what the elements will be is left as an exercise to the reader ;)

The types of sub-patterns are as follows (a :: a, b :: b):
a* => [a]
a+ => [a]
a? => Maybe a
(/ ... /) => [e],  where e is the type of the elements in the list matched, 
regardless of sub-patterns
( a | b ) => Either a b

As more complete example using all the presented features:

foo [/ _ a at 1 b c at 3* 4+ d at 5? e@(/ f@:6 g /)* h@( 8 | (/ 9 i /) )  /] = 
(a,b,c,d,e,f,g,h,i)

Assuming all the numerical literals are of type Int, foo will have the 
following type:

foo :: [Int] -> (Int, Int, [Int], Maybe Int, [[Int]], [Int], [Int], Either 
Int [Int], [Int])

Examples of applying foo to some lists:

?> foo [0,1,2,3,4,5,6,7,8]
(1, 2, [3], Just 5, [[6,7]], [6], [7], Left 8, [])

?> foo [0,1,2,3,3,3,4,6,0,6,1,6,2,9,10]
(1,2,[3,3,3], Nothing, [[6,0],[6,1],[6,2]], [6,6,6], [0,1,2], Right [9,10], 
[10])

Discussion of each variable in detail:
a :: Int - a binds to a single element at top level (top level meaning it is 
bound outside any numerating pattern).
b :: Int - b binds to a single element at top level.
c :: [Int] - c is bound to a zero-or-many pattern, and it will contain all 
the matches of the sub-pattern, in this case all matches of 3.
d :: Maybe Int - d is bound to a zero-or-one pattern, and it will be Nothing 
in case of zero matches, and Just the match to the sub-pattern in case of a 
match, in this case 5.
e :: [[Int]] - e is bound to a zero-or-more pattern, and will thus contain a 
list of all the matches of the sub-pattern. In this case the sub-pattern is 
a sequence, which has a list type, so the type of e is a list of lists.
f :: [Int] - f is bound using the list-binding operator @:, so its type will 
always be a list of the type of the sub-pattern, regardless of the context 
it appears in. It will contain all matches of the sub-pattern (Note that a 
normal bind using @ would have been illegal here). At top level (and in 
ordinary pattern matching), the pattern foo is equivalent to foo at _, but 
inside numerating patterns the pattern foo is equivalent to foo@:_. (see 
discussion below)
g :: [Int] - g is equivalent to g@:_ as mentioned above, so the same will 
hold for g as for f.
h :: Either Int [Int] - h is bound to a choice pattern (or union pattern if 
you prefer), so it will be bound to the match of one of the two 
sub-patterns, annotated with Left or Right. In this case the left 
sub-pattern matches a single element of type Int, whereas the right 
sub-pattern matches a sequence of type [Int].
i :: [Int] - Since the choice pattern is numerating (each of the 
sub-patterns are matched zero or one times), i is equivalent to i@:_.

For completeness, another example to show how sequences work:
bar :: [Int] -> [Int]
bar [/ 0 a@(/ 1* 2 (3|4) (/ 5 6 /) 7? /) /] = a

In this case a will have the type [Int], since a sequence will always have 
the type [e] where e is the type of the elements of the list to match. So in 
this example,

?> bar [0,2,3,5,6]
[2,3,5,6]
?> bar [0,1,1,1,2,4,5,6,7]
[1,1,1,2,4,5,6,7]

Hopefully that's enough examples, it should be fairly clear how it all 
works. =)

Regarding @ vs @:, it would be fully possible to implement this just using @ 
and change its behavior depending on the context it appears in, much like we 
do with identifiers bound without using the explicit @ operator. We feel 
that doing so could lead to (even more) confusion regarding how variables 
are bound, and have therefore chosen to introduce the extra @: operator to 
make this differing behavior explicit. That identifiers bound without a @ or 
@: have differing semantics depending on context is unfortunate but 
unavoidable, and we feel that the added confusion is minor in this case.

As we said at the beginning we have implemented this as a preprocessor to 
ordinary Haskell, but it still needs some more alpha-testing before we 
release it... we need to catch the big fish first.

There are a number of open issues still, one is the choice of delimeter 
token. We have chosen to delimit patterns with just a whitespace, but using 
a comma might be better. The motivation for the current choice is that [/ a, 
b*, c /] lends the feeling that each pattern is an element of sorts, when in 
fact each pattern may represent any number of elements. Any input on this?

Another open issue is greedy vs. non-greedy matching of * and +. The current 
implementation is greedy, but some voices were raised earlier on this list 
that non-greedy matching would be better as default. We envision a system 
where you can choose behavior by adding "flags" to the pattern, i.e. [g/ 
.... /] might indicate a greedy match etc.

Strings are a special syntactic case of a list, and we are planning a 
special case of regular patterns for it, for instance [s/ "Hello " a* /] 
would be equal to [/ 'H' 'e' 'l' 'l' 'o' ' ' a* /], but this is not yet 
implemented.

Of course, unfortunately, there is also a downside.
Our preprocessor is implemented by translating regular patterns into code in 
a parser monad. Due to the way that matching is done, matching will always 
be "strict" in the sense that if you apply the function

foo [/ a* /] = a

to an infinite list, the call will loop forever. At this point we see no 
reasonable way around this, there may be some special cases we can catch but 
we see no overall solution to the problem.

Another drawback, at least in our implementation, is efficiency. Using an 
external C engine for matching regular expression would far outperform our 
system, no matter how efficient we make our backend. Thus this is not 
intended to replace any existing functionality, rather we envision it as a 
very high-level interface to matching regular expressions in the spirit of 
String being [Char], where Text.Regex can be compared to using PackedString 
for critical programs.

Perhaps by integrating regular patterns directly into a compiler like ghc we 
could use ugly low-level behind-the-scenes tricks for doing the actual 
matching in an external engine, thus greatly increasing efficiency. Any 
thoughts? =)

We will supply the preprocessor for public scrutiny some time next week, we 
just need to do some more alpha-testing first... Comment away!

regards,

Niklas Broberg, d00nibro[at]dtek.chalmers.se
Andreas Farre, d00farre[at]dtek.chalmers.se
Chalmers University of Technology

[1] Feel free to also visit the project that lead to this spin-off, Haskell 
Server Pages
http://www.dtek.chalmers.se/~d00nibro/hsp/

_________________________________________________________________
Take off on a romantic weekend or a family adventure to these great U.S. 
locations. http://special.msn.com/local/hotdestinations.armx