[Haskell] ANNOUNCE: HaRP (Haskell Regular Patterns) version 0.1

Sat May 15 13:08:53 EDT 2004

A wise man once said, release early and release often. We're obviously not 
very wise... ;)

	===============
	 Announcing HaRP 0.1
	===============

HaRP is a Haskell extension that extends the normal pattern matching 
facility with the power of regular expressions. This expressive power is 
highly useful in a wide range of areas, including text parsing and XML 
processing [1]. Regular expression patterns in HaRP work over ordinary 
Haskell lists ([]) of arbitrary type.
We have implemented HaRP as a pre-processor to ordinary Haskell.

Where to get it:
HaRP can be downloaded from http://www.dtek.chalmers.se/~d00nibro/harp/ 
along with notes on installation and usage.

Description:

Simple pattern matching on concrete, fully specified lists can be done in
Haskell like so:
foo [Foo, Bar, Baz] = ...

We extend this with regular pattern matching. An example:
foo [/ Foo Bar* Baz /] = ...

The intuition of the above is that we get a match for any list that starts 
with a Foo, ends with a Baz, and has zero or more Bar in between. If you 
have used regular expressions in any other language, this should not be new 
to you.
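
For comparison, here is roughly what the same match looks like when written
by hand in plain Haskell, without the pre-processor. This is only a sketch:
the Token type and the name matchesFooBarsBaz are made up for illustration
and are not part of HaRP.

```haskell
-- Hypothetical element type for the example lists.
data Token = Foo | Bar | Baz
  deriving (Eq, Show)

-- Hand-written counterpart of:  foo [/ Foo Bar* Baz /] = ...
-- True when the list is a Foo, then zero or more Bar, then exactly one Baz.
matchesFooBarsBaz :: [Token] -> Bool
matchesFooBarsBaz (Foo : rest) =
  case dropWhile (== Bar) rest of
    [Baz] -> True
    _     -> False
matchesFooBarsBaz _ = False
```

Even for this small pattern the hand-written version hides the shape of the
list being matched, which is what the regular pattern makes visible.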

Regular patterns that can be used are:
* - match zero or more
+ - match one or more
? - match zero or one
a | b - match a or b
(/ a b /) - match the sequence of a, then b (this is also implicit in the 
top level [/ ... /]).

For the first three, there is also the option of adding a ? afterwards to 
make the match non-greedy (the default is greedy). This means that p* tries 
to match p as many times as possible while still satisfying the whole 
pattern, whereas p*? tries to match p as few times as possible.
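
One way to picture the difference is as a search order over the possible
split points of the list. The sketch below is a plain Haskell model of that
idea, not how HaRP is implemented; splits, firstMatch, starGreedy and
starLazy are names invented here. The second argument stands for "the rest
of the pattern".

```haskell
import Data.List (inits, tails)
import Data.Maybe (listToMaybe)

-- Every way to cut a list into (prefix, suffix), shortest prefix first.
splits :: [a] -> [([a], [a])]
splits xs = zip (inits xs) (tails xs)

-- First candidate whose prefix consists only of elements satisfying p
-- and whose suffix is accepted by the rest of the pattern, k.
firstMatch :: (a -> Bool) -> ([a] -> Bool) -> [([a], [a])] -> Maybe ([a], [a])
firstMatch p k cands =
  listToMaybe [ (pre, post) | (pre, post) <- cands, all p pre, k post ]

-- Model of p*  : try the longest prefix first.
-- Model of p*? : try the shortest prefix first.
starGreedy, starLazy :: (a -> Bool) -> ([a] -> Bool) -> [a] -> Maybe ([a], [a])
starGreedy p k = firstMatch p k . reverse . splits
starLazy   p k = firstMatch p k . splits
```

With p = (== 'a') and a rest-of-pattern that accepts anything, the greedy
model applied to "aaab" yields Just ("aaa","b") and the non-greedy one
yields Just ("","aaab"). If the rest of the pattern insists on (== "ab"),
both yield Just ("aa","ab"), since the whole pattern must still be satisfied.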

Introducing regular expressions into the pattern matching facility gives 
some extra nice features. One is that the regular patterns are "type safe", 
i.e. they are not encoded in strings. Another is that identifiers can be 
named and bound inside regular patterns, examples:
foo [/ _* a /] = ... => a is bound to the last element of the list
foo [/ a@(/ _ _ /) _* /] = ... => a is bound to the list containing the 
first two elements
foo [/ (/ a _ /)* /] = ... => a is bound to the list of the first, third, 
fifth etc elements of a list of even length
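
To make the three bindings concrete, here are hand-written plain Haskell
functions that compute what a would be bound to in each case. They are a
sketch only: the function names are invented, and Nothing stands in for a
failed pattern match.

```haskell
-- foo [/ _* a /]: a is the last element; fails on the empty list.
lastElem :: [a] -> Maybe a
lastElem [] = Nothing
lastElem xs = Just (last xs)

-- foo [/ a@(/ _ _ /) _* /]: a is the list of the first two elements.
firstTwo :: [a] -> Maybe [a]
firstTwo (x : y : _) = Just [x, y]
firstTwo _           = Nothing

-- foo [/ (/ a _ /)* /]: a collects the first element of every pair;
-- fails when the list has odd length.
everyOther :: [a] -> Maybe [a]
everyOther []             = Just []
everyOther (x : _ : rest) = (x :) <$> everyOther rest
everyOther _              = Nothing
```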

Note that binding variables implicitly (i.e. without using @) is context 
dependent in regular patterns. This is because for some variables appearing 
in certain contexts, we cannot know the number of times that particular 
variable will be matched. Looking at the last example above, we see that the 
variable a appears inside the context of a *, meaning it can be matched zero 
or more times.
A variable bound in such a context will contain the list of all values 
matched to it, whereas a normal linear variable is bound to exactly the 
value it matches, like the a in the first example above.
Patterns that introduce non-linear contexts are *, +, ? (and the non-greedy 
versions), and | (union).

For explicitly bound variables (i.e. variables bound using @) we must also 
look at types of matched sub-patterns. In the example

foo [/ a@(/ _ _ /) _* /] = ... => a is bound to the list containing the 
first two elements

we clearly see that the sequencing sub-pattern has a list type.

The types of sub-patterns are as follows (a :: a, b :: b):
a* => [a]
a+ => [a]
a? => Maybe a
(/ ... /) => [e],  where e is the type of the elements in the list matched, 
regardless of sub-patterns
( a | b ) => Either a b
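
The typing table can be mimicked with a few combinators in plain Haskell.
The following is a hypothetical model written for this illustration (the
Matcher type and the names lit, star, plus, opt, alt and matchAll are all
invented); it says nothing about how the pre-processor actually translates
patterns, and it leaves out the sequence row.

```haskell
import Data.Maybe (listToMaybe)

-- A matcher consumes a prefix of the input and yields every possible
-- (result, remaining input) pair, preferred alternatives first.
newtype Matcher e r = Matcher { runMatcher :: [e] -> [(r, [e])] }

-- Match one element equal to x; the result has the element type.
lit :: Eq e => e -> Matcher e e
lit x = Matcher step
  where
    step (y : ys) | y == x = [(y, ys)]
    step _                 = []

-- p* : zero or more, longer matches listed first. Result type: [r].
star :: Matcher e r -> Matcher e [r]
star p = Matcher go
  where
    go xs =
      [ (r : rs, rest')
      | (r, rest) <- runMatcher p xs
      , length rest < length xs        -- require progress so we terminate
      , (rs, rest') <- go rest
      ] ++ [([], xs)]

-- p+ : one or more. Result type: [r].
plus :: Matcher e r -> Matcher e [r]
plus p = Matcher $ \xs ->
  [ (r : rs, rest')
  | (r, rest) <- runMatcher p xs
  , (rs, rest') <- runMatcher (star p) rest
  ]

-- p? : zero or one. Result type: Maybe r.
opt :: Matcher e r -> Matcher e (Maybe r)
opt p = Matcher $ \xs ->
  [ (Just r, rest) | (r, rest) <- runMatcher p xs ] ++ [(Nothing, xs)]

-- ( p | q ) : choice. Result type: Either r s.
alt :: Matcher e r -> Matcher e s -> Matcher e (Either r s)
alt p q = Matcher $ \xs ->
  [ (Left r, rest) | (r, rest) <- runMatcher p xs ] ++
  [ (Right s, rest) | (s, rest) <- runMatcher q xs ]

-- Succeed only if the whole input is consumed, as [/ ... /] requires.
matchAll :: Matcher e r -> [e] -> Maybe r
matchAll m xs = listToMaybe [ r | (r, []) <- runMatcher m xs ]
```

In this model, matchAll (alt (lit 8) (plus (lit 9))) [9,9] evaluates to
Just (Right [9,9]), mirroring the Either row of the table.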

We also introduce an explicit binding operator for non-linear bindings, 
called @: (read "as-cons" or "accumulating as"), which adds each match of 
its associated pattern to a list of matches.
An example:

foo [/ (/ _ a@:(/ _ _ /) /)* /] = ... => a is bound to a list of lists (exactly 
what the elements will be is left as an exercise to the reader ;)

A more complete example using all the presented features:

foo [/ _ a@1 b c@3* 4+ d@5? e@(/ f@:6 g /)* h@( 8 | (/ 9 i /) )  /] = 
(a,b,c,d,e,f,g,h,i)

Assuming all the numerical literals are of type Int, foo will have the 
following type:

foo :: [Int] -> (Int, Int, [Int], Maybe Int, [[Int]], [Int], [Int], Either 
Int [Int], [Int])

Examples of applying foo to some lists:
(NOTE: Show is generally not defined for tuples this large, so to test 
these examples you need a workaround: either define a Show instance yourself 
or nest the tuples so that each is no larger than what can be shown.)

?> foo [0,1,2,3,4,5,6,7,8]
(1, 2, [3], Just 5, [[6,7]], [6], [7], Left 8, [])

?> foo [0,1,2,3,3,3,4,6,0,6,1,6,2,9,10]
(1,2,[3,3,3], Nothing, [[6,0],[6,1],[6,2]], [6,6,6], [0,1,2], Right [9,10], 
[10])
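
The tuple-nesting workaround from the note above could look like this. It is
a sketch: FooResult and nest are names made up here, and the grouping into
three smaller tuples is an arbitrary choice.

```haskell
-- The result type of foo, given a name for convenience.
type FooResult =
  (Int, Int, [Int], Maybe Int, [[Int]], [Int], [Int], Either Int [Int], [Int])

-- Regroup the nine components into three smaller tuples, each of which
-- has a Show instance in any Haskell 98 implementation.
nest :: FooResult
     -> ((Int, Int, [Int], Maybe Int), ([[Int]], [Int], [Int]), (Either Int [Int], [Int]))
nest (a, b, c, d, e, f, g, h, i) = ((a, b, c, d), (e, f, g), (h, i))
```

At the prompt one would then evaluate nest (foo [0,1,2,3,4,5,6,7,8]) instead
of calling foo directly.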

Discussion of each variable in detail:
a :: Int - a binds to a single element at top level (top level meaning it is 
bound outside any numerating pattern).
b :: Int - b binds to a single element at top level.
c :: [Int] - c is bound to a zero-or-many pattern, and it will contain all 
the matches of the sub-pattern, in this case all matches of 3.
d :: Maybe Int - d is bound to a zero-or-one pattern, and it will be Nothing 
in case of zero matches, and Just the match to the sub-pattern in case of a 
match, in this case 5.
e :: [[Int]] - e is bound to a zero-or-more pattern, and will thus contain a 
list of all the matches of the sub-pattern. In this case the sub-pattern is 
a sequence, which has a list type, so the type of e is a list of lists.
f :: [Int] - f is bound using the list-binding operator @:, so its type will 
always be a list of the type of the sub-pattern, regardless of the context 
it appears in. It will contain all matches of the sub-pattern (Note that a 
normal bind using @ would have been illegal here). At top level (and in 
ordinary pattern matching), the pattern foo is equivalent to foo@_, but 
inside numerating patterns the pattern foo is equivalent to foo@:_. (see 
discussion below)
g :: [Int] - g is equivalent to g@:_ as mentioned above, so the same will 
hold for g as for f.
h :: Either Int [Int] - h is bound to a choice pattern (or union pattern if 
you prefer), so it will be bound to the match of one of the two 
sub-patterns, annotated with Left or Right. In this case the left 
sub-pattern matches a single element of type Int, whereas the right 
sub-pattern matches a sequence of type [Int].
i :: [Int] - Since the choice pattern is numerating (each of the 
sub-patterns is matched zero or one times), i is equivalent to i@:_.

For completeness, another example to show how sequences work:
bar :: [Int] -> [Int]
bar [/ 0 a@(/ 1* 2 (3|4) (/ 5 6 /) 7? /) /] = a

In this case a will have the type [Int], since a sequence will always have 
the type [e] where e is the type of the elements of the list to match. So in 
this example,

?> bar [0,2,3,5,6]
[2,3,5,6]
?> bar [0,1,1,1,2,4,5,6,7]
[1,1,1,2,4,5,6,7]
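
For readers without the pre-processor at hand, a hand-written plain Haskell
approximation of bar might look as follows. This is a sketch under our own
naming (barPlain); it returns Nothing where the regular pattern would fail
to match, and it relies on pattern guards.

```haskell
-- Plain Haskell approximation of:
--   bar [/ 0 a@(/ 1* 2 (3|4) (/ 5 6 /) 7? /) /] = a
barPlain :: [Int] -> Maybe [Int]
barPlain (0 : rest)
  | (2 : c : 5 : 6 : tl) <- dropWhile (== 1) rest  -- 1* 2 _ 5 6
  , c == 3 || c == 4                               -- (3|4)
  , null tl || tl == [7]                           -- 7?
  = Just rest                                      -- a is everything after the 0
barPlain _ = Nothing
```

Here barPlain [0,2,3,5,6] gives Just [2,3,5,6], matching the first session
line above.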

A slightly more useful, real-life example:
Assume a config file (or the like) of the following form:

option-name : option-value
option-name : option-value
...

Parsing this into name-value pairs can be done like so:

parseConf :: String -> [(String, String)]
parseConf str =
  let [/ (/ names*? ' '* ':' ' '* vals*? '\n' /)* /] = str
   in zip names vals
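
For comparison, roughly the same parsing in plain Haskell. This is a sketch
with an invented name (parseConfPlain), and it is more lenient than the
pattern: it skips lines without a colon and does not insist on a final
newline, where the regular pattern would simply fail to match.

```haskell
-- Split each "name : value" line at its first colon, dropping the
-- spaces directly around the colon.
parseConfPlain :: String -> [(String, String)]
parseConfPlain str = [ splitLine l | l <- lines str, ':' `elem` l ]
  where
    splitLine l =
      let (name, rest) = break (== ':') l
      in (trimEnd name, dropWhile (== ' ') (drop 1 rest))
    -- Remove trailing spaces from the option name.
    trimEnd = reverse . dropWhile (== ' ') . reverse
```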

Hopefully that's enough examples, it should be fairly clear how it all 
works. =)

Regarding @ vs @:, it would be fully possible to implement this just using @ 
and change its behavior depending on the context it appears in, much like we 
do with identifiers bound without using the explicit @ operator. We feel 
that doing so could lead to (even more) confusion regarding how variables 
are bound, and have therefore chosen to introduce the extra @: operator to 
make this differing behavior explicit. That identifiers bound without a @ or 
@: have differing semantics depending on context is unfortunate but 
unavoidable, and we feel that the added confusion is minor in this case.

Open issues:

* Greedy vs. non-greedy matching:
The current implementation is greedy by default, but some voices have been 
raised (on this list) that non-greedy matching would be better as default. 
After some initial use of the system we have also come to find that we tend 
to use non-greedy patterns far more often than their greedy counterparts. 
Unless we hear some convincing arguments not to, it is very likely that our 
next release will have non-greedy patterns as the default.

* Strings:
Strings are a special syntactic case of a list, and we are planning an 
analogous special case of regular patterns for it, for instance [s/ "Hello " 
a* /] would be equal to [/ 'H' 'e' 'l' 'l' 'o' ' ' a* /], but this is not 
yet implemented.
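
Until then, the effect of a literal string prefix can be had in plain
Haskell with stripPrefix from Data.List. A small sketch (greetingRest is a
name made up for this example) of what [s/ "Hello " a* /] is meant to
express:

```haskell
import Data.List (stripPrefix)

-- Just the remainder of the string when it starts with "Hello ",
-- Nothing otherwise.
greetingRest :: String -> Maybe String
greetingRest = stripPrefix "Hello "
```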

Any and all comments are welcome and appreciated,

Niklas Broberg, d00nibro[at]dtek.chalmers.se
Andreas Farre, d00farre[at]dtek.chalmers.se
Chalmers University of Technology

[1] XML processing is actually what we need these regular patterns for. Feel 
free to visit the project that led to this spin-off, Haskell Server Pages, 
at http://www.dtek.chalmers.se/~d00nibro/hsp/
