[Haskell] ANNOUNCE: HaRP (Haskell Regular Patterns) version 0.1
Niklas Broberg
n_broberg at hotmail.com
Sat May 15 13:08:53 EDT 2004
A wise man once said, release early and release often. We're obviously not
very wise... ;)
===============
Announcing HaRP 0.1
===============
HaRP is a Haskell extension that extends the normal pattern matching
facility with the power of regular expressions. This expressive power is
highly useful in a wide range of areas, including text parsing and XML
processing [1]. Regular expression patterns in HaRP work over ordinary
Haskell lists ([]) of arbitrary type.
We have implemented HaRP as a pre-processor to ordinary Haskell.
Where to get it:
HaRP can be downloaded from http://www.dtek.chalmers.se/~d00nibro/harp/
along with notes on installation and usage.
Description:
Simple pattern matching on concrete, fully specified lists can be done in
Haskell as so:
foo [Foo, Bar, Baz] = ...
We add an extension of this, regular pattern matching, an example:
foo [/ Foo Bar* Baz /] = ...
The intuition of the above is that we get a match for any list that starts
with a Foo, ends with a Baz, and has zero or more Bar in between. If you
have used regular expressions in any other language, this should not be new
to you.
Regular patterns that can be used are:
* - match zero or more
+ - match one or more
? - match zero or one
a | b - match a or b
(/ a b /) - match the sequence of a, then b (this is also implicit in the
top level [/ ... /]).
For the three first, there is also the option of adding a ? afterwords to
make the match non-greedy (the default is greedy). This means that p* tries
to match p as many times as possible while still satisfying the whole
pattern, whereas p*? tries to match p as few times as possible.
Introducing regular expressions into the pattern matching facility gives
some extra nice features. One is that the regular patterns are "type safe",
i.e. they are not encoded in strings. Another is that identifiers can be
named and bound inside regular patterns, examples:
foo [/ _* a /] = ... => a is bound to the last element of the list
foo [/ a@(/ _ _ /) _* /] = ... => a is bound to the list containing the
first two elements
foo [/ (/ a _ /)* /] = ... => a is bound to the list of the first, third,
fifth etc elements of a list of even length
Note that binding variables implicitly (i.e. without using @) is context
dependent in regular patterns. This is because for some variables appearing
in certain contexts, we cannot know the number of times that particular
variable will be matched. Looking at the last example above, we see that the
varible a appears inside the context of a *, meaning it can be matched zero
or more times.
A variable bound in such a context will contain the list of all values
matched to it, whereas a normal linear variable is bound to exactly the
value it matches, like the a in the first example above.
Patterns that introduce non-linear contexts are *, +, ? (and the non-greedy
versions), and | (union).
For explicitly bound variables (i.e. variables bound using @) we must also
look at types of matched sub-patterns. In the example
foo [/ a@(/ _ _ /) _* /] = ... => a is bound to the list containing the
first two elements
we clearly see that the sequencing sub-pattern has a list type.
The types of sub-patterns are as follows (a :: a, b :: b):
a* => [a]
a+ => [a]
a? => Maybe a
(/ ... /) => [e], where e is the type of the elements in the list matched,
regardless of sub-patterns
( a | b ) => Either a b
We also introduce an explicit binding operator for non-linear bindings,
called @: (read "as-cons" or "accumulating as"), which adds each match of
its associated pattern to a list of matches.
An example:
foo [/ (_ a@:(/ _ _ /))* /] = ... => a is bound to a list of lists (exactly
what the elements will be is left as an exercise to the reader ;)
A more complete example using all the presented features:
foo [/ _ a at 1 b c at 3* 4+ d at 5? e@(/ f@:6 g /)* h@( 8 | (/ 9 i /) ) /] =
(a,b,c,d,e,f,g,h,i)
Assuming all the numerical literals are of type Int, foo will have the
following type:
foo :: [Int] -> (Int, Int, [Int], Maybe Int, [[Int]], [Int], [Int], Either
Int [Int], [Int])
Examples of applying foo to some lists:
(NOTE, show is generally not defined over tuples this large, so to test
these examples you need to do some trick, either define an instance for show
or simply nest the tuples so that each is no larger than what can be shown)
?> foo [0,1,2,3,4,5,6,7,8]
(1, 2, [3], Just 5, [[6,7]], [6], [7], Left 8, [])
?> foo [0,1,2,3,3,3,4,6,0,6,1,6,2,9,10]
(1,2,[3,3,3], Nothing, [[6,0],[6,1],[6,2]], [6,6,6], [0,1,2], Right [9,10],
[10])
Discussion of each variable in detail:
a :: Int - a binds to a single element at top level (top level meaning it is
bound outside any numerating pattern).
b :: Int - b binds to a single element at top level.
c :: [Int] - c is bound to a zero-or-many pattern, and it will contain all
the matches of the sub-pattern, in this case all matches of 3.
d :: Maybe Int - d is bound to a zero-or-one pattern, and it will be Nothing
in case of zero matches, and Just the match to the sub-pattern in case of a
match, in this case 5.
e :: [[Int]] - e is bound to a zero-or-more pattern, and will thus contain a
list of all the matches of the sub-pattern. In this case the sub-pattern is
a sequence, which has a list type, so the type of e is a list of lists.
f :: [Int] - f is bound using the list-binding operator @:, so its type will
always be a list of the type of the sub-pattern, regardless of the context
it appears in. It will contain all matches of the sub-pattern (Note that a
normal bind using @ would have been illegal here). At top level (and in
ordinary pattern matching), the pattern foo is equivalent to foo at _, but
inside numerating patterns the pattern foo is equivalent to foo@:_. (see
discussion below)
g :: [Int] - g is equivalent to g@:_ as mentioned above, so the same will
hold for g as for f.
h :: Either Int [Int] - h is bound to a choice pattern (or union pattern if
you prefer), so it will be bound to the match of one of the two
sub-patterns, annotated with Left or Right. In this case the left
sub-pattern matches a single element of type Int, whereas the right
sub-pattern matches a sequence of type [Int].
i :: [Int] - Since the choice pattern is numerating (each of the
sub-patterns are matched zero or one times), i is equivalent to i@:_.
For completeness, another example to show how sequences work:
bar :: [Int] -> [Int]
bar [/ 0 a@(/ 1* 2 (3|4) (/ 5 6 /) 7? /) /] = a
In this case a will have the type [Int], since a sequence will always have
the type [e] where e is the type of the elements of the list to match. So in
this example,
?> bar [0,2,3,5,6]
[2,3,5,6]
?> bar [0,1,1,1,2,4,5,6,7]
[1,1,1,2,4,5,6,7]
A slightly more useful, real-life example:
Assume a config file (or the like) of the following form:
option-name : option-value
option-name : option-value
...
Parsing this into name-value pairs can be done like so:
parseConf :: String -> [(String, String)]
parseConf str =
let [/ (/ names*? ' '* ':' ' '* vals*? '\n' /)* /] = str
in zip names vals
Hopefully that's enough examples, it should be fairly clear how it all
works. =)
Regarding @ vs @:, it would be fully possible to implement this just using @
and change its behavior depending on the context it appears in, much like we
do with identifiers bound without using the explicit @ operator. We feel
that doing so could lead to (even more) confusion regarding how variables
are bound, and have therefore chosen to introduce the extra @: operator to
make this differing behavior explicit. That identifiers bound without a @ or
@: have differing semantics depending on context is unfortunate but
unavoidable, and we feel that the added confusion is minor in this case.
Open issues:
* Greedy vs. non-greedy matching:
The current implementation is greedy by default, but some voices have been
raised (on this list) that non-greedy matching would be better as default.
After some initial use of the system we have also come to find that we tend
to use non-greedy patterns far more often than their greedy counterparts.
Unless we hear some convincing arguments not to, it is very likely that our
next release will have non-greedy patterns as the default.
* Strings:
Strings are a special syntactic case of a list, and we are planning an
analogous special case of regular patterns for it, for instance [s/ "Hello "
a* /] would be equal to [/ 'H' 'e' 'l' 'l' 'o' ' ' a* /], but this is not
yet implemented.
Any and all comments are welcome and appreciated,
Niklas Broberg, d00nibro[at]dtek.chalmers.se
Andreas Farre, d00farre[at]dtek.chalmers.se
Chalmers University of Technology
[1] XML processing is actually what we need these regular patterns for. Feel
free to visit the project that lead to this spin-off, Haskell Server Pages,
at http://www.dtek.chalmers.se/~d00nibro/hsp/
_________________________________________________________________
The new MSN 8: advanced junk mail protection and 2 months FREE*
http://join.msn.com/?page=features/junkmail
More information about the Haskell
mailing list