[Haskell] regular expression syntax - perl ain't got nothin on haskell
Per Larsson
per at L4i.se
Tue Feb 24 14:24:29 EST 2004
On Tuesday 24 February 2004 03.07, John Meacham wrote:
> Inspired by an idea by Andrew Pang and an old project of mine, I decided
> to fill out a reusable regular expression library which is similar to
> Perl's, but much more expressive.
> ...
Hi,
Thanks! I am grateful for your efforts because I have long missed some
typical text processing functionality in Haskell. (Besides a more complete
regular expression library, I think many Haskellers miss string constructors
such as an 'official' version of printf/format and maybe also some sort of
'here documents'.) Below are some of my thoughts on a complete
regex library, in the hope that they will be of some inspiration.
1. Replacement
A regex library must contain functions for replacement with regular
expressions. One could think this is trivial to implement given a match
function, but there are some tricky choices to be made regarding empty
matches (this also applies to splitting a string into fields with a regexp).
There are also questions about the interface of these functions.
In my own Text.Regex wrapper I have the following functions:
    data Match = Match {before :: String, after :: String, groups :: [String]}
    ...
    substWithPat  :: Regex -> String -> (Int -> Bool) -> String -> (String, Int)
    substWithFun  :: Regex -> (Match -> String) -> (Int -> Bool) ->
                     String -> (String, Int)
    substWithFunM :: (Monad m) => Regex -> (Match -> m String) ->
                     (Int -> Bool) -> String -> m (String, Int)
where a call to 'substWithPat pat rpat mode str' replaces matches
of 'pat' in 'str' by 'rpat' and returns the resulting string and the number of
replacements done. The replacement pattern 'rpat' can contain backreferences of
the form \m, where \m refers to the mth subgroup of the corresponding match
(\0 refers to the entire match). The call replaces only 'replaceable
matches'. A match m is replaceable if it is the nth match and (mode n) is true,
and m is either the first match, a proper (non-empty) match, or an empty match
succeeding an empty match. This scheme gives results that conform to the replace
functionality in several other regex libraries, e.g. in Tcl, Python and Perl.
For example, replacing matches of "_*" by "_" in "awk" gives "_a_w_k_", and
replacing matches of "_*" by "_" in "sed_and_awk" gives "_s_e_d_a_n_d_a_w_k_".
(Compare the discussion in 'Mastering Regular Expressions', O'Reilly, pages
187-188.) The functions substWithFun and substWithFunM are obvious variations
on the substWithPat function.
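The replaceability rule and the backreference expansion can be sketched
without a real regex engine. In the sketch below, Matcher, underscores,
substAll and expandRefs are hypothetical names, not part of Text.Regex or of
the wrapper described above. The 'mode' argument is left out for brevity, the
matcher merely stands in for the pattern "_*", and it is an assumption that
the group list starts with the entire match.

```haskell
-- A prefix matcher: tries to match at the start of the input and, on
-- success, returns (matched text, remaining input). It stands in for a
-- compiled regex in this sketch.
type Matcher = String -> Maybe (String, String)

-- Hypothetical matcher for the pattern "_*": it always succeeds,
-- possibly with an empty match.
underscores :: Matcher
underscores s = Just (span (== '_') s)

-- Replace every replaceable match by 'rep'; return the new string and
-- the number of replacements. An empty match is skipped only when it
-- directly follows a proper (non-empty) match.
substAll :: Matcher -> String -> String -> (String, Int)
substAll match rep = go False
  where
    -- afterProper: the previous match was non-empty and ended right here
    go afterProper s = case match s of
      Just (m, rest)
        | not (null m) ->
            let (out, n) = go True rest in (rep ++ out, n + 1)
        | afterProper -> step s
        | otherwise ->
            let (out, n) = step s in (rep ++ out, n + 1)
      Nothing -> step s
    -- copy one character unchanged and continue behind it
    step []       = ([], 0)
    step (c : cs) = let (out, n) = go False cs in (c : out, n)

-- Expand backreferences \0 .. \9 in a replacement pattern.
-- Assumption: index 0 of the group list is the entire match;
-- references to missing groups expand to the empty string.
expandRefs :: [String] -> String -> String
expandRefs gs ('\\' : d : cs)
  | d `elem` ['0' .. '9'] = group (fromEnum d - fromEnum '0') ++ expandRefs gs cs
  where
    group i = case drop i gs of { (g : _) -> g; [] -> "" }
expandRefs gs (c : cs) = c : expandRefs gs cs
expandRefs _ []        = []
```

Loaded into GHCi, 'substAll underscores "_" "awk"' evaluates to
("_a_w_k_",4), in line with the first example above.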
2. Constructing regular expressions.
There is the well known problem that the backslash is used both as a string
escape character and a regexp operator. I know of three approaches to the
problem:
a) Bite the bullet and write regexps like "\\\\" in order to match a
single backslash (as in Emacs Lisp).
b) Use a language extension for 'raw' strings where the backslash is not
interpreted (e.g. /regex/ in awk, r"regex" in Python and {regex} in Tcl).
c) Use an operator other than the backslash in regular expressions. This
has the benefit of not requiring a language extension, but the drawback of
being nonstandard.
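Approaches (a) and (c) can be illustrated in plain Haskell. This is only a
sketch: fromAlt and the choice of '%' as the alternative escape character
are assumptions made for illustration and are not taken from any existing
library.

```haskell
-- Approach (a): the Haskell literal below denotes TWO backslash
-- characters, which a regex engine then reads as "match one backslash".
matchOneBackslash :: String
matchOneBackslash = "\\\\"

-- Approach (c): write the pattern with '%' instead of the backslash and
-- translate it before handing it to the regex engine.
-- "%%" stands for a literal percent sign.
fromAlt :: String -> String
fromAlt ('%' : '%' : cs) = '%' : fromAlt cs
fromAlt ('%' : cs)       = '\\' : fromAlt cs
fromAlt (c : cs)         = c : fromAlt cs
fromAlt []               = []
```

With this helper, 'fromAlt "%d+%.%d*"' yields the same string as the literal
"\\d+\\.\\d*", but the source text needs no doubled backslashes.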
There is also the problem of inserting string values into regular expressions.
Appending with ++ is not particularly convenient for complicated regexps
because the result can be rather unreadable. I suppose we have to wait for a
standard implementation of printf in Template Haskell to solve this problem.
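For comparison, here is a sketch of how a printf-style function keeps the
shape of the regexp visible when a string value is inserted. It uses printf
from the Text.Printf module of the base library; quoteMeta and keyValuePat
are hypothetical helpers invented for this example, and the set of
metacharacters escaped is an assumption that depends on the regex dialect.

```haskell
import Text.Printf (printf)

-- Hypothetical helper: backslash-escape characters that commonly have a
-- special meaning in regexps, so the inserted value matches literally.
quoteMeta :: String -> String
quoteMeta = concatMap esc
  where
    esc c
      | c `elem` "\\.^$|?*+()[]{}" = ['\\', c]
      | otherwise                  = [c]

-- Build a regexp for lines of the form "key = value". The format string
-- shows the whole pattern at a glance; only %s is interpreted by printf.
keyValuePat :: String -> String
keyValuePat key = printf "^%s\\s*=\\s*(.*)$" (quoteMeta key)

-- The same pattern assembled with ++, for comparison.
keyValuePat' :: String -> String
keyValuePat' key = "^" ++ quoteMeta key ++ "\\s*=\\s*(.*)$"
```

The gain is small with a single inserted value, but it grows as more values
are spliced into one pattern.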
Cheers
Per