[Haskell] regular expression syntax - perl ain't got nothin on haskell

Tue Feb 24 14:24:29 EST 2004

On Tuesday 24 February 2004 03.07, John Meacham wrote:
> Inspired by an idea by Andrew Pang and an old project of mine, I decided
> to fill out a reusable regular expression library which is similar to
> Perl's, but much more expressive.
>  ...

Hi,
Thanks! I am grateful for your efforts because I have long missed some 
typical text processing functionality in Haskell. (Besides a more complete 
regular expression library, I think many Haskellers miss string constructors 
like an 'official' version of printf/format and maybe also some sort of 
'here documents'.) Below follow some of my thoughts regarding a complete 
regex library, in the hope that they will be of some inspiration.

1. Replacement
A regex library must contain functions for replacement with regular 
expressions. One could think this is trivial to implement given a match 
function, but there are some tricky choices to be made regarding empty 
matches (this also applies to splitting a string into fields with a regexp). 
There are also questions about the interface of these functions.
In my own Text.Regex wrapper I have the following functions:

  data Match = Match {before :: String, after :: String, groups :: [String]}
  ...
  substWithPat :: Regex -> String -> (Int -> Bool) -> String -> (String,Int)
  substWithFun :: Regex -> (Match->String) -> (Int->Bool) -> 
                               String -> (String,Int)
  substWithFunM :: (Monad m) => Regex -> (Match -> m String) -> 
                                 (Int->Bool) -> String -> m (String,Int)

Here a call to 'substWithPat pat rpat mode str' replaces matches
of 'pat' in 'str' by 'rpat' and returns the resulting string and the number of 
replacements done. The 'rpat' replace pattern can contain backreferences of 
the form \m, where \m refers to the mth subgroup in the corresponding match 
(\0 refers to the entire match). The call replaces only 'replaceable 
matches'. A match m is replaceable if it is the nth match and (mode n) is true, 
and m is either the first match, a proper (non-empty) match or an empty match 
succeeding an empty match. This scheme gives results which conform to the 
replace functionality in several other regex libraries, e.g. in Tcl, Python and Perl. 
For example, replacing matches of "_*" by "_" in "awk" gives "_a_w_k_", and 
replacing matches of "_*" by "_" in "sed_and_awk" gives "_s_e_d_a_n_d_a_w_k_". 
(Compare the discussion in 'Mastering Regular Expressions', O'Reilly, pages 
187-188.) The functions substWithFun and substWithFunM are obvious variations 
on the substWithPat function.
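
To make the empty-match rule concrete, here is a minimal sketch of such a 
substitution loop. It is not the code of my wrapper: the names Matcher, 
substWith, expandRefs and underscores are made up for illustration, the 
matcher is a hand-written stand-in for "_*" rather than a real regex engine, 
and I assume that only matches passing the empty-match rule are counted 
when 'mode' is consulted.

```haskell
import Data.Char (digitToInt, isDigit)

-- | Hypothetical matcher type: the leftmost match in the input,
-- given as (offset, length), or Nothing if there is no match.
type Matcher = String -> Maybe (Int, Int)

-- | Replace matches by applying 'f' to the matched text. An empty match
-- that starts exactly where a non-empty match ended is skipped.
-- Returns the new string and the number of replacements done.
substWith :: Matcher -> (String -> String) -> (Int -> Bool) -> String -> (String, Int)
substWith match f mode = go 1 False
  where
    -- n: index of the next candidate match.
    -- afterProper: the previous match was non-empty and ended here.
    go :: Int -> Bool -> String -> (String, Int)
    go n afterProper s = case match s of
      Nothing -> (s, 0)
      Just (off, len)
        | len == 0 && off == 0 && afterProper -> advance n s
        | otherwise ->
            let (pre, rest) = splitAt off s
                (m, post)   = splitAt len rest
                replaced    = mode n
                piece       = if replaced then f m else m
                (out, k)    = if len == 0
                                then advance (n + 1) post   -- avoid looping on ""
                                else go (n + 1) True post
            in (pre ++ piece ++ out, k + fromEnum replaced)

    -- Copy one character unchanged and resume matching after it.
    advance :: Int -> String -> (String, Int)
    advance _ []       = ([], 0)
    advance n (c : cs) = let (out, k) = go n False cs in (c : out, k)

-- | Expand backreferences in a replace pattern: \0 is the whole match,
-- \m (m = 1..9) is the mth subgroup.
expandRefs :: String -> [String] -> String -> String
expandRefs whole groups = go
  where
    go ('\\' : d : rest)
      | isDigit d = ref (digitToInt d) ++ go rest
    go (c : rest) = c : go rest
    go []         = []
    ref 0 = whole
    ref m = case drop (m - 1) groups of
      (g : _) -> g
      []      -> ""  -- assumption: a missing group expands to ""

-- | Stand-in matcher for the pattern "_*": always matches at offset 0,
-- consuming as many underscores as possible (possibly none).
underscores :: Matcher
underscores s = Just (0, length (takeWhile (== '_') s))
```

With these definitions, 'substWith underscores (const "_") (const True) "awk"' 
evaluates to ("_a_w_k_", 4), and passing (== 1) as the mode replaces only the 
first match, giving ("_awk", 1).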

2. Constructing regular expressions.
There is the well known problem that the backslash is used both as a string 
escape character and a regexp operator. I know of three approaches to the 
problem:
a) Bite the bullet and write regexps like "\\\\" in order to match a 
single backslash (e.g. as in Emacs Lisp).
b) Use a language extension for 'raw' strings where the backslash is not 
interpreted (e.g. /regex/ in awk, r"regex" in Python and {regex} in Tcl).
c) Use a different operator than the backslash in regular expressions. This 
has the benefit of not requiring a language extension, but the drawback of 
being nonstandard.
There is also the problem of inserting string values into regular expressions. 
Appending with ++ is not particularly convenient with complicated regexps 
because the result can be rather unreadable. I suppose we have to wait for a
standard implementation of printf in Template Haskell to solve this problem.
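
As a small illustration of approach a) and of the ++ problem, here is a 
sketch in plain Haskell. The names backslashPattern, quoteMeta and 
keyValuePattern are hypothetical, and the set of metacharacters that gets 
quoted is an assumption that would have to be adapted to the regex dialect 
in use.

```haskell
-- The Haskell literal "\\\\" denotes a string of two characters, both
-- backslashes; a regex engine reads those two characters as one
-- literal backslash.
backslashPattern :: String
backslashPattern = "\\\\"

-- | Hypothetical helper: put a backslash in front of every character
-- that is commonly a regex operator, so that a string value can be
-- spliced into a pattern and match itself literally.
quoteMeta :: String -> String
quoteMeta = concatMap quote
  where
    quote c
      | c `elem` "\\.^$*+?()[]{}|" = ['\\', c]
      | otherwise                  = [c]

-- | Build a pattern for lines of the form "key = value" by appending
-- with ++; the value ends up in the first subgroup.
keyValuePattern :: String -> String
keyValuePattern key = "^" ++ quoteMeta key ++ " *= *(.*)$"
```

Even in this tiny case the pattern is split into three pieces, which is the 
readability problem I mean; a printf-like constructor would let the pattern 
be written in one piece with a placeholder for the key.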

Cheers
Per