[Haskell-cafe] trivial function application question

Chris Kuklewicz haskell at list.mightyreason.com
Fri Jan 5 09:19:33 EST 2007


tphyahoo wrote:
> So the core question (speaking as a perler) is how do you write
> 
>   my $s= 'abcdefg';
>   $s =~ s/a/z/g;
>   $s =~ s/b/y/g;
>   print "$s\n";
> 
>  in haskell? There are various haskell regex libraries out there,
>  including ones that advertise they are PCRE (Perl Compatible Reg Ex).
> 

I updated the regex libraries for GHC 6.6. ( All the regex-* packages. )  The
old API is still supported in Text.Regex.  The old API has a replacement
function, while the new API does not have one (yet).

For simple regular expressions, where Posix and Perl agree, you can just use
Text.Regex.subRegex which comes with GHC.  In 6.6 this comes in the regex-compat
package, and which calls the regex-posix backend via the interfaces defined in
regex-base.  All of these come with GHC, since GHC needs regex support to
compile itself.

So if you do not need more syntax than POSIX regex (with back references) then

http://www.haskell.org/ghc/docs/latest/html/libraries/regex-compat/Text-Regex.html#v%3AsubRegex

works, but depends on the low performance posix-regex backend.  This will run
your example above, for instance.

Better regex searching performance can be had by using the new interface via
Text.Regex.Base with better backends and/or with Data.ByteString.  In the future
there will be Data.Sequence (of Char and perhaps Word8) support added to the
backends.

There is no updated API for performing replacements using a pluggable backend.
The design space is too large with conflicting needs to be lazy or strict, time
or space efficient, etc.  The best thing is to write the replacement function
that your application needs.  You can use the new searching API (see
micro-tutorial below) to write a replacement routine in less than a screen of code.

For instance, the regex-compat version of Text.Regex.subRegex is

> {- | Replaces every occurance of the given regexp with the replacement string.
> 
> In the replacement string, @\"\\1\"@ refers to the first substring;
> @\"\\2\"@ to the second, etc; and @\"\\0\"@ to the entire match.
> @\"\\\\\\\\\"@ will insert a literal backslash.
> 
> This is unsafe if the regex matches an empty string.
> -}
> subRegex :: Regex                          -- ^ Search pattern
>       -> String                         -- ^ Input string
>       -> String                         -- ^ Replacement text
>       -> String                         -- ^ Output string
> subRegex _ "" _ = ""
> subRegex regexp inp repl =
>     let bre = mkRegex "\\\\(\\\\|[0-9]+)"
>         lookup _ [] _ = []
>         lookup [] _ _ = []
>         lookup match repl groups =
>             case matchRegexAll bre repl of
>                 Nothing -> repl
>                 Just (lead, _, trail, bgroups) ->
>                     let newval = if (head bgroups) == "\\"
>                                  then "\\"
>                                  else let index = (read (head bgroups)) - 1
>                                           in
>                                           if index == -1
>                                              then match
>                                              else groups !! index
>                         in
>                         lead ++ newval ++ lookup match trail groups
>         in
>         case matchRegexAll regexp inp of
>             Nothing -> inp
>             Just (lead, match, trail, groups) ->
>               lead ++ lookup match repl groups ++ (subRegex regexp trail repl)

You could just paste that code into a file that imports a different backend and
it should work since it uses just the type class API. You might also improve on
the above routine or specialize it.  The above handle \0 \1 \2 substitutions
(and \\ escaping) in the replacement string, including multi-digit references
such as \15 for very large regular expressions.  It operation only on [Char] and
is somewhat lazy.

>  But which one to use? How hard to install? With the libs mentioned
>  above, the "PCRE"-ness seems only to be for matching, not for
>  substitutions.
> 

I think if you paste the subRegex code above underneath an "import
Text.Posix.PCRE" declaration then you get what you are looking for.

To install:

The regex-* package hosting is via darcs and has been copied/moved to

http://darcs.haskell.org/packages/   (The stable regex-* versions)
http://darcs.haskell.org/packages/regex-unstable/  (The unstable regex-* versions)

so "darcs get --partial http://darcs.haskell.org/packages/regex-pcre" might be
useful.

They have (hopefully working) cabal files to make compiling and installing easy.
 Note that regex-pcre and regex-tre need libpcre and libtre to be installed
separately.  regex-posix needs a posix library, but GHC already provides this
package with a working libary.

These 3 come with GHC:

regex-base defines the type classes and APIs and most RegexContext instances
regex-compat imitates the old Text.Regex API using regex-posix
regex-posix backend has awful performance.  Not for heavy use.

These 4 backends can be downloaded using darcs:

regex-pcre uses libpcre and this imitates the PERL search syntaxs and semantics.
regex-tre used libtre, a very fast posix-compatible (but LGPL) library.
regex-parsec which is not very speedy, but is pure Haskell
regex-dfa is pure haskell and fast, but cannot do subexpression capture yet, and
has some problems (repeating an empty-matching pattern like "(a*)*" will create
an infinite loop in compiling the regex).

regex-devel is for hosting test suites and benchmarks.  You won't need this.

>  http://www.cs.chalmers.se/~d00nibro/harp/
>  http://repetae.net/john/computer/haskell/JRegex/

JRegex is mostly superseded by the more flexible interface of regex-base.  This
flexibility allowed Data.ByteString to be used for efficiency as well as String.

The new API for searching with regex works like this (a micro-tutorial):

import Text.Regex.Base
import Text.Regex.Posix -- or PCRE or Parsec or TRE or DFA

The above should provide (=~) and (=~~) matching operators, which do
("text to be searched" =~ "regex to use") operations.  The value returned is any
of the instances of RegexContext (See documentation for
Text.Regex.Base.Context).  So in a numerical context (i.e. you demand an Int
type) the search returns the number of matches, and in a more complicated
context it returns all matches including all their captures (\0 \1 \2 \3 ...).

=~ and =~~ differ in that the second operates in a Monadic context and this has
access to a "fail" method when the search does not succeed.

But =~ and =~~ are convenience methods for
  1st : converting the regex string to a compiled form with 'makeRegex'
  2nd : performing the search with 'match' (for =~) or 'matchM' (for =~~)
The error handling (such as with a malformed regex) is primitive.

The RegexMaker class defines 'makeRegex' as well as a more complicated
'makeRegexOpts' which lets you specify backend specific options (e.g.
case-sensitivity).  The error handling is primitive.

The RegexContext class defines match and matchM which are still at a high level
of the API and the return type is polymorphic (i.e. depends on the context of
the call much like in PERL).  The error handling is primitive.

The RegexContext class does its magic by using the RegexLike class methods.  The
RegexLike methods are a medium level API and are implemented by instances in
such modules as Text.Regex.Posix.String and Text.Regex.Posix.ByteString.
The RegexMaker 'makeRegex' instances are also defined in these modules.  The
error handling is primitive.

If you import a specific module such as Text.Regex.Posix.String then you can
access the three lower-level API functions:
  'compile' : used to build makeRegexOpts/makeRegex for RegexMaker
  'execute' : used to perform a search and get back offsets and lengths of
matching \0 \1 \2, etc, where \0 is the whole match.
  'regexec' : used to perform a search and get back the actual substrings:
(before match,the whole match,after the match,[list of captured sequences])

These three lower-level API functions are available in all the backends, and use
"Either String a" to report errors in a sensible way.

If for some very bizarre reason you don't like my API's then you can import
whatever modules happen to have the raw API for that backend.  For PCRE this is
Text.Regex.PCRE.Wrap (which is Wrap.hsc file in the source, not Wrap.hs).  This
has the FFI interface to the libpcre library and a bunch of haskell enumerations
 and types defined, and exports functions with names starting with "wrap" that
handle correctly calling the c library from Haskell.  These are very specific to
the PCRE backend, and the errors are always returned in a very detailed way via
Either types.

> 
>  So, I would like to know a good answer to this as well.
> 
>  thomas.
> 

Feel free to ask for more help and better answer.


More information about the Haskell-Cafe mailing list