[Haskell-cafe] Regular Expressions, was Re: Interest in helping w/ Haskell standard

Martin Sulzmann sulzmann at comp.nus.edu.sg
Mon Oct 17 07:38:59 EDT 2005


Very interesting Conor. Do you know Xi et al's APLAS'03 paper?
(Hongwei, I'm not sure whether you're on this list).
Xi et al. use GRDTs (aka GADTs aka first-class phantom types)
to represent XML documents. There're may be some connections
between what you're doing and Xi et al's work.

I believe that there's an awful lot you can  do with GADTs 
(in the context of regular expressions). But there're two things 
you can't accomplish:
semantic subtyping and regular expression pattern matching
a la XDuce/CDuce. Unless somebody out there proves me wrong.

Martin



Hi folks,

Inspired by Ralf's post, I thought I'd just GADTize a dependently typed=20
program I wrote in 2001.

Wolfgang Jeltsch wrote:

> Now lets consider using an algebraic datatype for regexps:
>
>	data RegExp
>		=3D Empty | Single Char | RegExp :+: RegExp | RegExp :|: RegExpt | Ite=
r RegExp
>
>Manipulating regular expressions now becomes easy and safe =E2=80=93 you=
 are just not=20
>able to create "syntactically incorrect regular expressions" since durin=
g=20
>runtime you don't deal with syntax at all.
> =20
>

A fancier variation on the same theme...

 > data RegExp :: * -> * -> * where
 >   Zero   :: RegExp tok Empty
 >   One    :: RegExp tok ()
 >   Check  :: (tok -> Bool) -> RegExp tok tok
 >   Plus   :: RegExp tok a -> RegExp tok b -> RegExp tok (Either a b)
 >   Mult   :: RegExp tok a -> RegExp tok b -> RegExp tok (a, b)
 >   Star   :: RegExp tok a -> RegExp tok [a]

 > data Empty

The intuition is that a RegExp tok output is a regular expression=20
explaining how to parse a list of tok as an output. Here, Zero is the=20
regexp which does not accept anything, One accepts just the empty=20
string, Plus is choice and Mult is sequential composition; Check lets=20
you decide whether you like a single token.

Regular expressions may be seen as an extended language of polynomials=20
with tokens for variables; this parser works by repeated application of=20
the remainder theorem.

 > parse :: RegExp tok x -> [tok] -> Maybe x
 > parse r []       =3D empty r
 > parse r (t : ts) =3D case divide t r of
 >   Div q f -> return f `ap` parse q ts

Example

*RegExp> parse (Star (Mult (Star (Check (=3D=3D 'a'))) (Star (Check (=3D=3D=
=20
'b'))))) "abaabaaabbbb"
Just [("a","b"),("aa","b"),("aaa","bbbb")]

The 'remainder' explains if a regular expression accepts the empty=20
string, and if so, how. The Star case is a convenient=20
underapproximation, ruling out repeated empty values.
=20
 > empty :: RegExp tok a -> Maybe a
 > empty Zero         =3D mzero
 > empty One          =3D return ()
 > empty (Check _)    =3D mzero
 > empty (Plus r1 r2) =3D (return Left  `ap` empty r1) `mplus`
 >                      (return Right `ap` empty r2)
 > empty (Mult r1 r2) =3D return (,) `ap` empty r1 `ap` empty r2
 > empty (Star _)     =3D return []

The 'quotient' explains how to parse the tail of the list, and how to=20
recover the meaning of the whole list from the meaning of the tail.

 > data Division tok x =3D forall y. Div (RegExp tok y) (y -> x)

Here's how it's done. I didn't expect to need scoped type variables, but=20
I did...

 > divide :: tok -> RegExp tok x -> Division tok x
 > divide t Zero                  =3D Div Zero naughtE
 > divide t One                   =3D Div Zero naughtE
 > divide t (Check p) | p t       =3D Div One (const t)
 >                    | otherwise =3D Div Zero naughtE
 > divide t (Plus (r1 :: RegExp tok a) (r2 :: RegExp tok b)) =3D
 >   case (divide t r1, divide t r2) of
 >     (Div (q1 :: RegExp tok a') (f1 :: a' -> a),
 >       Div (q2 :: RegExp tok b') (f2 :: b' -> b)) ->
 >       Div (Plus q1 q2) (f1 +++ f2)
 > divide t (Mult r1 r2) =3D case (empty r1, divide t r1, divide t r2) of
 >   (Nothing, Div q1 f1, _) -> Div (Mult q1 r2) (f1 *** id)
 >   (Just x1, Div q1 f1, Div q2 f2) ->
 >     Div (Plus (Mult q1 r2) q2) (either (f1 *** id) (((,) x1) . f2))
 > divide t (Star r) =3D case (divide t r) of
 >   Div q f -> Div (Mult q (Star r)) (\ (y, xs) -> (f y : xs))

Bureaucracy.

 > (***) :: (a -> b) -> (c -> d) -> (a, c) -> (b, d)
 > (f *** g) (a, c) =3D (f a, g c)

 > (+++) :: (a -> b) -> (c -> d) -> Either a c -> Either b d
 > (f +++ g) (Left a)  =3D Left  (f a)
 > (f +++ g) (Right c) =3D Right (g c)

 > naughtE :: Empty -> x
 > naughtE =3D undefined

It's not the most efficient parser in the world (doing some algebraic=20
simplification on the fly wouldn't hurt), but it shows the sort of stuff=20
you can do.

Have fun

Conor
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe at haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


More information about the Haskell-Cafe mailing list