[Haskell-cafe] Regular Expressions without GADTs

oleg at pobox.com oleg at pobox.com
Tue Oct 18 20:20:38 EDT 2005

Conor McBride wrote:
> Neither Oleg nor Bruno translated my code; they threw away my
> structurally recursive on-the-fly automaton and wrote combinator parsers
> instead. That's why there's no existential, etc. The suggestion that
> removing the GADT simplifies the code would be better substantiated if
> like was compared with like.

The following code specifically tries to be as close to the original
Conor McBride's code as possible. I only removed local type
annotations (turn out unnecessary) and replaced `ap' with `liftM' as a
matter of personal preference.

> Where we came in, if you recall, was the use of a datatype to define
> regular expressions. I used a GADT to equip this type with some useful
> semantic information (namely a type of parsed objects), and I wrote a
> program (divide) which exploited this representation to compute the
> regexp the tail of the input must match from the regexp the whole input
> must match. This may be a bit of a weird way to write a parser, but it's
> a useful sort of thing to be able to do...

That particular feature is fully preserved. Data types are used to
define regular expression:

> p = (Star (Mult (Star (Check (== 'a'))) (Star (Check (== 'b')))))
> testp = parse p "abaabaaabbbb"

  *RX> :t p
  p :: Star (Mult (Star (Check Char)) (Star (Check Char)))

the regular expression is apparent in the type of the `parser' (or its
representation, to be precise). One can easily define data types
Digit, Alphanumeric, WhiteSpace, etc. and so type will look more
informative. Here's one step to regular types... Instead of GADT, a
type class is used to associate semantic information with the data
type (with labels). Typing is more explicit now.

> And that's the point: not only does the GADT have all the disadvantages
> of a closed datatype, it has all the advantages too!

A type class can be closed too, thanks to functional dependencies.

You were right: the code below is just the mechanical reshuffling of
the original code, using the translation I come across a year
ago. `empty' and `division' are now grouped by the parser type, which
I personally like. It is hard to see any loss of expressiveness by
getting rid of GADTs in this case. More explicit type is an
advantage. Also an advantage is the fact that `pattern match' is
guaranteed exhaustive. If I forgot to implement some particular
parser, that would be a type error (rather than a pattern-match
run-time error, as is the case with ADT/GADT).

> Moreover, once you have to start messing about with equality
> constraints, the coding to eliminate GADTs becomes a lot hairier. I
> implemented it back in '95 (when all these things had different names)
> and I was very glad not to have to do it by hand any more.

The discovery of the presence of type equality in Haskell changes
things a great deal. It would be interesting to see how your code of
'95 will look now. Writing an equality and even _dis-equality_
constraint is just as simple as |TypeEq a b HTrue| or |TypeEq a b

{-# OPTIONS -fglasgow-exts #-}

module RX where

import Control.Monad

class RegExp r tok a | r tok -> a where
   empty :: r -> [tok] -> Maybe a
   divide :: tok -> r -> Division tok a

data Zero = Zero
data One = One
newtype Check tok = Check (tok -> Bool)
data Plus r1 r2 = Plus r1 r2
data Mult r1 r2 = Mult r1 r2
newtype Star r = Star r

data Empty = Empty

data Division tok x = forall r y. RegExp r tok y => Div r (y -> x)

parse :: (RegExp r tok x) => r -> [tok] -> Maybe x
parse r t@[]       = empty r t
parse r (t : ts) = case divide t r of
		      Div q f -> liftM f $ parse q ts

nullt :: tok -> [tok]
nullt _ = []

instance RegExp Zero tok Empty where
    empty  _ _ = mzero
    divide _ _                  = Div Zero naughtE

instance RegExp One tok () where
    empty  _ _ = return ()
    divide _ _                  = Div Zero naughtE

instance RegExp (Check tok) tok tok where
    empty  _ _ = mzero
    divide t (Check p) | p t    = Div One (const t)
    divide _ _                  = Div Zero naughtE

instance (RegExp r1 tok a, RegExp r2 tok b) 
    => RegExp (Plus r1 r2) tok (Either a b) where
    empty (Plus r1 r2) t = (liftM Left  $ empty r1 t) `mplus`
			   (liftM Right $ empty r2 t)
    divide t (Plus r1 r2) =
      case (divide t r1, divide t r2) of
        (Div q1 f1, Div q2  f2) ->
          Div (Plus q1 q2) (f1 +++ f2)

instance (RegExp r1 tok a, RegExp r2 tok b) 
    => RegExp (Mult r1 r2) tok (a,b) where
    empty (Mult r1 r2) t = liftM2 (,) (empty r1 t) (empty r2 t)
    divide t (Mult r1 r2) = 
	case (empty r1 (nullt t), divide t r1, divide t r2) of
		 (Nothing, Div q1 f1, _) -> Div (Mult q1 r2) (f1 *** id)
		 (Just x1, Div q1 f1, Div q2 f2) ->
		     Div (Plus (Mult q1 r2) q2) 
			     (either (f1 *** id) (((,) x1) . f2))

instance RegExp r tok a => RegExp (Star r) tok [a] where
    empty  _ _ = return []
    divide t (Star r) = case (divide t r) of
	Div q f -> Div (Mult q (Star r)) (\ (y, xs) -> (f y : xs))

(***) :: (a -> b) -> (c -> d) -> (a, c) -> (b, d)
(f *** g) (a, c) = (f a, g c)

(+++) :: (a -> b) -> (c -> d) -> Either a c -> Either b d
(f +++ g) (Left a)  = Left  (f a)
(f +++ g) (Right c) = Right (g c)

naughtE :: Empty -> x
naughtE = undefined

p = (Star (Mult (Star (Check (== 'a'))) (Star (Check (== 'b')))))

testp = parse p "abaabaaabbbb"

More information about the Haskell-Cafe mailing list