[Haskell-cafe] Re: "Tying the knot" with unknown keys

Tue Aug 21 08:24:28 EDT 2007

David Ritchie MacIver wrote:
> I was playing with some code for compiling regular expressions to finite
> state machines and I ran into the following problem. I've solved it, but
> I'm not terribly happy with my solution and was wondering if someone
> could come up with a better one. :-)
> 
> Essentially I have
> 
> data FSM = State { transitions :: (Map Char FSM) }
> 
> and
> 
> transitions' :: Regexp -> Map Char Regexp
> 
> I want to lift this so that the Regexps become states of the finite
> state machine (while making sure I set up a loop in the data structure).
> Tying the knot is the traditional way of doing such things, but we
> couldn't figure out a way to make it work without the set of keys known
> in advance because of the strictness of Map in its keys (an association
> list was suggested, and that would probably work, but it seemed a bit
> ugly and would be fairly inefficient).
> 
> In the end what I did was just work out the set of reachable regexps in
> advance and use a standard tying the knot trick, but it felt vaguely
> unsatisfactory (and does some repeat work which I felt should be
> unneccessary). Anyone have a more elegant suggestion?
> 
> Regards,
> David

As others have pointed out, the decision to use Data.Map is a limiting issue.

Another approach is the combinator method in CTK Light
http://www.cse.unsw.edu.au/~chak/haskell/ctk/ which was specialized and enhanced
for regexp in the regex-dfa package [1].  This lazily constructs a DFA from the
regular expression.  The "tying the knot" happens in the definition of the
'star' combinator (the '*' regexp character, also implied by '+').  The problem
with this elegant definition is that it fails badly if the pattern being
repeated might succeed after consuming zero characters (it hangs in an infinite
loop).  Other than that it is a wonderful definition:

> type Regexp = Lexer -> Lexer
>
> -- star re1 re2 means repeat re1 follow with re2: "((re1)*)(re2)"
> star :: Regexp -> Regexp -> Regexp
> star re1 re2  = \l -> let self = re1 self >||< re2 l in self

My regex-tdfa package did something quite different since it has to deal with a
lot of extra complexity.  It works in a few stages, in particular because it
needs the NFA states to handle subexpression captures and because it has to
handle anchors like ^ and $.

String of regexp => Parsec extended regexp parser => parse tree data type

parse tree => complicated analyzer (uses mdo) => smarter tree data type

smarter tree => My complicated assembly monad (uses mdo) => Array Int NFA

Where a simplified description of NFA is something like
data NFA = NFA Int Trans
data Trans = Trans (Map Char (Set Int))-- Might lead to more than one NFA state
I could just as easily have made this
data Trans = Trans (Map Char (Set NFA))
or
data Trans = Trans (Map Char (Set Trans))
by doing a lazy lookup into the array, but it would then not have been as easy
to make the DFA in the next step:

Array Int NFA => Use of Trie indexed by (Set Int) => DFA

Where a simplified DFA is like
data DFA = DFA (Set Int) (Map Char DFA)
and the Trie means I can lazily lookup any subset of NFA state and get their
merge DFA state.

So the procedure starts with a simple empty winning NFA to the "right" of the
parse tree.  The rexgexp tree walk is done in a monad which provides the supply
of unique Int index when a new NFA state is created.  The last NFA state to be
created is the unique start state which always gets the largest Int index.

The "tying the knot" trick in building the NFA was handled by walking the regexp
parse tree where each node is attached to an NFA representing the future
continuation from that node.  The tricky case was the one that kills the simple
"tying the knot" in CTK Light's method: when you have 'p*' and 'p' might match
zero characters.  The continuation needed to describe the future in that case
had to be supplied in a more complicated form while walking 'p' to avoid the
infinite looping.

There are no mutable STRef/IORef variables.  All the NFA nodes that are created
during the monadic traversal are part of the final NFA (so there is no wasted
work even though I make a single walk through the tree).  The resulting NFA is
not as minimal as the differentiation method since my traversal does not look at
whether characters in the regexp are equal (my NFA builder is equivalent to
treating all the regexp characters are distinct).  But this also means I do not
have the combinatoric explosion of regexps that the differentiation method can
produce.  I kept improving the design until it reproduced the same kinds of NFA
graphs I could produce manually under those assumptions.

The typical NFA state represents the condition "you have just accepted character
X in the regexp".  This is different from the Thompson NFA where states usually
mean "you have just accepted a character leading to character X in the regexp".

The NFA that regexp-tdfa produces
  * has no empty transitions
  * Captures extended regexp Posix semantics (difficult with empty matches)
  * handles anchors such as ^ and $ properly
  * efficiently handle inverted matches like [^a-z] with a default transition
  * Tracks parenthesized subexpression for later capture
  * Can handle very tricky cases like {n,m} repetitions

Since it has no empty transitions the NFA states are also the simplest states of
the DFA.  And the DFA states are just merged subsets of the NFA states, thus a
simple Trie works wonders.

Advanced hints: Things like ((^|a)?b*($|c)?)* are ugly.  Getting Posix patterns
with {n,m} repetitions right was difficult -- and quick check found out that the
Mac OS X regex.h actually contains a bug for some uses of {n,m} repetitions.

-- 
Chris

[1] http://hackage.haskell.org/cgi-bin/hackage-scripts/package/regex-dfa-0.91