Roman Cheplyaka roma at ro-che.info
Sun Sep 18 10:47:53 CEST 2011

```* Anton Tayanovskyy <anton.tayanovskyy at gmail.com> [2011-09-17 21:46:57-0400]
> So you want to encode priorities efficiently as far as I understand
> from [1]? Could bit-packing combined with prefix elimination do the
> trick? Choice boils down to binary choice. Attach a number N to every
> execution thread that sits in a given NFA state. When the thread moves
> through a binary choice state, it splits into two threads with
> N_{left} = 2 * N + 1 and N_{right} = 2 * N. When two threads arrive at
> the same NFA state, the machine picks one with greater N and discards
> the other, thus giving priority to early over late and left over
> right. This makes these numbers grow quickly, but at any point in the
> simulation one can identify the common binary prefix of all the thread
> numbers and remove it, because it will not be relevant for comparison.
> If this is too costly, amortize and do it every K steps instead of on
> every step.
>
> Anton
>
> [1] https://github.com/feuerbach/regex-applicative/wiki/Call-For-Help

Just common prefix elimination is not sufficient to put a hard bound on
the memory usage. For example, something like /(a*)|((aa)*)/ would still
require an amount of space linear in the string length (if the string
consists of many a's).

More clever techniques that I tried ("monotonically map lists to
smaller lists") are hard to make correct.

I like the approach of Russ Cox[2]. One of the great ideas there (which I
think he didn't emphasize enough) is to have a queue which allows O(1)
testing whether an element is already there [3]. This solves the problem
with priorities -- the states are guaranteed to be enqueued in the order
of their priorities, and there are no repetitions.

To be able to process the states in the proper order I'll have to give
up Glushkov automaton and probably use something similar to your
construction [4].

[2] http://swtch.com/~rsc/regexp/regexp2.html