[Haskell-cafe] regex-applicative library needs your help! (algorithmic challenge)

Chris Kuklewicz haskell at list.mightyreason.com
Thu Sep 15 11:23:15 CEST 2011


If you are worried about Denial Of Service attacks then you are worried about
(1) memory explosion, and (2) long runtime.  Long runtime can and should be
solved by running a timeout connected to a thread killer, not at the level of
the regular expression engine.  The memory explosion is more interesting, and
here the engine's tradeoffs have something to say.
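
(As an aside on the runtime half: in GHC the standard System.Timeout.timeout
already gives you the timeout plus thread killer.  A minimal sketch, assuming
some pure matcher 'match' that you want to bound; the name and type here are
mine, not any particular library's:

  import Control.Exception (evaluate)
  import System.Timeout    (timeout)

  -- Bound the wall-clock time of a match with an asynchronous timeout,
  -- instead of policing runtime inside the regex engine itself.
  matchWithTimeout :: Int                  -- allowed time in microseconds
                   -> (String -> Maybe a)  -- the (hypothetical) matcher
                   -> String               -- input to match against
                   -> IO (Maybe (Maybe a)) -- Nothing means "timed out"
  matchWithTimeout micros match input =
    timeout micros (evaluate (match input))

The evaluate is only there so the match is actually forced inside the timeout
rather than lazily afterwards.)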

Since I know simple POSIX regular expression matching better, here is an
overview of that case, without subexpression capture:

(1) The full DFA may be exponentially large and thus cannot be built ahead of time.
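
A concrete instance of (1), the textbook one rather than anything from
regex-applicative: "the n-th symbol from the end is an 'a'" has an NFA of
about n states, but any DFA for it has to remember the last n symbols and so
needs on the order of 2^n states.  A throwaway helper (my own) to generate it:

  -- "(a|b)*a(a|b){n-1}": small NFA, exponentially large minimal DFA.
  pathological :: Int -> String
  pathological n = "(a|b)*a" ++ concat (replicate (n - 1) "(a|b)")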

(2) Building the DFA in a lazy fashion will cause memory to grow linearly with
the input processed (at worst one new DFA state per input character), which
will be a problem if it is left to run too long.

(2.1) Pure lazy DFA building won't have a way to determine the currently
computed size of the DFA, and so is not going to help you.

(2.2) Cheating the "pure" with unsafePerformIO might be able to keep a count of
the number of computed DFA nodes.  Good luck with thread safety.

(2.3) Making it a stateful expanding DFA would allow you to control and monitor
the growing number of DFA states.

(2.4) Adjust the stateful DFA to keep a limited cache of computed DFA nodes.
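
To make (2.3)/(2.4) concrete, here is a minimal sketch (names and shape are
mine, not taken from any existing engine) of a stateful, lazily expanded DFA
whose transition cache lives in an IORef, is therefore easy to monitor, and is
simply dropped once it exceeds a limit:

  import           Data.IORef
  import qualified Data.Map.Strict as M
  import qualified Data.Set        as S

  -- A DFA node is the set of NFA states reachable so far; the transition
  -- table is grown on demand and kept in an IORef so its size can be
  -- watched and capped.
  type DfaNode q  = S.Set q
  type DfaCache q = M.Map (DfaNode q, Char) (DfaNode q)

  stepDFA :: Ord q
          => (S.Set q -> Char -> S.Set q)  -- your NFA transition function
          -> Int                           -- maximum cached transitions
          -> IORef (DfaCache q)            -- mutable, monitorable cache
          -> DfaNode q -> Char -> IO (DfaNode q)
  stepDFA nfaStep limit ref node c = do
    cache <- readIORef ref
    case M.lookup (node, c) cache of
      Just next -> return next         -- re-use an already computed node
      Nothing   -> do
        let next  = nfaStep node c     -- expand the DFA lazily
            grown = M.insert (node, c) next cache
            -- (2.4): drop the cache once it passes the limit, trading
            -- speed for a hard memory bound
            kept  = if M.size grown > limit then M.empty else grown
        writeIORef ref kept
        return next

A smarter eviction policy than "empty the whole cache" is obviously possible;
the point is only that the number of computed DFA nodes is now observable and
bounded.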

(3) Building an NFA from your regular expression can be cheap, but you would
have to be more clever than expanding "a(b{2,99}c){100}d" into 100 copies of 99
copies of b.
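
The usual trick (a sketch of the general idea, not of how any particular
library does it; the types are mine) is to keep counted repetition symbolic
and to pair each NFA position with its loop counters during the simulation,
instead of unrolling:

  -- Keep bounded repetition as a constructor instead of expanding it.
  data Regex
    = Sym Char
    | Seq Regex Regex
    | Alt Regex Regex
    | Rep Int Int Regex        -- r{lo,hi}, e.g. Rep 2 99 (Sym 'b')
    deriving Show

  -- a(b{2,99}c){100}d, built without making any copies of b:
  example :: Regex
  example = Sym 'a' `Seq` Rep 100 100 (Rep 2 99 (Sym 'b') `Seq` Sym 'c')
                    `Seq` Sym 'd'

  -- An NFA "state" is then a position in this small expression plus the
  -- current counter values: two small integers for the example above,
  -- not 100 copies of 99 copies of b.
  type Counters        = [Int]
  type ThreadState pos = (pos, Counters)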

(4) Using an NFA with multiple states (breadth-first) is like the limit of a DFA
with only a single computed node at a time.  The number of states active in the
pure NFA can be easily counted and bounded to keep the system from using too
much memory.  Without the lazy or stateful caching of DFA states this may be
slightly slower, depending on whether the cached states would be re-used heavily
or not.
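
In code, point (4) is just a set of active states advanced over the input one
symbol at a time, with a size check on that set; again a minimal sketch with
names of my own choosing:

  import qualified Data.Set as S

  -- Breadth-first NFA simulation with a hard bound on the number of
  -- simultaneously active states.
  runNFA :: Ord q
         => (q -> Char -> S.Set q)  -- per-state transition function
         -> Int                     -- maximum allowed active states
         -> S.Set q                 -- start states
         -> String                  -- input
         -> Maybe (S.Set q)         -- Nothing means the bound was hit
  runNFA step limit = go
    where
      go active []     = Just active
      go active (c:cs)
        | S.size next > limit = Nothing   -- refuse to use more memory
        | otherwise           = go next cs
        where
          next = S.unions [ step q c | q <- S.toList active ]

For a plain NFA the active set is already bounded by the total number of NFA
states; the explicit check starts to matter once each state carries extra
tracked information (counters, captures) and the number of distinct threads
can grow.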

If you want POSIX subexpression capture then your memory requirements go up by
a factor that is, in the worst case, proportional to the size of the regular
expression.  For "a(b{2,99}c){100}d" that worst case really does mean tracking
100 copies of 99 copies of the tracking information.  (The NFA itself can still
be kept cleverly small when adding capture, but the information you now have to
track cannot be made so clever.)
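
Concretely, each active thread of the breadth-first simulation now drags along
its own private table of capture offsets, something like the following (the
record is mine, purely to illustrate the shape of the tracked data):

  import qualified Data.IntMap.Strict as IM

  type Offset   = Int
  type Captures = IM.IntMap (Offset, Offset)  -- group index -> (start, end)

  -- One thread of the simulation: a position in the NFA plus that
  -- thread's own copy of the capture information.  Duplicating this per
  -- thread is where the factor proportional to the expression size
  -- comes from.
  data Thread q = Thread
    { threadState    :: q
    , threadCaptures :: Captures
    }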

Russ Cox's RE2 engine uses NFA/DFA techniques to achieve Perl-like semantics by
checking the possibilities in parallel.  The same problems with exploding DFA
size and tracked information may still apply.



On 14/09/2011 07:32, Roman Cheplyaka wrote:
> * Eugene Kirpichov <ekirpichov at gmail.com> [2011-09-14 08:38:10+0400]
>> Hi,
>> I don't see how fallback to NFA simulation is really a failure wrt DoS
>> attacks. It's not exponential in time or memory, just linear memory
>> (in size of regex) instead of constant, and slower than DFA.
> 
> Hi Eugene, thanks for pointing that out.
> 
> Indeed, I now see that he uses breadth-first rather than depth-first
> search. Then he has the same problem as I do. I shall study his code
> more closely.
> 




