[Haskell] major speed improvement: regex-tdfa reaches 1.0.0

Sun Mar 1 17:36:20 EST 2009

Announcing the version 1.0.0 release of regex-tdfa.

I am proud of this release.
This is not just a bug fix release.
It is a serious improvement in the asymptotic running time.

The previous versions allowed bad combinations of pattern and searched
text length to scale badly in the length of the text.  Previously the
worst case for text of length N was O(N^3).

The new worst case asymptotic runtime scaled as O(N).
There is never any backtracking.
And the worst case storage space is independent of N.

The package is on hackage at
http://hackage.haskell.org/cgi-bin/hackage-scripts/package/regex-tdfa
The darcs repository is at
http://darcs.haskell.org/packages/regex-unstable/regex-tdfa/

All non-overlapping matches are found and returned, along with all
captured (parenthesized) subexpressions.  The result is precisely what
Posix extended regular expressions are supposed to return.

To be concrete, consider example text with length of N of (2^n)+2:

 > longInput = replicate (2^n) 'a' ++ "bc"

And define 4 worst-case-scenario patterns.  I wrote the code and I
know how to kill it:

 > wcs0 = "a*b"
 > wcs1 = "a*c"
 > wcs2 = "(a|aa)*b"
 > wcs3 = "(a|aa)*c"

wcs0 is easy.
wcs1 causes the old code to backtrack.
wcs2 causes the old code's storage to scale as O(N).
wcs3 causes both backtracking and O(N) storage with the old code.

The old code's time scales as O(N) for wcs0, O(N^2) for wcs1 and wcs2,
and O(N^3) for wcs3.  The new code is always O(N).  The actual timings
for the old code on my G4 laptop for wcs on 2^8 and 2^9 and 2^10 are:

Reason:compare-tdfa chrisk$ time ./Test-TDFA-np wcs3 8 +RTS -sstderr 2>&1  | head -4
./Test-TDFA-np wcs3 8 +RTS -sstderr
Test for [Char]
Right [array (0,1) [(0,(257,1)),(1,(-1,0))]]
      263,776,756 bytes allocated in the heap
user	0m1.017s
sys	0m0.058s

Reason:compare-tdfa chrisk$ time ./Test-TDFA-np wcs3 9 +RTS -sstderr 2>&1  | head -4
./Test-TDFA-np wcs3 9 +RTS -sstderr
Test for [Char]
Right [array (0,1) [(0,(513,1)),(1,(-1,0))]]
    1,998,647,452 bytes allocated in the heap
user	0m7.083s
sys	0m0.289s

Reason:compare-tdfa chrisk$ time ./Test-TDFA-np wcs3 10 +RTS -sstderr 2>&1  | 
head -4
./Test-TDFA-np wcs3 10 +RTS -sstderr
Test for [Char]
Right [array (0,1) [(0,(1025,1)),(1,(-1,0))]]
   15,653,277,200 bytes allocated in the heap
user	0m53.350s
sys	0m2.056s

The doubling of length is causing a nearly eight-fold time increase.
The heap allocation is also increasing nearly eight-fold.

The new code with the same input pattern and texts gives:

Reason:compare-tdfa chrisk$ time ./Test-TDFA2-0.99.19-np2 wcs3 8 +RTS -sstderr 
2>&1  | head -4
./Test-TDFA2-0.99.19-np2 wcs3 8 +RTS -sstderr
Test for [Char]
Right [array (0,1) [(0,(257,1)),(1,(-1,0))]]
        2,135,324 bytes allocated in the heap
user	0m0.017s
sys	0m0.017s

Reason:compare-tdfa chrisk$ time ./Test-TDFA2-0.99.19-np2 wcs3 9 +RTS -sstderr 
2>&1  | head -4
./Test-TDFA2-0.99.19-np2 wcs3 9 +RTS -sstderr
Test for [Char]
Right [array (0,1) [(0,(513,1)),(1,(-1,0))]]
        3,588,656 bytes allocated in the heap
user	0m0.024s
sys	0m0.017s

Reason:compare-tdfa chrisk$ time ./Test-TDFA2-0.99.19-np2 wcs3 10 +RTS -sstderr 
2>&1  | head -4
./Test-TDFA2-0.99.19-np2 wcs3 10 +RTS -sstderr
Test for [Char]
Right [array (0,1) [(0,(1025,1)),(1,(-1,0))]]
        6,345,436 bytes allocated in the heap
user	0m0.038s
sys	0m0.018s

Note that the heap allocation for the 1026 character example above is
2466 times less than the old code.

That was too fast to prove the scaling, so take more input:

Reason:compare-tdfa chrisk$ time ./Test-TDFA2-0.99.19-np2 wcs3 20 +RTS -sstderr 
2>&1  | head -4
./Test-TDFA2-0.99.19-np2 wcs3 20 +RTS -sstderr
Test for [Char]
Right [array (0,1) [(0,(1048577,1)),(1,(-1,0))]]
    5,708,574,848 bytes allocated in the heap
user	0m26.023s
sys	0m0.985s

Reason:compare-tdfa chrisk$ time ./Test-TDFA2-0.99.19-np2 wcs3 21 +RTS -sstderr 
2>&1  | head -4
./Test-TDFA2-0.99.19-np2 wcs3 21 +RTS -sstderr
Test for [Char]
Right [array (0,1) [(0,(2097153,1)),(1,(-1,0))]]
   11,416,354,056 bytes allocated in the heap
user	0m52.656s
sys	0m1.985s

The length and time both doubled, as did the heap allocation.  And the
new code has searched two million characters in the time the old code
searched one thousand.

How about away from the worst case scenario?  On the testing suite the
new code is running slightly slower:

Reason:compare-tdfa chrisk$ time ./Test-TDFA-np -r 1 100 > /dev/null
user	0m4.841s
sys	0m3.019s

Reason:compare-tdfa chrisk$ time ./Test-TDFA2-0.99.19-np2 -r 1 100 > /dev/null
user	0m5.970s
sys	0m3.012s

So that is an increase of execution time of 14%.  This small dip in
performance might be reclaimable with more optimization.  I think the
gain in worst case performance already offsets the slight cost.

The code for String is complete.  The strict and lazy bytestrings and
the (Seq Char) are currently using the String code for matching.  This
will be improved in a future release.

Cheers,
   Chris

The small print: regex-tdfa will still behave badly in space and time
if given a pathological pattern,
e.g. "(((a|aa){0,100}){0,100}){0,100}".  But, I am hopeful than I can
improve regex-tdfa to behave well with the patterns like this one.
That is my vague goal for a future version 2.0.0 release.

The very small print: I have been using ghc-6.10.1, and I think I have
carried over cabal switches to make ghc-6.8.3 also work.  The library
probably can work with ghc-6.6, but the cabal file will need editing
first as well as fixes to "Text.Regex.TDFA.NewDFA.copySTU".