[Haskell-cafe] major speed improvement: regex-tdfa reaches 1.0.0

Eugene Kirpichov ekirpichov at gmail.com
Mon Mar 2 06:27:37 EST 2009


Congratulations! This is really cool.
And this also made me find a bug in GHC :)

http://hackage.haskell.org/trac/ghc/ticket/3059

2009/3/2 ChrisK <haskell at list.mightyreason.com>:
> Announcing the version 1.0.0 release of regex-tdfa.
>
> I am proud of this release.
> This is not just a bug fix release.
> It is a serious improvement in the asymptotic running time.
>
> The previous versions allowed bad combinations of pattern and searched
> text length to scale badly in the length of the text.  Previously the
> worst case for text of length N was O(N^3).
>
> The new worst case asymptotic runtime scaled as O(N).
> There is never any backtracking.
> And the worst case storage space is independent of N.
>
> The package is on hackage at
> http://hackage.haskell.org/cgi-bin/hackage-scripts/package/regex-tdfa
> The darcs repository is at
> http://darcs.haskell.org/packages/regex-unstable/regex-tdfa/
>
> All non-overlapping matches are found and returned, along with all
> captured (parenthesized) subexpressions.  The result is precisely what
> Posix extended regular expressions are supposed to return.
>
> To be concrete, consider example text with length of N of (2^n)+2:
>
>> longInput = replicate (2^n) 'a' ++ "bc"
>
> And define 4 worst-case-scenario patterns.  I wrote the code and I
> know how to kill it:
>
>> wcs0 = "a*b"
>> wcs1 = "a*c"
>> wcs2 = "(a|aa)*b"
>> wcs3 = "(a|aa)*c"
>
> wcs0 is easy.
> wcs1 causes the old code to backtrack.
> wcs2 causes the old code's storage to scale as O(N).
> wcs3 causes both backtracking and O(N) storage with the old code.
>
> The old code's time scales as O(N) for wcs0, O(N^2) for wcs1 and wcs2,
> and O(N^3) for wcs3.  The new code is always O(N).  The actual timings
> for the old code on my G4 laptop for wcs on 2^8 and 2^9 and 2^10 are:
>
> Reason:compare-tdfa chrisk$ time ./Test-TDFA-np wcs3 8 +RTS -sstderr 2>&1  |
> head -4
> ./Test-TDFA-np wcs3 8 +RTS -sstderr
> Test for [Char]
> Right [array (0,1) [(0,(257,1)),(1,(-1,0))]]
>     263,776,756 bytes allocated in the heap
> user    0m1.017s
> sys     0m0.058s
>
> Reason:compare-tdfa chrisk$ time ./Test-TDFA-np wcs3 9 +RTS -sstderr 2>&1  |
> head -4
> ./Test-TDFA-np wcs3 9 +RTS -sstderr
> Test for [Char]
> Right [array (0,1) [(0,(513,1)),(1,(-1,0))]]
>   1,998,647,452 bytes allocated in the heap
> user    0m7.083s
> sys     0m0.289s
>
> Reason:compare-tdfa chrisk$ time ./Test-TDFA-np wcs3 10 +RTS -sstderr 2>&1
>  | head -4
> ./Test-TDFA-np wcs3 10 +RTS -sstderr
> Test for [Char]
> Right [array (0,1) [(0,(1025,1)),(1,(-1,0))]]
>  15,653,277,200 bytes allocated in the heap
> user    0m53.350s
> sys     0m2.056s
>
> The doubling of length is causing a nearly eight-fold time increase.
> The heap allocation is also increasing nearly eight-fold.
>
> The new code with the same input pattern and texts gives:
>
> Reason:compare-tdfa chrisk$ time ./Test-TDFA2-0.99.19-np2 wcs3 8 +RTS
> -sstderr 2>&1  | head -4
> ./Test-TDFA2-0.99.19-np2 wcs3 8 +RTS -sstderr
> Test for [Char]
> Right [array (0,1) [(0,(257,1)),(1,(-1,0))]]
>       2,135,324 bytes allocated in the heap
> user    0m0.017s
> sys     0m0.017s
>
> Reason:compare-tdfa chrisk$ time ./Test-TDFA2-0.99.19-np2 wcs3 9 +RTS
> -sstderr 2>&1  | head -4
> ./Test-TDFA2-0.99.19-np2 wcs3 9 +RTS -sstderr
> Test for [Char]
> Right [array (0,1) [(0,(513,1)),(1,(-1,0))]]
>       3,588,656 bytes allocated in the heap
> user    0m0.024s
> sys     0m0.017s
>
> Reason:compare-tdfa chrisk$ time ./Test-TDFA2-0.99.19-np2 wcs3 10 +RTS
> -sstderr 2>&1  | head -4
> ./Test-TDFA2-0.99.19-np2 wcs3 10 +RTS -sstderr
> Test for [Char]
> Right [array (0,1) [(0,(1025,1)),(1,(-1,0))]]
>       6,345,436 bytes allocated in the heap
> user    0m0.038s
> sys     0m0.018s
>
> Note that the heap allocation for the 1026 character example above is
> 2466 times less than the old code.
>
> That was too fast to prove the scaling, so take more input:
>
> Reason:compare-tdfa chrisk$ time ./Test-TDFA2-0.99.19-np2 wcs3 20 +RTS
> -sstderr 2>&1  | head -4
> ./Test-TDFA2-0.99.19-np2 wcs3 20 +RTS -sstderr
> Test for [Char]
> Right [array (0,1) [(0,(1048577,1)),(1,(-1,0))]]
>   5,708,574,848 bytes allocated in the heap
> user    0m26.023s
> sys     0m0.985s
>
> Reason:compare-tdfa chrisk$ time ./Test-TDFA2-0.99.19-np2 wcs3 21 +RTS
> -sstderr 2>&1  | head -4
> ./Test-TDFA2-0.99.19-np2 wcs3 21 +RTS -sstderr
> Test for [Char]
> Right [array (0,1) [(0,(2097153,1)),(1,(-1,0))]]
>  11,416,354,056 bytes allocated in the heap
> user    0m52.656s
> sys     0m1.985s
>
> The length and time both doubled, as did the heap allocation.  And the
> new code has searched two million characters in the time the old code
> searched one thousand.
>
> How about away from the worst case scenario?  On the testing suite the
> new code is running slightly slower:
>
> Reason:compare-tdfa chrisk$ time ./Test-TDFA-np -r 1 100 > /dev/null
> user    0m4.841s
> sys     0m3.019s
>
> Reason:compare-tdfa chrisk$ time ./Test-TDFA2-0.99.19-np2 -r 1 100 >
> /dev/null
> user    0m5.970s
> sys     0m3.012s
>
> So that is an increase of execution time of 14%.  This small dip in
> performance might be reclaimable with more optimization.  I think the
> gain in worst case performance already offsets the slight cost.
>
> The code for String is complete.  The strict and lazy bytestrings and
> the (Seq Char) are currently using the String code for matching.  This
> will be improved in a future release.
>
> Cheers,
>  Chris
>
> The small print: regex-tdfa will still behave badly in space and time
> if given a pathological pattern,
> e.g. "(((a|aa){0,100}){0,100}){0,100}".  But, I am hopeful than I can
> improve regex-tdfa to behave well with the patterns like this one.
> That is my vague goal for a future version 2.0.0 release.
>
> The very small print: I have been using ghc-6.10.1, and I think I have
> carried over cabal switches to make ghc-6.8.3 also work.  The library
> probably can work with ghc-6.6, but the cabal file will need editing
> first as well as fixes to "Text.Regex.TDFA.NewDFA.copySTU".
>
> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe at haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe
>



-- 
Eugene Kirpichov
Web IR developer, market.yandex.ru


More information about the Haskell-Cafe mailing list