Announcing regex-tre-0.66 and benchmarks

Wed Aug 9 10:13:30 EDT 2006

skaller wrote:
> On Mon, 2006-08-07 at 20:38 +0100, Chris Kuklewicz wrote:
>> skaller wrote:
> 
>>> Wouldn't it be nice to use Ville Laurikari's TRE 
>>> package instead of PCRE?
>>>
>>> [It is also Posix compliant and drop in replacement for
>>> gnu regex .. as well as supporting nice extensions]
>>>
>> It is possible to add support for more backends.
>> The more the merrier, no need to replace anything.
>> I have never heard of TRE before.
> 
> TRE is actually based on mathematics.

I have added TRE as a new backend.  (I have libtre-0.7.4 on OS X, it is LGPL).

And I have a benchmark, which I will now copy and paste.  On "b(..?c)?d": PCRE 
is fast, TRE is almost as fast, the DFA (without subexpression capture) is a 
close 3rd.  The Parsec backend is about 4 times slower, and the Posix library 
scales poorly, being far slower on large data sets.  The code is in darcs 
http://evenmere.org/~chrisk/trl/head/ under regex-devel/bench

Benchmark: find all matches and substring capture for "b(..?c)?d" in a
one million character string on disk.  Print the count, the first, and
the last match (with captures).

The benchmark program was, for String:

 > module Main(main) where
 >
 > import Text.Regex.XXX
 >
 > filename = "datafile"
 > regex = "b(..?c)?d"
 > main = do
 >   input <- readFile filename
 >   let a :: [[String]]
 >       a =  input =~ regex
 >       b :: Int
 >       b = length a
 >   print (b,head a,last a)

and for ByteString:

 > module Main(main) where
 >
 > import Text.Regex.XXX
 > import qualified Data.ByteString as B
 >
 > default (Int)
 >
 > filename = "datafile"
 > regex = "b(..?c)?d"
 >
 > main = do
 >   input <- B.readFile filename
 >   let a :: [[B.ByteString]]
 >       a =  input =~ regex
 >       b :: Int
 >       b = length a
 >   print (b,head a,last a)

where XXX was replaces by PCRE, Parsec, DFA, PosixRE, TRE and compile with "ghc -O2"

Data file is 10^6 characters from permutations of the set "abcdbcdcdd\n"

Using the 10^6 character datafile and String or ByteString.
(The DFS uses a different semantics so a dot matches a newline, so the matching 
is different.)
The output and user+sys reported by the time command:

BenchPCRE (102363,["bcdcd","cdc"],["bbccd","bcc"])
total is 1.294s

BenchTRE (102363,["bcdcd","cdc"],["bbccd","bcc"])
total is  2.128s

BenchDFA (107811,["bcdcd"],["bbccd"])
total is 2.313s

BenchParsec (102363,["bcdcd","cdc"],["bbccd","bcc"])
total is 8.094s

BenchPosixRE (102363,["bcdcd","cdc"],["bbccd","bcc"])
total is 91.435s

BenchBSPCRE (102363,["bcdcd","cdc"],["bbccd","bcc"])
total is 0.932s

BenchBSTRE (102363,["bcdcd","cdc"],["bbccd","bcc"])
total is 1.297s

BenchBSDFA (107811,["bcdcd"],["bbccd"])
total is 1.437s

BenchBSParsec (102363,["bcdcd","cdc"],["bbccd","bcc"])
total is 8.496s

BenchBSPosixRE (102363,["bcdcd","cdc"],["bbccd","bcc"])
total is 89.780

For 10^5 characters on String:
PCRE       0.077s
DFA        0.131s
TRE        0.206s
PosixRE    0.445s
Parsec     0.825s
Old Posix 43.760s  (Text.Regex using splitRegex)

Old Text.Regex took 43.76 seconds on 10^5 characters to do a task
comparable to the one the others did (it ran splitRegex).  The new
PosixRE wrapping took 0.445 seconds instead.  Yes it is two orders of
magnitude faster, and this is because my wrapping only marshals the
String to CString once.  Laziness cannot be worth 2 orders of
magnitude of runtime.  This is why we needed a new wrapping, which has
grown into the new library.

On a 10^7 length data set:

PCRE (979261,["bcdcd","cdc"],["bd",""])
time 17.388s

TRE (979261,["bcdcd","cdc"],["bd",""])
time 17.880s

DFA (1063961,["bcdcd"],["bd"])
time 21.617s

Parsec (979261,["bcdcd","cdc"],["bd",""])
time 87.330s

BenchBSPCRE (979261,["bcdcd","cdc"],["bd",""])
time 8.322s

BenchBSTRE (979261,["bcdcd","cdc"],["bd",""])
time 12.644s

BenchBSDFA (1063961,["bcdcd"],["bd"])
time 14.115s

BenchBSParsec (979261,["bcdcd","cdc"],["bd",""])
time 83.395s