Announcing regex-tre-0.66 and benchmarks
Chris Kuklewicz
haskell at list.mightyreason.com
Wed Aug 9 10:13:30 EDT 2006
skaller wrote:
> On Mon, 2006-08-07 at 20:38 +0100, Chris Kuklewicz wrote:
>> skaller wrote:
>
>>> Wouldn't it be nice to use Ville Laurikari's TRE
>>> package instead of PCRE?
>>>
>>> [It is also Posix compliant and drop in replacement for
>>> gnu regex .. as well as supporting nice extensions]
>>>
>> It is possible to add support for more backends.
>> The more the merrier, no need to replace anything.
>> I have never heard of TRE before.
>
> TRE is actually based on mathematics.
I have added TRE as a new backend. (I have libtre-0.7.4 on OS X, it is LGPL).
And I have a benchmark, which I will now copy and paste. On "b(..?c)?d": PCRE
is fast, TRE is almost as fast, the DFA (without subexpression capture) is a
close 3rd. The Parsec backend is about 4 times slower, and the Posix library
scales poorly, being far slower on large data sets. The code is in darcs
http://evenmere.org/~chrisk/trl/head/ under regex-devel/bench
Benchmark: find all matches and substring capture for "b(..?c)?d" in a
one million character string on disk. Print the count, the first, and
the last match (with captures).
The benchmark program was, for String:
> module Main(main) where
>
> import Text.Regex.XXX
>
> filename = "datafile"
> regex = "b(..?c)?d"
> main = do
> input <- readFile filename
> let a :: [[String]]
> a = input =~ regex
> b :: Int
> b = length a
> print (b,head a,last a)
and for ByteString:
> module Main(main) where
>
> import Text.Regex.XXX
> import qualified Data.ByteString as B
>
> default (Int)
>
> filename = "datafile"
> regex = "b(..?c)?d"
>
> main = do
> input <- B.readFile filename
> let a :: [[B.ByteString]]
> a = input =~ regex
> b :: Int
> b = length a
> print (b,head a,last a)
where XXX was replaces by PCRE, Parsec, DFA, PosixRE, TRE and compile with "ghc -O2"
Data file is 10^6 characters from permutations of the set "abcdbcdcdd\n"
Using the 10^6 character datafile and String or ByteString.
(The DFS uses a different semantics so a dot matches a newline, so the matching
is different.)
The output and user+sys reported by the time command:
BenchPCRE (102363,["bcdcd","cdc"],["bbccd","bcc"])
total is 1.294s
BenchTRE (102363,["bcdcd","cdc"],["bbccd","bcc"])
total is 2.128s
BenchDFA (107811,["bcdcd"],["bbccd"])
total is 2.313s
BenchParsec (102363,["bcdcd","cdc"],["bbccd","bcc"])
total is 8.094s
BenchPosixRE (102363,["bcdcd","cdc"],["bbccd","bcc"])
total is 91.435s
BenchBSPCRE (102363,["bcdcd","cdc"],["bbccd","bcc"])
total is 0.932s
BenchBSTRE (102363,["bcdcd","cdc"],["bbccd","bcc"])
total is 1.297s
BenchBSDFA (107811,["bcdcd"],["bbccd"])
total is 1.437s
BenchBSParsec (102363,["bcdcd","cdc"],["bbccd","bcc"])
total is 8.496s
BenchBSPosixRE (102363,["bcdcd","cdc"],["bbccd","bcc"])
total is 89.780
For 10^5 characters on String:
PCRE 0.077s
DFA 0.131s
TRE 0.206s
PosixRE 0.445s
Parsec 0.825s
Old Posix 43.760s (Text.Regex using splitRegex)
Old Text.Regex took 43.76 seconds on 10^5 characters to do a task
comparable to the one the others did (it ran splitRegex). The new
PosixRE wrapping took 0.445 seconds instead. Yes it is two orders of
magnitude faster, and this is because my wrapping only marshals the
String to CString once. Laziness cannot be worth 2 orders of
magnitude of runtime. This is why we needed a new wrapping, which has
grown into the new library.
On a 10^7 length data set:
PCRE (979261,["bcdcd","cdc"],["bd",""])
time 17.388s
TRE (979261,["bcdcd","cdc"],["bd",""])
time 17.880s
DFA (1063961,["bcdcd"],["bd"])
time 21.617s
Parsec (979261,["bcdcd","cdc"],["bd",""])
time 87.330s
BenchBSPCRE (979261,["bcdcd","cdc"],["bd",""])
time 8.322s
BenchBSTRE (979261,["bcdcd","cdc"],["bd",""])
time 12.644s
BenchBSDFA (1063961,["bcdcd"],["bd"])
time 14.115s
BenchBSParsec (979261,["bcdcd","cdc"],["bd",""])
time 83.395s
More information about the Libraries
mailing list