[Haskell-cafe] Efficient parallel regular expressions
Martijn van Steenbergen
martijn at van.steenbergen.nl
Wed Nov 5 08:56:53 EST 2008
Hello everyone,
Thank you all for your comments! Those are some very useful ideas.
I think I'll try roger's (private) and ChrisK's suggestion first: using
the match groups. I'm not sure if the match groups inside the individual
regexes will cause much trouble, but we'll see. I imagine I'll have to
count parentheses, except when it's followed by a \, except when that \
follows another \, etc. There's probably other situations where a ()
doesn't count as a group, perhaps when it's followed by a * or +. I'll
look into that.
If that doesn't work out I'll go for Neil's (from an algorithmic POV
beautiful) suggestion.
While I understand that some of you suggest I use parsec (or some other
mature parser library) I'm pretty sure that's not what I want here. The
patterns will almost always be very simple and regular expressions offer
an extremely concise way of expressing when a hook should fire. Forcing
the user to use full parsers would cause the programs to become much
more verbose. Still, Yogurt is flexible enough to allow the user to use
parsec if he or she so chooses.
Thanks again,
Martijn.
Mitchell, Neil wrote:
> Hi Martijn,
>
> It's not that tricky if you do a regular expression state machine
> yourself, but that's probably a bit too much work. One way to get a
> speed up might be to take the regular expressions a,b,c,d and
> generate a regex a+b+c+d, and one a+b. You can then check any string
> s against a+b+c+d, if that matches check a+b, if that matches check
> a. At each stage you eliminate half the regular expressions, which
> means a match will take log n, where n is the number of regular
> expressions.
>
> This assumes the underlying regular expression engine constructs a
> finite state machine, making it O(m) where m is the length of the
> string to match.
>
> Thanks
>
> Neil
More information about the Haskell-Cafe
mailing list