[Haskell-cafe] RegEx versus (Parsec, TagSoup, others...)
wren ng thornton
wren at freegeek.org
Fri Nov 12 19:41:51 EST 2010
On 11/12/10 6:56 PM, Michael Litchard wrote:
> I've been working on a project that requires me to do screen scraping.
> When I first started this, I worked off of other people's examples.
> Not one used regex. By luck I found someone at work to help me along
> this project. His clues and hints don't use regex either. I was at a
> point where I had to make a decision concerning design, so I asked the
> guy sitting next to me at work. He's very experienced, and comes from
> a Perl perspective. I let him into what I was doing, and he opined I
> should be using pcre. So now I'm second guessing my choices. Why do
> people choose not to use regex for uri parsing?
As the grammar becomes more complex (i.e., as your patterns become more
nuanced), using a real parser framework helps to improve code legibility
since you can factor parts of the grammar out, give them names, etc. In
addition to the documentation effects, this refactoring also allows you
to make your grammars modular by using the same subgrammar in multiple
places. While technically you can do the same factoring for constructing
the regex that gets handed off to pcre, almost no one does that in practice.
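To make the factoring point concrete, here is a minimal sketch using Parsec. The grammar is a deliberately simplified, hypothetical URI shape (scheme, "://", dotted host, optional port), not a real RFC-compliant URI parser, and all the names (scheme, hostLabel, authority, simpleUri, authorityList) are made up for illustration. The thing to notice is that authority is defined once, given a name, and then reused in two different grammars.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical, simplified grammar (an assumption for illustration):
--   simpleUri ::= scheme "://" authority
--   authority ::= host [ ":" port ]
--   host      ::= label ( "." label )*

data Authority = Authority
  { authHost :: [String]   -- host split into its dot-separated labels
  , authPort :: Maybe Int  -- optional port number
  } deriving (Eq, Show)

data SimpleUri = SimpleUri String Authority
  deriving (Eq, Show)

-- Each piece of the grammar gets its own name.
scheme :: Parser String
scheme = many1 letter

hostLabel :: Parser String
hostLabel = many1 (alphaNum <|> char '-')

host :: Parser [String]
host = hostLabel `sepBy1` char '.'

port :: Parser Int
port = read <$> many1 digit

authority :: Parser Authority
authority = Authority <$> host <*> optionMaybe (char ':' *> port)

-- First use of the named subgrammar: inside a URI.
simpleUri :: Parser SimpleUri
simpleUri = SimpleUri <$> scheme <* string "://" <*> authority

-- Second use of the same subgrammar: a comma-separated list of
-- host[:port] pairs, e.g. from a hypothetical config file.
authorityList :: Parser [Authority]
authorityList = authority `sepBy1` (char ',' *> spaces)
```

In GHCi, something like parse (simpleUri <* eof) "" "http://example.com:8080" runs the whole thing, and changing what counts as a host label only needs to happen in one place.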
Also, using a real parsing framework allows you to construct more
powerful grammars than regular grammars, so if you need the power of
unbounded recursion or of context sensitivity, then regular expressions
are out. Technically Perl's regexen are Turing complete and aren't
"regular expressions" at all; pcre has inherited some of that extra
power, but the point still holds at large.
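For a small example of the unbounded recursion that a truly regular expression cannot express, consider arbitrarily nested balanced parentheses. The sketch below uses Parsec again; the names groups, group, and maxDepth are hypothetical, and computing the nesting depth is just one convenient thing to do with the parse. The grammar refers to itself, which is exactly the part a regular grammar has no way to say.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- groups ::= group*
-- group  ::= "(" groups ")"
-- The two rules are mutually recursive, so nesting can go arbitrarily deep.

-- Depth of a sequence of groups: the deepest group in it (0 if empty).
groups :: Parser Int
groups = maximum . (0 :) <$> many group

-- Depth of one parenthesized group: one more than whatever is inside.
group :: Parser Int
group = (+ 1) <$> between (char '(') (char ')') groups

-- Just the maximum nesting depth if the whole input is balanced,
-- Nothing otherwise.
maxDepth :: String -> Maybe Int
maxDepth = either (const Nothing) Just . parse (groups <* eof) ""
```

With this, maxDepth "(()())" should give Just 2, while an unbalanced input such as "(()" gives Nothing.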
Even with more restricted regexen than Perl has, the modern idea of a
"regex" isn't regular at all. Beginning of sentence and end of sentence
anchors are not regular properties, which allows you to have the worst
kind of fun :)
Even if you did decide to go for regular expressions, pcre chooses a
specific implementation for handling choice (namely backtracking
search). Depending on your grammars and the text they'll be applied to,
this may not be the most efficient implementation since backtracking can
lead to exponential behaviors that other regex implementations don't have.
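To see where the blowup comes from, here is a toy backtracking matcher written in the list-of-successes style. To be clear, this is not how pcre is implemented; it is only a sketch of the general backtracking strategy, and the names Regex, match, fullMatch, and pathsTried are invented for the example. Each element of the result list is one way the pattern could have matched a prefix, so the length of that list is the number of alternatives a backtracking search may have to try before giving up.

```haskell
-- A tiny regex AST (hypothetical, for illustration only).
data Regex
  = Lit Char
  | Seq Regex Regex
  | Alt Regex Regex
  | Star Regex

-- For every way the regex can match a prefix of the input, return the
-- remaining suffix. Walking this list lazily is a backtracking search.
match :: Regex -> String -> [String]
match (Lit c) (x:xs) | c == x = [xs]
match (Lit _) _               = []
match (Seq r s) inp           = concatMap (match s) (match r inp)
match (Alt r s) inp           = match r inp ++ match s inp
match (Star r) inp            =
  inp : [ rest' | rest  <- match r inp
                , length rest < length inp  -- require progress, so no infinite loop
                , rest' <- match (Star r) rest ]

-- The regex matches the whole input if some alternative consumes all of it.
fullMatch :: Regex -> String -> Bool
fullMatch r = any null . match r

-- (a|aa)* can split a run of n 'a's in many different ways.
ambiguous :: Regex
ambiguous = Star (Alt (Lit 'a') (Seq (Lit 'a') (Lit 'a')))

-- How many alternatives (a|aa)* offers on a run of n 'a's. When the rest
-- of the pattern then fails, e.g. in (a|aa)*b, a backtracking matcher
-- has to try every one of them.
pathsTried :: Int -> Int
pathsTried n = length (match ambiguous (replicate n 'a'))
```

Evaluating map pathsTried [5, 10, 15, 20] shows the count growing roughly like the Fibonacci numbers, so fullMatch (Seq ambiguous (Lit 'b')) on a long run of 'a's gets slow very quickly. An automaton-based implementation would decide the same question in time linear in the input.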
Also, regexes are apparently very difficult to implement *correctly*: