[Haskell-cafe] RegEx versus (Parsec, TagSoup, others...)

Fri Nov 12 19:41:51 EST 2010

On 11/12/10 6:56 PM, Michael Litchard wrote:
> I've been working on a project that requires me to do screen scraping.
> When I first started this, I worked off of other people's examples.
> Not one used regex. By luck I found someone at work to help me along
> this project. His clues and hints don't use regex either. I was at a
> point where I had to make a decision concerning design, so I asked the
> guy sitting next to me at work. He's very experienced, and comes from
> a Perl perspective. I let him in on what I was doing, and he opined I
> should be using pcre. So now I'm second guessing my choices. Why do
> people choose not to use regex for uri parsing?

As the grammar becomes more complex (i.e., as your patterns become more 
nuanced), using a real parser framework helps to improve code legibility 
since you can factor parts of the grammar out, give them names, etc. In 
addition to the documentation effects, this refactoring also allows you 
to make your grammars modular by using the same subgrammar in multiple 
places. While technically you can do the same factoring for constructing 
the regex that gets handed off to pcre, almost no one does that in practice.
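
To make the factoring point concrete, here is a minimal sketch. It uses 
ReadP (which ships with base) rather than Parsec, only so that it runs 
without extra packages; the combinator style is much the same. The 
SimpleUri type, the names hostLabel, segment, etc., and the grammar 
itself are made up for illustration and are far simpler than real URI 
syntax.

```haskell
import Data.Char (isAlpha, isAlphaNum)
import Data.List (intercalate)
import Text.ParserCombinators.ReadP

-- Hypothetical, heavily simplified URI shape: scheme://host/seg/seg
data SimpleUri = SimpleUri
  { uriScheme :: String
  , uriHost   :: String
  , uriPath   :: [String]
  } deriving (Show, Eq)

-- Each piece of the grammar gets its own name and can be reused.
scheme :: ReadP String
scheme = munch1 isAlpha

-- One dot-free piece of a host name, e.g. "www" or "example".
hostLabel :: ReadP String
hostLabel = munch1 (\c -> isAlphaNum c || c == '-')

-- A host is one or more labels separated by dots.
host :: ReadP String
host = intercalate "." <$> sepBy1 hostLabel (char '.')

-- One path segment (without the leading slash).
segment :: ReadP String
segment = munch1 (\c -> isAlphaNum c || c `elem` "-._~")

-- Zero or more "/segment" pieces.
path :: ReadP [String]
path = many (char '/' *> segment)

-- The whole grammar, assembled from the named parts above.
uri :: ReadP SimpleUri
uri = SimpleUri <$> scheme <* string "://" <*> host <*> path <* eof

-- Succeeds only if the whole input is consumed by exactly one parse.
parseUri :: String -> Maybe SimpleUri
parseUri s = case readP_to_S uri s of
  [(u, "")] -> Just u
  _         -> Nothing
```

In GHCi, parseUri "http://example.com/a/b" should give back a SimpleUri 
with scheme "http", host "example.com" and path ["a","b"], while 
parseUri "not a uri" gives Nothing. The equivalent single regex would 
have all of these pieces inlined and unnamed.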

Also, using a real parsing framework allows you to construct more 
powerful grammars than regular grammars, so if you need the power of 
unbounded recursion or of context sensitivity, then regular expressions 
are out. Technically Perl's regexen are Turing complete and aren't 
"regular expressions" at all; pcre has inherited some of that extra 
power, but the point still holds at large.
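
As a small illustration of unbounded recursion, here is a sketch of a 
parser for arbitrarily nested balanced parentheses, a language no truly 
regular expression can describe. It is written with ReadP from base as a 
stand-in for Parsec; the names balanced and maxDepth are just mine for 
this example.

```haskell
import Text.ParserCombinators.ReadP

-- Grammar:  balanced ::= ( "(" balanced ")" )*
-- The parser refers to itself, which is exactly what a regular
-- grammar cannot do. It returns the maximum nesting depth.
balanced :: ReadP Int
balanced = do
  depths <- many (between (char '(') (char ')') balanced)
  return (if null depths then 0 else 1 + maximum depths)

-- Nesting depth of a fully balanced string, or Nothing otherwise.
maxDepth :: String -> Maybe Int
maxDepth s = case readP_to_S (balanced <* eof) s of
  [(d, "")] -> Just d
  _         -> Nothing
```

Here maxDepth "(()())" is Just 2, maxDepth "" is Just 0, and an 
unbalanced input such as "(()" yields Nothing.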

Even with more restricted regexen than Perl has, the modern idea of a 
"regex" isn't regular at all. Backreferences in particular are not a 
regular feature (anchors on their own are harmless), which allows you to 
have the worst kind of fun :)

     http://zmievski.org/2010/08/the-prime-that-wasnt

Even if you did decide to go for regular expressions, pcre chooses a 
specific implementation for handling choice (namely backtracking 
search). Depending on your grammars and the text they'll be applied to, 
this may not be the most efficient implementation since backtracking can 
lead to exponential behaviors that other regex implementations don't have.
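
To see where the blowup comes from, here is a toy backtracking matcher. 
This is only a sketch of the general technique, not how pcre is actually 
implemented; the Re type and the functions paths, matchesFully and 
deadEnds are invented for this example. It enumerates every way a 
pattern can consume a prefix of the input, which is what a backtracking 
search ends up exploring when the overall match fails.

```haskell
-- A tiny regex syntax: literal, sequence, choice, repetition.
data Re = Chr Char | Seq Re Re | Alt Re Re | Star Re

-- Every way `re` can match a prefix of the input; each result is
-- the unconsumed remainder. Laziness makes this a depth-first,
-- backtracking search.
paths :: Re -> String -> [String]
paths (Chr c) (x:xs) | c == x = [xs]
paths (Chr _) _               = []
paths (Seq a b) s             = concatMap (paths b) (paths a s)
paths (Alt a b) s             = paths a s ++ paths b s
paths (Star a) s              =
  -- Zero iterations, or one iteration followed by more. The length
  -- check insists on progress so empty matches cannot loop forever.
  s : concatMap (paths (Star a)) [r | r <- paths a s, length r < length s]

-- True if some path consumes the entire input.
matchesFully :: Re -> String -> Bool
matchesFully re = any null . paths re

-- (a|aa)*
ambiguous :: Re
ambiguous = Star (Alt (Chr 'a') (Seq (Chr 'a') (Chr 'a')))

-- (a|aa)*b
pat :: Re
pat = Seq ambiguous (Chr 'b')

-- Number of ways (a|aa)* can consume a prefix of n 'a's. When pat is
-- run on that input, each of these is tried against 'b' and fails.
deadEnds :: Int -> Int
deadEnds n = length (paths ambiguous (replicate n 'a'))
```

For n = 1..5, deadEnds n is 2, 4, 7, 12, 20: it grows like the Fibonacci 
numbers, i.e. exponentially in the length of the input, even though the 
answer for (a|aa)*b on a string of only 'a's is obviously "no match". An 
automaton-based implementation tracks the set of reachable states 
instead of individual paths, so it does not have this problem.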

Also, regexes are apparently very difficult to implement *correctly*:

     http://www.haskell.org/haskellwiki/Regex_Posix

-- 
Live well,
~wren