[Haskell-cafe] RegEx versus (Parsec, TagSoup, others...)

Neil Mitchell ndmitchell at gmail.com
Sat Nov 13 10:46:50 EST 2010

> I've been working on a project that requires me to do screen scraping.

If you are screen scraping HTML, I think tagsoup is a very good choice.
Using tagsoup means that you have a real HTML5-compliant parser
underneath, and you can then use whatever technique you wish to split
up the page text; regular expressions or parsec might be a reasonable
choice for that part. I've written lots of screen scraping stuff with
tagsoup, and it's usually very easy - the manual even walks you
through a couple of examples:
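
As a rough illustration of that split (this is not one of the manual's
examples, just a minimal sketch), the code below lets tagsoup deal with
the markup and leaves only plain attribute strings for any further
processing. The function name extractLinks is hypothetical; parseTags,
canonicalizeTags and the Tag constructors come from Text.HTML.TagSoup.

```haskell
import Text.HTML.TagSoup (Tag (..), canonicalizeTags, parseTags)

-- Hypothetical helper: collect the href of every <a> tag, in document order.
-- canonicalizeTags lower-cases tag and attribute names, so <A HREF=...>
-- is treated the same as <a href=...>.
extractLinks :: String -> [String]
extractLinks html =
  [ href
  | TagOpen "a" attrs <- canonicalizeTags (parseTags html)
  , Just href <- [lookup "href" attrs]
  ]
```

The strings returned by extractLinks are ordinary text, so a regular
expression or a parsec parser could be applied to them afterwards
without having to know anything about HTML syntax.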

> He's very experienced, and comes from
> a Perl perspective. I let him into what I was doing, and he opined I
> should be using pcre.

When all you have is a hammer, everything looks like a thumb.
Structured manipulation of algebraic data types is trivial in Haskell,
and much less natural in Perl, so they use different techniques in
different places.

> So now I'm second guessing my choices. Why do
> people choose not to use regex for uri parsing?

If you mean HTML parsing, then it's because it's a nightmare to get
right, and people on the web do all kinds of crazy stuff. A correct
regular expression to match an HTML tag is lots of work. Given that
it's a solved problem, why go to all that effort? It is possible to do
with regular expressions, but not pleasant.
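
To make the difficulty concrete, here is a minimal sketch of the kind of
first-attempt pattern that tends to break on real pages. It assumes the
regex-tdfa package (the thread itself only mentions pcre); the name
naiveHrefs and the pattern are hypothetical and chosen for illustration.

```haskell
import Text.Regex.TDFA ((=~))

-- Hypothetical first attempt: only matches the exact shape <a href="...">.
naiveHrefs :: String -> [String]
naiveHrefs html = [href | [_, href] <- (html =~ pat :: [[String]])]
  where
    pat :: String
    pat = "<a href=\"([^\"]*)\">"

-- Inputs that this pattern handles badly:
--   <A HREF='/b'>y</A>                 upper case and single quotes: missed
--   <a class="x" href="/c">z</a>       another attribute comes first: missed
--   <!-- <a href="/old">old</a> -->    commented-out link: still matched
```

Each of these cases can be patched into the pattern, but the expression
grows with every fix, which is the work a dedicated HTML parser has
already done.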

Thanks, Neil
