[Haskell-cafe] RegEx versus (Parsec, TagSoup, others...)

Christopher Done chrisdone at googlemail.com
Mon Nov 15 12:48:15 EST 2010


On 13 November 2010 16:46, Neil Mitchell <ndmitchell at gmail.com> wrote:
>> I've been working on a project that requires me to do screen scraping.
>
> If you are screen scraping HTML I think tagsoup is a very good choice.
> The use of tagsoup means that you have a real HTML 5 compliant parser
> underneath, and then you can use whatever technique you wish to split
> up the page text - and regular expressions/parsec might be a
> reasonable choice. I've written lots of screen scraping stuff with
> tagsoup, and it's usually very easy - the manual even walks you
> through a couple of examples:
> http://community.haskell.org/~ndm/darcs/tagsoup/tagsoup.htm

Agreed, the tagsoup library just works. I've used it plenty of times
for my scraping needs. E.g. scraping from paste sites:

https://github.com/chrisdone/amelie/blob/master/src/Amelie/Import.hs#L84

https://github.com/chrisdone/hpaste-feed/blob/master/main.hs#L65

You can always run regex matches over the tags and text that tagsoup
gives you, too.
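
A rough sketch of that idea, not taken from either of the linked
projects: parse the page with tagsoup, keep only the text nodes, and
filter them with a regex. It assumes the regex-tdfa package for (=~);
regex-posix would look much the same. The helper name
textNodesMatching and the sample HTML are made up for illustration.

```haskell
import Text.HTML.TagSoup (parseTags, isTagText, fromTagText)
import Text.Regex.TDFA ((=~))

-- Hypothetical helper: return every text node in the HTML
-- that matches the given POSIX extended regex.
textNodesMatching :: String -> String -> [String]
textNodesMatching pattern html =
  [ t
  | t <- map fromTagText (filter isTagText (parseTags html))
  , t =~ pattern          -- Bool context: does the text node match?
  ]

main :: IO ()
main =
  -- Made-up sample input; prints the text nodes that contain digits.
  mapM_ putStrLn
    (textNodesMatching "[0-9]+" "<ul><li>paste 123</li><li>no id</li></ul>")
```

The point of the split is that tagsoup deals with the messy markup, so
the regex only ever sees plain text and can stay simple.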

