[Haskell-cafe] RegEx versus (Parsec, TagSoup, others...)
Christopher Done
chrisdone at googlemail.com
Mon Nov 15 12:48:15 EST 2010
On 13 November 2010 16:46, Neil Mitchell <ndmitchell at gmail.com> wrote:
>> I've been working on a project that requires me to do screen scraping.
>
> If you are screen scraping HTML I think tagsoup is a very good choice.
> The use of tagsoup means that you have a real HTML 5 compliant parser
> underneath, and then you can use whatever technique you wish to split
> up the page text - and regular expressions/parsec might be a
> reasonable choice. I've written lots of screen scraping stuff with
> tagsoup, and it's usually very easy - the manual even walks you
> through a couple of examples:
> http://community.haskell.org/~ndm/darcs/tagsoup/tagsoup.htm
Agreed, the tagsoup library just works. I've used it plenty of times
for my scraping needs. For example, scraping from paste sites:

https://github.com/chrisdone/amelie/blob/master/src/Amelie/Import.hs#L84
https://github.com/chrisdone/hpaste-feed/blob/master/main.hs#L65

You can always regex match on what tagsoup gives you, too.
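
A rough sketch of that last point, in case it helps. This is not taken from
either project linked above. It assumes the tagsoup and regex-tdfa packages
are installed, and the helper name pasteIds, the sample HTML and the digit
pattern are all made up for illustration:

```haskell
import Text.HTML.TagSoup (parseTags, isTagText, fromTagText)
import Text.Regex.TDFA ((=~))

-- Hypothetical helper: let tagsoup split the page into tags, then run a
-- regex over each text node and keep the first run of digits found in it.
pasteIds :: String -> [String]
pasteIds html =
  [ match
  | tag <- parseTags html
  , isTagText tag                                   -- skip open/close tags
  , let match = fromTagText tag =~ "[0-9]+" :: String
  , not (null match)                                -- "" means no match
  ]

main :: IO ()
main = print (pasteIds "<ul><li>paste 123</li><li>no id</li><li>paste 4567</li></ul>")
```

The idea is that tagsoup deals with the (possibly malformed) markup, so the
regex only ever sees plain text and never has to cope with tags itself.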