[Haskell-cafe] Web processing

Jeremy Shaw jeremy at n-heptane.com
Sat Aug 2 21:47:04 EDT 2008


Hello,

I would recommend using TagSoup:

http://www-users.cs.york.ac.uk/~ndm/tagsoup/

The tutorial is easy, and has good advice:

http://www.cs.york.ac.uk/fp/darcs/tagsoup/tagsoup.htm

I would not bother trying to use a real XML parser, because I suspect
that many of the XHTML pages you want to parse are not actually valid
XHTML, which means the XML parsers will fail. Also, some of the sites
you are interested in might not be XHTML at all. So, using TagSoup for
everything seems simplest.

The process is very lo-fi. Write some code using TagSoup which scrapes
the data you care about from the web pages and turns it into Haskell
data structures. This code should not be clever, and it will need to
be updated whenever the site you are scraping changes enough to break
your code.
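
As a rough sketch of that process (not a definitive implementation),
the following turns every link on a page into a small Haskell record.
It assumes the tagsoup package is installed and uses its parseTags,
sections, isTagOpenName, isTagCloseName, fromAttrib and innerText
functions. The Link type, the extractLinks function and the sample HTML
are hypothetical names made up for illustration.

```haskell
import Text.HTML.TagSoup

-- Hypothetical record for the data we care about.
data Link = Link
  { linkHref :: String  -- value of the href attribute ("" if absent)
  , linkText :: String  -- visible text of the link, whitespace-normalised
  } deriving (Eq, Show)

-- Deliberately simple: find each <a> tag, then take the text that
-- follows it up to the next </a> (or the end of input if it is missing).
extractLinks :: String -> [Link]
extractLinks html =
  [ Link (fromAttrib "href" open) (normalise (innerText body))
  | open : rest <- sections (isTagOpenName "a") (parseTags html)
  , let body = takeWhile (not . isTagCloseName "a") rest
  ]
  where
    normalise = unwords . words

main :: IO ()
main = mapM_ print (extractLinks sample)
  where
    -- Assumed sample input; note the second <a> is never closed.
    sample = "<p>See <a href=\"/a\">first</a> and <a href=\"/b\">second <b>page</b>"
```

When the site changes its markup, only extractLinks needs to be
touched; the rest of the program keeps working with Link values.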

This process should work fine if you are talking about scraping data
from some specific sites.

If you want to make a web crawler which automatically finds relevant
pages and scrapes the data, then that is a much bigger project. You
will still want to use something like TagSoup to do the initial
parsing, but extracting the data will be much trickier (though,
possibly worth billions of $$$ if done well).

j.

ps. I only have experience with TagSoup, so there may be other
libraries which are even better. The key feature of TagSoup is that it
allows you to process malformed, invalid HTML -- which is important if
you don't control the creation of the HTML you are parsing.
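
To illustrate the point about malformed input, here is a small sketch,
again assuming the tagsoup package. parseTags returns a flat list of
tags and does not reject unclosed elements, so a table with no closing
tags at all can still be read. The cellTexts function and its boundary
rule are my own illustrative choices, not part of the library.

```haskell
import Text.HTML.TagSoup

-- Collect the text of every <td> cell, even when closing tags are missing.
-- A cell is assumed to end at the next cell, row, or table boundary.
cellTexts :: String -> [String]
cellTexts html =
  [ unwords (words (innerText (takeWhile (not . isBoundary) rest)))
  | _ : rest <- sections (isTagOpenName "td") (parseTags html)
  ]
  where
    isBoundary t =
         isTagOpenName "td" t || isTagCloseName "td" t
      || isTagOpenName "tr" t || isTagCloseName "tr" t
      || isTagCloseName "table" t

main :: IO ()
main = print (cellTexts "<table><tr><td>Haskell<td>1990<tr><td>OCaml<td>1996")
```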

At Sat, 02 Aug 2008 22:10:36 -0300,
Rafael C. de Almeida wrote:
> 
> Hello,
> 
> I understand that nowadays there are several frameworks and wrapper
> libraries for making some sense of the XHTML documents you find over the
> web. That is, making life easier for those who want to process the
> semi-structured data you find on the sites.
> 
> I don't have much experience in that field myself, but I want to learn a
> little more about how I can, for instance, associate information from
> one site with information in another site, even though it is structured
> differently in both places. Does anyone know about libraries that would
> help me out with that sort of work? Hope I'm being clear.
> 
> []'s
> Rafael

