[Haskell-cafe] Get data from HTML pages

Mon Aug 31 12:22:33 EDT 2009

José Romildo Malaquias wrote:

> Currently the application has an option to indirectly import movie data
> from web pages. For that, the user should first access the page in a web
> browser. Then the user should copy the rendered text in the web browser
> into an import window in my application and click an "import" button. In
> response the application parses the given text and collects any relevant
> data it knows about, using regular expressions.
> 
> For instance, to get the director information from a movie in the
> AllCenter web site I use the following regular expression:
> 
>    ^Direção:\s+(.+)$
> 
> I want to modify this scheme in order to eliminate the need to copy the
> rendered text from a web browser. Instead my application should download
> and parse the HTML page directly.
> 
> Which libraries are available in Haskell that would make it easy to get
> content information from an HTML document, in the way described above?

To parse HTML documents, I've had success with TagSoup in the past. You
can take a look at the HTTP package to download the HTML from the
server. Both packages are available from Hackage.
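
As a rough illustration of how the two packages might fit together,
here is a minimal sketch, not a tested solution for your site. The
names fetchPage, renderedLines, fieldAfter and director are made up
for this example. It assumes the page is served over plain HTTP and
that the label and its value end up on the same line of the HTML
source, since innerText simply concatenates the text nodes and does
not insert line breaks for <br> or </p>. Character encoding is also
glossed over: the HTTP package returns the body as a String with one
Char per byte, so a UTF-8 page may need decoding before "Direção:"
will match.

```haskell
import Data.Char (isSpace)
import Data.List (dropWhileEnd, stripPrefix)
import Data.Maybe (listToMaybe, mapMaybe)
import Network.HTTP (getRequest, getResponseBody, simpleHTTP)
import Text.HTML.TagSoup (innerText, parseTags)

-- Download a page over plain HTTP (hypothetical helper; no HTTPS,
-- no redirect handling, no character-set decoding).
fetchPage :: String -> IO String
fetchPage url = simpleHTTP (getRequest url) >>= getResponseBody

-- Strip leading and trailing whitespace.
trim :: String -> String
trim = dropWhileEnd isSpace . dropWhile isSpace

-- Drop all markup and split the remaining text into trimmed lines.
-- This only approximates what a browser would render.
renderedLines :: String -> [String]
renderedLines = map trim . lines . innerText . parseTags

-- If the line starts with the label, return the non-empty text after
-- it. Roughly the job of the regex ^Direção:\s+(.+)$ from the question.
fieldAfter :: String -> String -> Maybe String
fieldAfter label line = do
  rest <- stripPrefix label line
  let value = trim rest
  if null value then Nothing else Just value

-- First "Direção:" field found in the document, if any.
director :: String -> Maybe String
director = listToMaybe . mapMaybe (fieldAfter "Direção:") . renderedLines
```

You would then use it as fmap director (fetchPage url), with url being
the address of the movie page. If the text-only approach turns out to
be too fragile, TagSoup also lets you work on the tag list itself
instead of flattening it with innerText.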

HTH, Jochem

-- 
Jochem Berndsen | jochem at functor.nl