[Haskell-cafe] Get data from HTML pages
Jochem Berndsen
jochem at functor.nl
Mon Aug 31 12:22:33 EDT 2009
José Romildo Malaquias wrote:
> Currently the application has an option to indirectly import movie data
> from web pages. For that, the user first opens the page in a web
> browser, then copies the rendered text from the browser
> into an import window in my application and clicks an "import" button. In
> response the application parses the given text and collects any relevant
> data it knows about, using regular expressions.
>
> For instance, to get the director information from a movie in the
> AllCenter web site I use the following regular expression:
>
> ^Direção:\s+(.+)$
>
> I want to modify this scheme in order to eliminate the need to copy the
> rendered text from a web browser. Instead my application should download
> and parse the HTML page directly.
>
> Which libraries are available in Haskell that would make it easy to get
> content information from an HTML document, in the way described above?
To parse HTML documents, I've had success with TagSoup in the past. You
can take a look at the HTTP package to download the HTML from the
server. Both packages are available from Hackage.
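
A minimal sketch of how the two packages might fit together, assuming
the label and its value sit in the same or in adjacent text nodes of
the page (the real markup of the site may well differ). The names
fetchPage, textNodes, fieldAfter and importDirector are made up for
this example; only simpleHTTP, getRequest, getResponseBody and
parseTags come from the libraries. It replaces the regular expression
with a plain prefix match on the label, so it only approximates the
original pattern:

```haskell
import Data.Char (isSpace)
import Data.List (dropWhileEnd, stripPrefix)
import Network.HTTP (getRequest, getResponseBody, simpleHTTP)
import Text.HTML.TagSoup (Tag (TagText), parseTags)

-- | Download a page over plain HTTP and return its body as a String.
fetchPage :: String -> IO String
fetchPage url = simpleHTTP (getRequest url) >>= getResponseBody

-- | Remove leading and trailing whitespace.
trim :: String -> String
trim = dropWhileEnd isSpace . dropWhile isSpace

-- | All non-blank lines of text found in the document's text nodes,
-- trimmed, in document order. Tags themselves are discarded.
textNodes :: String -> [String]
textNodes html =
  [ t
  | TagText s <- parseTags html
  , l <- lines s
  , let t = trim l
  , not (null t)
  ]

-- | Find the first text line that starts with the given label.
-- If something follows the label on that line, return it; otherwise
-- return the next text line (e.g. when the label is inside <b>...</b>).
fieldAfter :: String -> [String] -> Maybe String
fieldAfter _ [] = Nothing
fieldAfter label (t : ts) =
  case stripPrefix label t of
    Just rest
      | not (null (trim rest)) -> Just (trim rest)
      | (next : _) <- ts -> Just next
    _ -> fieldAfter label ts

-- | Hypothetical helper: fetch a movie page and look up the director.
importDirector :: String -> IO (Maybe String)
importDirector url = do
  html <- fetchPage url
  return (fieldAfter "Direção:" (textNodes html))
```

For example, fieldAfter "Direção:" (textNodes html) could be applied to
a page already held in memory, or importDirector to a URL of the movie
page. Two caveats: the HTTP package only speaks plain http://, and with
String bodies it does not decode the page's character set for you, so a
non-ASCII label such as "Direção:" may only match after the body has
been decoded appropriately.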
HTH, Jochem
--
Jochem Berndsen | jochem at functor.nl