[Haskell-cafe] Is XHT a good tool for parsing web pages?
Uwe Schmidt
uwe at fh-wedel.de
Wed Apr 28 05:00:41 EDT 2010
Hi John and Malcolm,
> I know that the HaXml library has a separate error-correcting HTML
> parser that works around most of the common non-well-formedness bugs
> in HTML:
> Text.XML.HaXml.Html.Parse
>
> I believe HXT has a similar parser:
> Text.XML.HXT.Parser.HtmlParsec
>
> Indeed, some of the similarities suggest this parser was originally
> lifted directly out of HaXml (as permitted by HaXml's licence),
> although the two modules have now diverged significantly.
The HTML parser in HXT is based on tagsoup, which it uses as a scanner.
It's a lazy parser (it does not use parsec), and it tries to parse
everything as HTML. But garbage in, garbage out: it makes no attempt to
repair illegal HTML, as e.g. the Tidy parsers do.
The table-driven approach for inserting missing closing tags is indeed taken
from HaXml. Malcolm, I hope you don't have a patent on this algorithm.
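To illustrate the idea of table-driven insertion of missing closing tags, here
is a minimal sketch. It is not the actual HXT or HaXml code. The names Token,
closesTable, voidElements and balance are hypothetical, made up for this
example, and the table holds only a small illustrative subset of HTML
elements. The sketch assumes the input is already tokenized; in HXT that
scanning step is done by tagsoup.

```haskell
-- Hypothetical token type standing in for a scanner's output.
data Token = Open String | Close String | Text String
  deriving (Eq, Show)

-- Illustrative table (an assumption, not HXT's real table): an opening
-- tag (the key) implicitly closes the innermost open element if that
-- element is one of the listed names.
closesTable :: [(String, [String])]
closesTable =
  [ ("p",  ["p"])
  , ("li", ["li"])
  , ("td", ["td", "th"])
  , ("th", ["td", "th"])
  , ("tr", ["td", "th", "tr"])
  ]

-- Elements that never have a closing tag (illustrative subset).
voidElements :: [String]
voidElements = ["br", "hr", "img", "input", "meta", "link"]

-- Does opening `new` implicitly close the open element `open`?
closes :: String -> String -> Bool
closes new open = maybe False (open `elem`) (lookup new closesTable)

-- Insert missing closing tags. The stack holds the names of the
-- currently open elements, innermost first. Output is produced
-- incrementally, so the function works on lazily generated input.
balance :: [Token] -> [Token]
balance = go []
  where
    -- End of input: close everything that is still open.
    go stack [] = map Close stack
    go stack (Text s : ts) = Text s : go stack ts
    go stack (Open t : ts)
      -- The table says the innermost element must be closed first;
      -- emit its closing tag and look at the same opening tag again.
      | (top : rest) <- stack, t `closes` top =
          Close top : go rest (Open t : ts)
      -- Void elements are emitted but never pushed on the stack.
      | t `elem` voidElements = Open t : go stack ts
      | otherwise = Open t : go (t : stack) ts
    go stack (Close t : ts)
      -- Matching element is open: close everything nested inside it.
      | t `elem` stack =
          let (inner, rest) = break (== t) stack
          in map Close inner ++ Close t : go (drop 1 rest) ts
      -- Stray closing tag without an opening tag: drop it.
      | otherwise = go stack ts
```

For example, applying balance to the tokens of "<ul><li>a<li>b</ul>" inserts a
closing li before the second li and another one before the closing ul. Note
that this only balances tags; it does not otherwise repair invalid HTML.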
Regards,
Uwe