[Haskell-cafe] Is HXT a good tool for parsing web pages?

Wed Apr 28 05:00:41 EDT 2010

Hi John and Malcolm,

> I know that the HaXml library has a separate error-correcting HTML
> parser that works around most of the common non-well-formedness bugs
> in HTML:
>      Text.XML.HaXml.Html.Parse
>
> I believe HXT has a similar parser:
>      Text.XML.HXT.Parser.HtmlParsec
>
> Indeed, some of the similarities suggest this parser was originally
> lifted directly out of HaXml (as permitted by HaXml's licence),
> although the two modules have now diverged significantly.

The HTML parser in HXT is based on tagsoup, which it uses as a scanner.
It's a lazy parser (it does not use parsec) and it tries to parse
everything as HTML. But garbage in, garbage out: there is no attempt to
repair illegal HTML, as e.g. the Tidy parsers do.

The table-driven approach for inserting missing closing tags is indeed taken
from HaXml. Malcolm, I hope you don't have a patent on this algorithm.
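
To give a rough idea of what "table-driven" means here, below is a minimal
sketch. It is not the HXT or HaXml code: the Token type, the closesTable
entries and the function insertClosing are all made up for illustration,
and the real tables are much larger. It only assumes a scanner that
delivers a flat list of open tags, close tags and text, and it uses
nothing outside base.

```haskell
import Data.Maybe (fromMaybe)

-- Toy token type standing in for the scanner output.
-- Hypothetical: these are not the tagsoup or HXT types.
data Token
  = Open String
  | Close String
  | Text String
  deriving (Eq, Show)

-- Hypothetical table: an opening tag (left) implicitly closes any of the
-- listed elements (right) if they are the innermost open elements.
closesTable :: [(String, [String])]
closesTable =
  [ ("li", ["li"])
  , ("p",  ["p"])
  , ("tr", ["tr", "td", "th"])
  , ("td", ["td", "th"])
  ]

-- Elements that an opening tag with the given name closes implicitly.
closedBy :: String -> [String]
closedBy name = fromMaybe [] (lookup name closesTable)

-- Insert missing closing tags. The stack holds the names of the currently
-- open elements, innermost first. Output is produced incrementally, so the
-- function also works on a lazily generated token list.
insertClosing :: [Token] -> [Token]
insertClosing = go []
  where
    -- End of input: close everything that is still open.
    go stack [] = map Close stack
    go stack (t@(Text _) : ts) = t : go stack ts
    -- Opening tag: first close the innermost elements the table names.
    go stack (Open n : ts) =
      let (toClose, rest) = span (`elem` closedBy n) stack
      in  map Close toClose ++ Open n : go (n : rest) ts
    -- Closing tag: close the elements nested inside it first;
    -- a closing tag without a matching open element is dropped.
    go stack (Close n : ts) =
      case break (== n) stack of
        (inner, _ : rest) -> map Close inner ++ Close n : go rest ts
        (_, [])           -> go stack ts
```

For example, for the token list

    [Open "ul", Open "li", Text "a", Open "li", Text "b", Close "ul"]

insertClosing adds a Close "li" before the second Open "li" and another
one before Close "ul". All knowledge about HTML sits in the table; the
traversal itself knows nothing about specific element names.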

Regards,

   Uwe