[Haskell-cafe] Is XHT a good tool for parsing web pages?

Wed Apr 28 07:27:26 EDT 2010

Hi Ivan,

> Uwe Schmidt <uwe at fh-wedel.de> writes:
> > The HTML parser in HXT is based on tagsoup. It's a lazy parser
> > (it does not use parsec) and it tries to parse everything as HTML.
> > But garbage in, garbage out, there is no approach to repair illegal HTML
> > as e.g. the Tidy parsers do. The parser uses tagsoup as a scanner.
>
> So what is parsec used for in HXT then?

It is used for the XML parser, which also handles DTDs. That parser accepts
only well-formed XML; anything else is an error, not just a warning as with
the HTML parser. tagsoup and the HTML parser do not deal with DTDs, so they
cannot be used as a full (validating) XML parser.
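
To make the difference concrete, here is a rough sketch of the two
behaviours using only base. It is not HXT's code and calls no HXT or
tagsoup function; Token, scan, strict and lenient are made-up names, and
the scanner ignores attributes, comments and entities. The strict check
rejects badly nested input with an error, the lenient one returns the
tokens as they came in together with a warning.

```haskell
-- Toy illustration only: hypothetical names, not part of HXT or tagsoup.
data Token = Open String | Close String | Text String
  deriving (Eq, Show)

-- Split the input into tags and text; never fails, never repairs anything.
scan :: String -> [Token]
scan []             = []
scan ('<':'/':rest) = let (name, rest') = break (== '>') rest
                      in Close name : scan (drop 1 rest')
scan ('<':rest)     = let (name, rest') = break (== '>') rest
                      in Open name : scan (drop 1 rest')
scan s              = let (txt, rest) = break (== '<') s
                      in Text txt : scan rest

-- XML-style behaviour: the first nesting problem is an error.
strict :: [Token] -> Either String [Token]
strict toks = go [] toks
  where
    go []    []               = Right toks
    go (o:_) []               = Left ("unclosed element <" ++ o ++ ">")
    go stack (Open n : rest)  = go (n : stack) rest
    go stack (Text _ : rest)  = go stack rest
    go (o:stack) (Close n : rest)
      | o == n                = go stack rest
      | otherwise             = Left ("expected </" ++ o ++ "> but found </" ++ n ++ ">")
    go []    (Close n : _)    = Left ("unexpected </" ++ n ++ ">")

-- HTML-style behaviour: report a warning, hand back the tokens unchanged
-- (garbage in, garbage out).
lenient :: [Token] -> ([String], [Token])
lenient toks = (either (\e -> [e]) (const []) (strict toks), toks)

main :: IO ()
main = do
  let bad = scan "<p>unclosed <b>bold</p>"
  print (lenient bad)  -- one warning, tokens kept as scanned
  print (strict bad)   -- Left with the nesting error
```

Neither function looks at a DTD, so this covers only well-formedness,
not validation.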

Regards,

   Uwe