[Haskell-cafe] Lazy HTML parsing with HXT, HaXML/polyparse, what else?

Fri May 11 09:06:01 EDT 2007

On Fri, 11 May 2007, Jules Bean wrote:

> Henning Thielemann wrote:
> > I want to parse and process HTML lazily. I use HXT because its HTML parser
> > is very liberal. However, it uses Parsec and is thus strict. HaXML has a
> > so-called lazy parser, but it is not what I consider lazy:
> >
> > *Text.XML.HaXml.Html.ParseLazy> Text.XML.HaXml.Pretty.document $ htmlParse "text" $ "<html><head></head><body>"++undefined++"</body></html>"
> > *** Exception: Prelude.undefined
> > *Text.XML.HaXml.Html.ParseLazy> Text.XML.HaXml.Pretty.document $ htmlParse "text" $ "<html><head></head><body>&</body></html>"
> > *** Exception: Expected "</" but found &
> >   at file text  at line 1 col 26
> >
> > If it were lazy, it would return some HTML code before the error.
>
> Are you sure that it is the parser that is not lazy, and not the pretty
> printer that is overly strict?
>
> From the evidence above, the parser could be returning some results
> before the error, and the pretty printer strictly slurping it all up to
> the error and then dying.

I know, but the type of the Polyparse parser itself prohibits lazy parsing.
Unfortunately there is no Show instance for HaXML trees, so one cannot
easily see whether laziness is lost in the parser or in the pretty
printer.
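
To illustrate how a parser's type alone can rule out laziness, here is a
minimal sketch that uses neither Polyparse nor HaXML. It assumes, purely for
illustration, a strict parser whose result has the shape `Either String a`;
this is not a quotation of Polyparse's actual definition. The names
`parseEither`, `parseLazy` and `eitherDiverges` are hypothetical and exist
only in this example. The toy "grammar" accepts letters and fails on
anything else.

```haskell
import Control.Exception (SomeException, evaluate, try)
import Data.Char (isAlpha)

-- Strict by type: the caller cannot see any output until the parser has
-- chosen between Left and Right, and that choice needs the whole input.
parseEither :: String -> Either String String
parseEither [] = Right []
parseEither (c:cs)
  | isAlpha c = fmap (c :) (parseEither cs)
  | otherwise = Left ("unexpected " ++ show c)

-- Lazy by type: output is produced incrementally, and the error (if any)
-- is reported separately, after the output that preceded it.
parseLazy :: String -> (String, Maybe String)
parseLazy [] = ([], Nothing)
parseLazy (c:cs)
  | isAlpha c = let (rest, err) = parseLazy cs  -- lazy pattern binding
                in (c : rest, err)
  | otherwise = ([], Just ("unexpected " ++ show c))

-- Probe strictness without needing a Show instance for the result:
-- force the value to weak head normal form and report whether that throws.
eitherDiverges :: String -> IO Bool
eitherDiverges s = do
  r <- try (evaluate (parseEither s))
         :: IO (Either SomeException (Either String String))
  return (either (const True) (const False) r)
```

In GHCi, `take 3 (fst (parseLazy ("abc" ++ undefined)))` yields "abc",
whereas `eitherDiverges ("abc" ++ undefined)` returns True, because even the
outermost constructor of the Either result depends on the whole input. A
probe in the style of `eitherDiverges` needs no Show instance, so something
similar might help to test a parser separately from a pretty printer.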