[Haskell-cafe] Lazy HTML parsing with HXT, HaXML/polyparse, what else?

Jules Bean jules at jellybean.co.uk
Fri May 11 08:52:33 EDT 2007


Henning Thielemann wrote:
> I want to parse and process HTML lazily. I use HXT because the HTML parser
> is very liberal. However, it uses Parsec and is thus strict. HaXML has a
> so-called lazy parser, but it is not what I consider lazy:
>
> *Text.XML.HaXml.Html.ParseLazy> Text.XML.HaXml.Pretty.document $ htmlParse "text" $ "<html><head></head><body>"++undefined++"</body></html>"
> *** Exception: Prelude.undefined
> *Text.XML.HaXml.Html.ParseLazy> Text.XML.HaXml.Pretty.document $ htmlParse "text" $ "<html><head></head><body>&</body></html>"
> *** Exception: Expected "</" but found &
>   at file text  at line 1 col 26
>
> If it were lazy, it would return some HTML code before the error.
>

Are you sure that it is the parser that is not lazy, rather than the 
pretty printer being overly strict?

From the evidence above, the parser could be returning some results 
before the error, with the pretty printer strictly consuming everything 
up to the error and then dying.
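
To illustrate the distinction, here is a minimal sketch. The toy `tokens` 
below is a hypothetical stand-in for a lazy parser (it is just `words` 
from the Prelude, not HaXml's API), fed an input that ends in `undefined`. 
A consumer that demands only a prefix succeeds, while a consumer that 
forces the whole result dies on the very same value.

```haskell
import Control.Exception (SomeException, evaluate, try)

-- Input whose tail is undefined, much like the example in the question.
input :: String
input = "<html> <head> </head> <body> " ++ undefined

-- Hypothetical stand-in for a lazy parser: `words` yields each token
-- as soon as it has seen the whitespace that ends it.
tokens :: [String]
tokens = words input

main :: IO ()
main = do
  -- Lazy consumer: demands only the first three tokens, so the
  -- undefined tail is never touched.
  print (take 3 tokens)          -- prints ["<html>","<head>","</head>"]

  -- Strict consumer: `length` walks the whole list and so runs into
  -- the undefined tail, even though the producer is lazy.
  r <- try (evaluate (length tokens)) :: IO (Either SomeException Int)
  putStrLn (either (const "strict consumer hit the error") show r)
```

If the pretty printer behaves like the second consumer, the exception by 
itself says nothing about the laziness of the parser; inspecting only a 
prefix of the parse result would be the more telling test.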

Jules


