[Haskell-cafe] Lazy HTML parsing with HXT, HaXML/polyparse, what else?
Jules Bean
jules at jellybean.co.uk
Fri May 11 08:52:33 EDT 2007
Henning Thielemann wrote:
> I want to parse and process HTML lazily. I use HXT because the HTML parser
> is very liberal. However it uses Parsec and is thus strict. HaXML has a
> so called lazy parser, but it is not what I consider lazy:
>
> *Text.XML.HaXml.Html.ParseLazy> Text.XML.HaXml.Pretty.document $ htmlParse "text" $ "<html><head></head><body>"++undefined++"</body></html>"
> *** Exception: Prelude.undefined
> *Text.XML.HaXml.Html.ParseLazy> Text.XML.HaXml.Pretty.document $ htmlParse "text" $ "<html><head></head><body>&</body></html>"
> *** Exception: Expected "</" but found &
> at file text at line 1 col 26
>
> If it were lazy, it would return some HTML code before the error.
>
Are you sure that it is the parser that is not lazy, and not the pretty
printer that is overly strict?
From the evidence above, the parser could be returning some results
before the error, with the pretty printer strictly slurping up everything
as far as the error and then dying.
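
To illustrate the distinction, here is a minimal sketch using a toy
tokenizer rather than HaXml itself. All the names in it (lazyTokens,
lazyRender, strictRender, firstChars) are made up for the example and are
not part of HaXml or any other library. It only shows that one and the
same lazy producer can look strict when the consumer forces everything
before returning anything.

```haskell
import Control.Exception (SomeException, evaluate, try)

-- Toy stand-in for a lazy parser (hypothetical, not HaXml): it yields
-- tags and text chunks one at a time, and fails on a bare '&' only when
-- evaluation actually reaches it.
lazyTokens :: String -> [String]
lazyTokens [] = []
lazyTokens ('&' : _) = error "Expected \"</\" but found &"
lazyTokens ('<' : cs) =
  let (tag, rest) = break (== '>') cs
  in ('<' : tag ++ ">") : lazyTokens (drop 1 rest)
lazyTokens cs =
  let (txt, rest) = break (`elem` "<&") cs
  in txt : lazyTokens rest

-- A lazy consumer: output is available as soon as each token is.
lazyRender :: [String] -> String
lazyRender = concat

-- An overly strict consumer: it walks the whole output before
-- returning any of it, so a late error hides the early results.
strictRender :: [String] -> String
strictRender ts =
  let out = concat ts
  in length out `seq` out

-- Demand only the first n characters of a string; Nothing means an
-- exception was raised while producing them.
firstChars :: Int -> String -> IO (Maybe String)
firstChars n s = do
  r <- try (evaluate (length prefix `seq` prefix))
         :: IO (Either SomeException String)
  pure (either (const Nothing) Just r)
  where
    prefix = take n s
```

With input = "<html><head></head><body>&</body></html>", evaluating
firstChars 12 (lazyRender (lazyTokens input)) gives Just "<html><head>",
while firstChars 12 (strictRender (lazyTokens input)) gives Nothing, even
though the tokenizer is identical in both cases. The analogous check
against the real parser would be to consume its result with something
other than the pretty printer.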
Jules