[Haskell-cafe] Lazy HTML parsing with HXT, HaXML/polyparse, what else?

Henning Thielemann lemming at henning-thielemann.de
Fri May 11 08:24:55 EDT 2007


I want to parse and process HTML lazily. I use HXT because its HTML parser
is very liberal. However, it is built on Parsec and is thus strict. HaXML
has a so-called lazy parser, but it is not what I consider lazy:

*Text.XML.HaXml.Html.ParseLazy> Text.XML.HaXml.Pretty.document $ htmlParse "text" $ "<html><head></head><body>"++undefined++"</body></html>"
*** Exception: Prelude.undefined
*Text.XML.HaXml.Html.ParseLazy> Text.XML.HaXml.Pretty.document $ htmlParse "text" $ "<html><head></head><body>&</body></html>"
*** Exception: Expected "</" but found &
  at file text  at line 1 col 26

If it were lazy, it would return some HTML code before the error.
HaXML uses the polyparse package for parsing, which contains a so-called
lazy parser. However, its return type is (Either String a). That is, to
decide whether the result is Left or Right, the document has to be
parsed completely.

*Text.ParserCombinators.PolyLazy> runParser (exactly 4 (satisfy Char.isAlpha)) ("abc104"++undefined)
("*** Exception: Parse.satisfy: failed
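
The same effect can be reproduced without polyparse. The following is only a toy sketch: parseStrict and firstTwo are hypothetical names, not part of HaXML or polyparse. It assumes a parser that accepts a string only if every character is a letter. Because the result is an Either, the constructor cannot be chosen before the last character has been checked.

```haskell
import Data.Char (isAlpha)

-- Hypothetical toy parser (not a polyparse function): accept a string
-- only if every character is a letter. Right can be produced only after
-- the final character has been inspected.
parseStrict :: String -> Either String String
parseStrict [] = Right []
parseStrict (c:cs)
  | isAlpha c = fmap (c :) (parseStrict cs)
  | otherwise = Left ("unexpected " ++ show c)

-- Ask for just the first two parsed characters. Even so, the whole
-- input is consumed before Left or Right is known.
firstTwo :: String -> Either String String
firstTwo = fmap (take 2) . parseStrict
```

Evaluating firstTwo ("ab" ++ undefined) raises Prelude.undefined although only two characters were requested, which is the behaviour shown in the sessions above.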

If it had return type (String, a), it could return both a partial
value of type 'a' and the error message as a String. It would be even
better if it had some handling for incorrect input texts and returned
([String], a), where [String] is the type of a list of warnings and error
messages and 'a' is the type of a total value of parse output.
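
To make the wished-for type concrete, here is a minimal sketch of a parser with result type ([String], a). It is a toy under stated assumptions, not an existing library function: parseLazy is a hypothetical name, and the "grammar" is simply "keep letters, warn about everything else". The point is only that the pair is built from a lazy pattern binding, so the output is available incrementally.

```haskell
import Data.Char (isAlpha)

-- Hypothetical toy parser: keep letters and report every other character
-- as a warning instead of aborting. The pattern binding in the where
-- clause is lazy, so the head of the output list is available before the
-- rest of the input has been inspected.
parseLazy :: String -> ([String], String)
parseLazy [] = ([], [])
parseLazy (c:cs)
  | isAlpha c = (warnings, c : rest)
  | otherwise = (("skipped " ++ show c) : warnings, rest)
  where
    (warnings, rest) = parseLazy cs
```

With this definition, take 3 (snd (parseLazy ("abc" ++ undefined))) yields "abc" without touching the undefined tail, and parseLazy "ab1c" yields (["skipped '1'"], "abc"). Forcing the complete list of warnings still requires reading the whole input, which is unavoidable.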

Is there a parser of this type? Unfortunately,
 http://www.haskell.org/haskellwiki/Applications_and_libraries/Compiler_tools
   does not compare the laziness of the parsers it mentions.

