[Haskell-cafe] Lazy HTML parsing with HXT, HaXML/polyparse, what else?

Neil Mitchell ndmitchell at gmail.com
Fri May 11 08:32:15 EDT 2007


Hi

Depending on exactly what you want, TagSoup may be of interest to you.
It is lazy, but it doesn't return a tree. It is very tolerant of
errors, and will simply never "fail to parse" something.

http://www-users.cs.york.ac.uk/~ndm/tagsoup/
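
To illustrate what "lazy, but no tree" means in practice, here is a minimal
sketch using only the Prelude. It is not the TagSoup API: Token and tokens
are hypothetical names made up for this example, and attributes are simply
left unparsed inside the tag name.

```haskell
-- Hypothetical illustration only; these are not TagSoup's real types.
data Token
  = Open String   -- "<name ...>", attributes kept unparsed in the string
  | Close String  -- "</name>"
  | Text String   -- character data between tags
  deriving (Eq, Show)

-- Produces a flat token stream. Every input yields some token list, so
-- malformed markup (a stray '&', a missing '>') never causes a parse error.
tokens :: String -> [Token]
tokens [] = []
tokens ('<' : '/' : rest) =
  let (name, after) = break (== '>') rest
  in Close name : tokens (drop 1 after)
tokens ('<' : rest) =
  let (name, after) = break (== '>') rest
  in Open name : tokens (drop 1 after)
tokens s =
  let (txt, after) = break (== '<') s
  in Text txt : tokens after
```

Each token is emitted as soon as its closing '>' has been seen, so
take 4 (tokens ("<html><head></head><body>" ++ undefined)) returns the first
four tokens without touching the undefined tail.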

Thanks

Neil

On 5/11/07, Henning Thielemann <lemming at henning-thielemann.de> wrote:
>
> I want to parse and process HTML lazily. I use HXT because the HTML parser
> is very liberal. However, it uses Parsec and is thus strict. HaXML has a
> so-called lazy parser, but it is not what I consider lazy:
>
> *Text.XML.HaXml.Html.ParseLazy> Text.XML.HaXml.Pretty.document $ htmlParse "text" $ "<html><head></head><body>"++undefined++"</body></html>"
> *** Exception: Prelude.undefined
> *Text.XML.HaXml.Html.ParseLazy> Text.XML.HaXml.Pretty.document $ htmlParse "text" $ "<html><head></head><body>&</body></html>"
> *** Exception: Expected "</" but found &
>   at file text  at line 1 col 26
>
> If it were lazy, it would return some HTML code before the error.
> HaXML uses the Polyparse package for parsing, which contains a so-called
> lazy parser. However, it has return type (Either String a). That is, to
> decide whether the parse was successful, the document has to be
> parsed completely.
>
> *Text.ParserCombinators.PolyLazy> runParser (exactly 4 (satisfy Char.isAlpha)) ("abc104"++undefined)
> ("*** Exception: Parse.satisfy: failed
>
> If it had return type (String, a), it could return both a partial
> value of type 'a' and the error message as a String. It would be even better
> if it had some handling for incorrect input texts and returned ([String],
> a), where [String] is the type of a list of warnings and error messages
> and 'a' is the type of a total value of parse output.
>
> Is there some parser of this type? Unfortunately
>  http://www.haskell.org/haskellwiki/Applications_and_libraries/Compiler_tools
>    does not compare the laziness of the mentioned parsers.
>
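
The difference between the two return types discussed in the quoted message,
(Either String a) and ([String], a), can be sketched with a toy digit parser.
This is only an illustration under simplifying assumptions: digitsEither and
digitsLenient are hypothetical names, and neither function comes from
Polyparse or any other library.

```haskell
import Data.Char (digitToInt, isDigit)

-- Either-style: whether the result is Left or Right is known only after
-- the whole input has been inspected, so no partial output is available.
digitsEither :: String -> Either String [Int]
digitsEither [] = Right []
digitsEither (c : cs)
  | isDigit c = fmap (digitToInt c :) (digitsEither cs)
  | otherwise = Left ("unexpected " ++ show c)

-- Pair-style: bad characters are skipped and reported as warnings, and
-- both components of the result are produced incrementally.
digitsLenient :: String -> ([String], [Int])
digitsLenient [] = ([], [])
digitsLenient (c : cs)
  | isDigit c = (warns, digitToInt c : vals)
  | otherwise = (("skipped " ++ show c) : warns, vals)
  where
    -- A where-bound pattern is matched lazily, so the recursive call is
    -- not forced until a consumer asks for more warnings or values.
    (warns, vals) = digitsLenient cs
```

With the pair-style version,
take 3 (snd (digitsLenient ("123" ++ undefined))) returns the three parsed
digits, whereas digitsEither on the same input has to reach the undefined
tail before it can decide between Left and Right.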

