[Haskell-cafe] tagsoup parser (was: hxt memory useage)

Uwe Schmidt uwe at fh-wedel.de
Tue Jan 29 06:54:32 EST 2008


Hi Neil,

> Please send a patch with whatever you come up with, so others can make use
> of it. I've already added Data.HTML.TagSoup.Tree to the latest darcs
> version, which does as well as it can with tag matching, but is
> entirely strict. Having a lazy version would be great.

It's too early for a new release: testing, especially
performance testing, is not yet done. But the first version
is in the darcs repository
"http://darcs.fh-wedel.de/hxt/"
(the version number is still 7.4).

Those who urgently need a lazier XML parser
may try that one.

Usage: call readDocument as usual, but with an extra option:
readDocument [..., (a_tagsoup, "1")]
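A minimal sketch of a complete program using this option, assuming
the HXT 7.4 API from the darcs repository, where readDocument takes
a list of (name, value) string pairs. The file name "example.html"
and the link extraction are only illustrations, and a_parse_html is
an extra option assumed here for HTML input:

```haskell
module Main where

import Text.XML.HXT.Arrow

-- Read a document through the tagsoup-based parser and print
-- the href value of every <a> element.
-- "example.html" is a hypothetical input file.
main :: IO ()
main = do
  links <- runX $
    readDocument [ (a_parse_html, "1")  -- assumption: input is HTML
                 , (a_tagsoup,    "1")  -- the extra option described above
                 ] "example.html"
    >>> deep (isElem >>> hasName "a")
    >>> getAttrValue "href"
  mapM_ putStrLn links
```

Leaving out the (a_tagsoup, "1") pair should select the usual
parser again, so the two can be compared on the same input.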

> I've been talking to the Java tagsoup author (http://tagsoup.info),
> which does very clever processing of HTML to make it as structured and
> normalised as possible. He said:
> 
> > The schema that describes HTML can be found at
> > src/definitions/html.tssl in the source archive; I'll be glad to explain
> > any obscurities in it.
> 
> There are also some slides on his website (at the bottom) which detail
> the Java TagSoup approach to reconstructing HTML, and have obviously
> had a lot of thought put into them!

I will have a look at that. Currently the strategy to repair
lousy HTML is the same as in the parsec HTML parser,
and that's equivalent to what is done in HaXml.

Uwe

