[Haskell-cafe] tagsoup parser (was: hxt memory useage)

Tue Jan 29 06:54:32 EST 2008

Hi Neil,

> Please send a patch with whatever you come up with, so others can make use
> of it. I've already added Data.HTML.TagSoup.Tree to the latest darcs
> version, which does as well as it can with tag matching, but is
> entirely strict. Having a lazy version would be great.

It's too early for a new release: testing, especially performance
testing, is not yet done. But a first version is in the darcs repository
"http://darcs.fh-wedel.de/hxt/"
(the version number is still 7.4).

Those who urgently need a lazier XML parser
may try that one.

Usage: call readDocument as usual, but with an extra option:
readDocument [..., (a_tagsoup, "1")]
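A minimal sketch of such a call, assuming the HXT 7.4 API from the darcs
repository, where Text.XML.HXT.Arrow exports runX, readDocument and the
a_tagsoup option key. The file name "example.html" is only a placeholder,
and later HXT versions may configure the parser differently:

```haskell
module Main where

-- Assumption: HXT 7.4 from the darcs repository, where this module
-- exports runX, readDocument and the a_tagsoup option key.
import Text.XML.HXT.Arrow

-- Option list for readDocument; other options may be added as usual.
-- The pair (a_tagsoup, "1") switches on the tagsoup-based parser.
tagsoupOptions :: [(String, String)]
tagsoupOptions = [ (a_tagsoup, "1") ]

main :: IO ()
main = do
  -- "example.html" is a hypothetical input file.
  trees <- runX (readDocument tagsoupOptions "example.html")
  putStrLn (show (length trees) ++ " document tree(s) read")
```

All other readDocument options can be kept as before; only the extra
pair is needed to select the tagsoup parser.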

> I've been talking to the Java tagsoup author (http://tagsoup.info),
> which does very clever processing of HTML to make it as structured and
> normalised as possible. He said:
> 
> > The schema that describes HTML can be found at
> > src/definitions/html.tssl in the source archive; I'll be glad to explain
> > any obscurities in it.
> 
> There are also some slides on his website (at the bottom) which detail
> the Java TagSoup approach to reconstructing HTML, and have obviously
> had a lot of thought put into them!

I will have a look at that. Currently the strategy for repairing
lousy HTML is the same as in the Parsec-based HTML parser,
which is equivalent to what is done in HaXml.

Uwe