[Haskell-cafe] Re: Re: hxt memory useage

Neil Mitchell ndmitchell at gmail.com
Tue Jan 29 04:22:43 EST 2008


Hi Uwe,

> BTW: I've taken the tagsoup lib and written
> a small parser to build a tree out of the stream
> of tags. It's about 100 lines of code.
> This DOM parser does not need to read until
> the closing tag to build an element node,
> so it should be as lazy as possible.
> A first version for HTML
> already runs on my box,
> but it still needs a bit of testing.

Please send a patch with whatever you come up with, so others can make
use of it. I've already added Data.HTML.TagSoup.Tree to the latest darcs
version, which does as well as it can with tag matching, but is
entirely strict. Having a lazy version would be great.
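
To make the laziness point concrete, here is a minimal sketch of a tree
builder that emits an element node before it has seen the closing tag.
It is not Uwe's parser and does not use the tagsoup types: Tag, Tree,
forest and build are hypothetical names invented for this illustration,
and closing tags are not checked against the element they close.

```haskell
module Main where

-- Hypothetical stand-ins for illustration only; these are NOT the
-- types exported by the tagsoup library.
data Tag = Open String | Close String | Text String
  deriving (Eq, Show)

data Tree = Node String [Tree] | Leaf String
  deriving (Eq, Show)

-- Parse sibling trees up to the next closing tag (or end of input) and
-- return them together with the unconsumed tags. The let-bound pairs
-- are matched lazily, so each Node is returned before its children or
-- its closing tag have been looked at.
forest :: [Tag] -> ([Tree], [Tag])
forest []               = ([], [])
forest (Close _ : rest) = ([], rest)   -- assumption: any close tag ends the current element
forest (Text s : rest)  =
  let (siblings, rest') = forest rest
  in  (Leaf s : siblings, rest')
forest (Open name : rest) =
  let (children, afterChildren) = forest rest
      (siblings, rest')         = forest afterChildren
  in  (Node name children : siblings, rest')

-- Build a forest from a tag stream. A stray closing tag at the top
-- level ends the result early in this sketch.
build :: [Tag] -> [Tree]
build = fst . forest

main :: IO ()
main = do
  -- A complete, well-nested stream.
  print (build [Open "p", Text "hi", Open "b", Text "x", Close "b", Close "p"])
  -- A stream whose tail would fail if forced: the first text leaf is
  -- still reachable, because no closing tag has to be read first.
  let partial = build (Open "html" : Open "body" : Text "hi" : error "tail was forced")
  case partial of
    (Node outer (Node inner (Leaf s : _) : _) : _) ->
      putStrLn (outer ++ "/" ++ inner ++ "/" ++ s)
    _ -> putStrLn "unexpected shape"
```

The second case in main is the interesting one: a strict builder would
have to consume the whole input before returning the outer node, whereas
here only the three tags that are actually inspected get forced.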

I've been talking to the author of the Java TagSoup
(http://tagsoup.info), which does very clever processing of HTML to make
it as structured and normalised as possible. He said:

> The schema that describes HTML can be found at
> src/definitions/html.tssl in the source archive; I'll be glad to explain
> any obscurities in it.

There are also some slides on his website (at the bottom) which detail
the Java TagSoup approach to reconstructing HTML, and which have
obviously had a lot of thought put into them!

Thanks

Neil

