[Haskell-cafe] hxt memory useage

Fri Jan 25 08:25:35 EST 2008

Hi

One of the problems with XML parsing is nesting. Consider this fragment:

<foo>lots of text</foo>

A tree-building parser will naturally want to read all the way to the
closing </foo> to check that the document is well formed before it
puts the element in a tree. That means keeping "lots of text" in
memory, and often the entire document. TagSoup works in a lazy
streaming manner, so it would parse the above as:

[TagOpen "foo" [], TagText "lots of text", TagClose "foo"]

i.e. it hasn't matched the opening and closing foo tags, and can
return the TagOpen before even looking at the text.
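
To make the laziness concrete, here is a rough sketch of a streaming
tokenizer. It is not TagSoup's actual code: the Tag type below is a
cut-down stand-in for the library's own, and the tokens function is a
hypothetical name that ignores attributes, entities and comments.

```haskell
-- Cut-down stand-in for TagSoup's Tag type (assumption: the real
-- type has more constructors than these three).
data Tag
  = TagOpen String [(String, String)]
  | TagText String
  | TagClose String
  deriving (Eq, Show)

-- Hypothetical lazy tokenizer. Each tag is emitted as soon as its own
-- characters have been read; nothing is matched against a closing tag.
tokens :: String -> [Tag]
tokens [] = []
tokens ('<' : '/' : rest) =
  let (name, rest') = break (== '>') rest
  in TagClose name : tokens (drop 1 rest')   -- drop the '>'
tokens ('<' : rest) =
  let (name, rest') = break (== '>') rest
  in TagOpen name [] : tokens (drop 1 rest') -- attributes not handled
tokens s =
  let (txt, rest) = break (== '<') s
  in TagText txt : tokens rest

-- The first tag is available even when the rest of the input can
-- never be read, because only "<foo>" is inspected to produce it.
firstTag :: Tag
firstTag = head (tokens ("<foo>" ++ undefined))
```

A tree-building parser could not do the equivalent of firstTag, since
it has to reach </foo> before it can return the foo node.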

> XML parsing is still slow, typically
> consuming 90% of the CPU time, but at least it works without blowing
> the heap.

I'd love TagSoup to go faster, while retaining its laziness. Basic
profiling doesn't suggest anything obvious, but I may have missed
something. More likely it would be necessary to prod at the Core
level, or to move to supporting both (Lazy)ByteString and [Char].

Thanks

Neil