[Haskell-cafe] hxt memory useage
ndmitchell at gmail.com
Fri Jan 25 08:25:35 EST 2008
One of the problems with XML parsing is nesting. Consider this fragment:
<foo>lots of text</foo>
A tree-building parser will naturally want to read all the way to the
closing </foo> to check that the document is well formed before it
puts the element in a tree. That means keeping "lots of text" in
memory, and for the root element that is often the entire document.
TagSoup works in a lazy streaming manner, so it would parse the above as:
[TagOpen "foo" [], TagText "lots of text", TagClose "foo"]
i.e. it hasn't matched the opening and closing foo tags, and it can
return the TagOpen before even looking at the text.
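To make the laziness concrete, here is a minimal sketch of a streaming
tokeniser. It is not TagSoup's implementation: the Tag type and the
tokens function are toy names made up for this example, attributes are
ignored, and no attempt is made to handle malformed input sensibly. It
only illustrates that each tag can be emitted before the rest of the
input is examined.

```haskell
-- Toy tag type, loosely modelled on TagSoup's (hypothetical, no attributes).
data Tag
  = TagOpen String
  | TagClose String
  | TagText String
  deriving (Eq, Show)

-- Lazily tokenise a string. Each tag is produced as soon as its own
-- characters have been read; nothing checks that open and close tags match.
tokens :: String -> [Tag]
tokens [] = []
tokens ('<' : '/' : rest) =
  let (name, after) = break (== '>') rest
  in TagClose name : tokens (drop 1 after)
tokens ('<' : rest) =
  let (name, after) = break (== '>') rest
  in TagOpen name : tokens (drop 1 after)
tokens s =
  let (text, after) = break (== '<') s
  in TagText text : tokens after
```

With this sketch, head (tokens ("<foo>" ++ undefined)) evaluates to
TagOpen "foo" without ever touching the undefined remainder, which is
the property a tree-building parser cannot offer.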
> XML parsing is still slow, typically
> consuming 90% of the CPU time, but at least it works without blowing
> the heap.
I'd love TagSoup to go faster, while retaining its laziness. Basic
profiling doesn't suggest anything obvious, but I may have missed
something. It's more likely that a real speedup would require prodding
at the GHC Core level, or moving to support both (lazy) ByteString and
[Char].