[Haskell-cafe] Re: hxt memory useage

Mon Jan 28 08:13:17 EST 2008

Rene de Visser wrote:

> "Matthew Pocock" <matthew.pocock at ncl.ac.uk> wrote in message
> news:200801241917.33281.matthew.pocock at ncl.ac.uk...
> > On Thursday 24 January 2008, Albert Y. C. Lai wrote:
> >> Matthew Pocock wrote:
> >> > I've been using hxt to process xml files. Now that my files are getting a
> >> > bit bigger (30m) I'm finding that hxt uses inordinate amounts of memory.
> >> > I have 8g on my box, and it's running out. As far as I can tell, this
> >> > memory is getting used up while parsing the text, rather than in any
> >> > down-stream processing by xpickle.
> >> >
> >> > Is this a known issue?
> >>
> >> Yes, hxt calls parsec, which is not incremental.
> >>
> >> haxml offers the choice of non-incremental parsers and incremental
> >> parsers. The incremental parsers offer finer control (and therefore also
> >> require finer control).
> >
> > I've got a load of code using xpickle, which taken together are quite an
> > investment in hxt. Moving to haxml may not be very practical, as I'll have to
> > find some equivalent of xpickle for haxml and port thousands of lines of code
> > over. Is there likely to be a low-cost solution to convincing hxt to be
> > incremental that would get me out of this mess?
> >
> > Matthew
> 
> I don't think so. Even if you replace parsec, HXT is itself not incremental.
> (It stores the whole XML document in memory as a tree, and the tree is not
> memory efficient.)

This statement isn't true in general. HXT itself can be incremental if there
is no need to traverse the whole XML tree. When processing a document that
contains a DTD, however, the tree does have to be traversed, even when no
validation is required, because of entity substitution.
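
To illustrate what "incremental" means here, a toy sketch follows. It is not
HXT code and not a conforming XML parser: the Token type, tokens and
firstTitle are hypothetical names made up for this example, and comments,
CDATA, entities and attribute values containing '>' are ignored. It only
shows that a lazily produced token list lets a consumer stop early without
forcing the rest of the input.

```haskell
module Main where

-- Toy token type, invented for this sketch (not HXT's DOM).
data Token = Tag String | Text String
  deriving (Eq, Show)

-- Tokens are produced on demand: each token needs only the input up to
-- its own end, so a consumer that stops early never forces the rest.
tokens :: String -> [Token]
tokens [] = []
tokens ('<' : rest) =
  let (name, rest') = break (== '>') rest
  in Tag name : tokens (drop 1 rest')
tokens s =
  let (txt, rest) = break (== '<') s
  in Text txt : tokens rest

-- Hypothetical helper: the text right after the first <title> tag.
firstTitle :: String -> Maybe String
firstTitle input =
  case dropWhile (/= Tag "title") (tokens input) of
    (_ : Text t : _) -> Just t
    _                -> Nothing

main :: IO ()
main = do
  -- The input is infinite, yet an answer arrives because only a
  -- prefix of the token list is ever inspected.
  let doc = "<doc><title>hxt</title>" ++ cycle "<item>x</item>"
  print (firstTitle doc)
```

A step that has to visit every node, such as entity substitution, would force
the whole list and lose this benefit.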

Technically it's not a big deal to write a very simple lazy parser, or to
take the tagsoup or haxml lazy parsers and adapt them to the hxt DOM structure.
Combining the parser with the ByteString lib raises one small problem, the
handling of Unicode chars: a lazy Word8 to Unicode (Char) conversion is needed,
but that's already in HXT (thanks to Henning Thielemann).
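
For readers wondering what such a lazy Word8 to Char conversion involves, here
is a minimal sketch, assuming UTF-8 input held in a lazy ByteString. It is not
the decoder that ships with HXT; decodeUtf8Lazy is a name invented for this
example, and the error handling is simplified (malformed sequences become
U+FFFD, while overlong forms and surrogates are not rejected).

```haskell
module Main where

import qualified Data.ByteString.Lazy as BL
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- Decode UTF-8 lazily: Chars are produced as input bytes are consumed,
-- so the whole input never has to be in memory at once.
decodeUtf8Lazy :: BL.ByteString -> String
decodeUtf8Lazy = go . BL.unpack
  where
    go [] = []
    go (b : bs)
      | b < 0x80           = chr (fromIntegral b) : go bs
      | b .&. 0xE0 == 0xC0 = multi 1 (b .&. 0x1F) bs
      | b .&. 0xF0 == 0xE0 = multi 2 (b .&. 0x0F) bs
      | b .&. 0xF8 == 0xF0 = multi 3 (b .&. 0x07) bs
      | otherwise          = '\xFFFD' : go bs

    -- Read n continuation bytes after a lead byte; on failure emit
    -- U+FFFD and resume right after the lead byte.
    multi :: Int -> Word8 -> [Word8] -> String
    multi n lead bs =
      let (cont, rest) = splitAt n bs
      in if length cont == n && all isCont cont
           then toChar (foldl step (fromIntegral lead) cont) : go rest
           else '\xFFFD' : go bs

    isCont w = w .&. 0xC0 == 0x80

    step :: Int -> Word8 -> Int
    step acc w = (acc `shiftL` 6) .|. fromIntegral (w .&. 0x3F)

    -- Guard against code points beyond the Unicode range.
    toChar cp
      | cp > 0x10FFFF = '\xFFFD'
      | otherwise     = chr cp

main :: IO ()
main = do
  -- Bytes for "h", U+00E9 (C3 A9) and U+20AC (E2 82 AC).
  let bytes = BL.pack [0x68, 0xC3, 0xA9, 0xE2, 0x82, 0xAC]
  print (map fromEnum (decodeUtf8Lazy bytes))  -- prints [104,233,8364]
```

Because BL.unpack yields bytes chunk by chunk, the resulting String can be fed
straight into a lazy parser.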

So the problem is not a technical one, it's just a matter of time and resources.
If someone has such a lightweight lazy xml parser, I will help to integrate it into
hxt.

  Uwe