[Haskell-cafe] No copy XML parser (rough idea only)

Fri May 14 15:20:35 EDT 2010

Hi,

On Friday, 14.05.2010 at 15:31 -0300, Felipe Lessa wrote:
> On Fri, May 14, 2010 at 08:57:42AM -0700, John Millikin wrote:
> > Additionally, since the original bytestring is shared in your types,
> > potentially very large buffers could be locked in memory due to
> > references held by only a small portion of the document. Chopping a
> > document up into events or nodes creates some overhead due to the
> > extra pointers, but allows unneeded portions to be freed.
> 
> However, if your bytestring comes from mmap'ed memory this
> drawback wouldn't apply :D.

Exactly. Of course such a library would not be a general-purpose tool,
but in cases where you know that you need most of the document most of
the time, e.g. when doing statistics on it, this would be acceptable.

Also note that even after chopping into nodes, if you don’t make sure
you drop the reference to the root in a timely manner, the same thing
would happen.
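
To illustrate the retention issue discussed above, here is a minimal
sketch using the bytestring package. The names sharedSlice and
detachedSlice are made up for this example. A slice taken with take/drop
still points into the original buffer, while Data.ByteString's copy
allocates a fresh buffer of just the needed size, so the large one can
be collected once nothing else refers to it.

```haskell
import qualified Data.ByteString.Char8 as BC

-- | A slice that shares memory with the original buffer: no copying,
-- but the whole original buffer stays alive as long as the slice does.
-- (Hypothetical helper name, for illustration only.)
sharedSlice :: Int -> Int -> BC.ByteString -> BC.ByteString
sharedSlice off len = BC.take len . BC.drop off

-- | The same slice, copied into its own buffer. This costs one copy of
-- the slice, but no longer references the original buffer.
-- (Hypothetical helper name, for illustration only.)
detachedSlice :: Int -> Int -> BC.ByteString -> BC.ByteString
detachedSlice off len = BC.copy . sharedSlice off len
```

Both functions return equal values; they differ only in which buffer the
result keeps alive. With mmap'ed input the shared variant is the cheap
one, as noted above.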

On Friday, 14.05.2010 at 08:57 -0700, John Millikin wrote:
> The primary problem I see with this is that XML content is
> fundamentally text, not bytes. Using your types, two XML documents
> with identical content but different encodings will have different
> Haskell values (and thus be incorrect regarding Eq, Ord, etc).

The instances could be adapted... but this will be expensive, of course.

One could also convert any document that is not UTF-8 encoded to UTF-8
as a whole and then work on the result.
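
A rough sketch of such an up-front conversion, going through the "text"
package's decoders. The SourceEncoding type and the toUtf8 function are
assumptions made for this example; detecting the encoding from the XML
declaration or a byte order mark is left out entirely.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text.Encoding as TE

-- | Hypothetical tag for the encoding a document is known to use.
data SourceEncoding = Utf8 | Latin1 | Utf16LE
  deriving (Eq, Show)

-- | Re-encode a whole document as UTF-8 before parsing it.
-- UTF-8 input is passed through untouched (and not validated here).
-- Note that decodeUtf16LE throws an exception on malformed input.
toUtf8 :: SourceEncoding -> B.ByteString -> B.ByteString
toUtf8 Utf8    = id
toUtf8 Latin1  = TE.encodeUtf8 . TE.decodeLatin1
toUtf8 Utf16LE = TE.encodeUtf8 . TE.decodeUtf16LE
```

After this step two documents with identical content but different
source encodings hold the same bytes, so byte-wise Eq and Ord agree
again, at the price of one full copy for non-UTF-8 input.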

> If you'd like memory-efficient text storage, using Bryan O'Sullivan's
> "text" package[1] is probably the best option. It uses packed Word16
> buffers to store text as UTF-16. Probably not as efficient as a type
> backed by UTF-8, but it's much much better than String.

Right. For arbtt, I tried to switch from String to text, and it actually
got slower. The reason (I think) was that besides passing strings
around, it mainly runs pcre-light on them – which wants utf8-encoded
bytestrings.

I ended up creating a newtype¹ around utf8-encoded ByteStrings and the
result was quite satisfying, both memory- and runtime-wise. I wish we
had a package providing a standard type for this that would become
similarly popular. There is at least one more package on hackage that
defines such a type:
http://hackage.haskell.org/packages/archive/regex-tdfa-utf8/1.0/doc/html/Text-Regex-TDFA-UTF8.html
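
For readers who have not seen this pattern, a minimal sketch of such a
newtype follows. It is not the code from arbtt's Data.MyText or from
regex-tdfa-utf8; the type name Utf8Text and all function names are
assumptions chosen for illustration, and the conversions go through the
"text" package for simplicity rather than speed.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- | Hypothetical wrapper: a strict ByteString that is expected to hold
-- valid UTF-8. Byte-wise Eq and Ord are fine here, because comparing
-- UTF-8 bytes gives the same order as comparing code points.
newtype Utf8Text = Utf8Text { utf8Bytes :: B.ByteString }
  deriving (Eq, Ord)

instance Show Utf8Text where
  show = show . toString

-- | Encode a String as UTF-8.
fromString :: String -> Utf8Text
fromString = Utf8Text . TE.encodeUtf8 . T.pack

-- | Wrap existing bytes, rejecting input that is not valid UTF-8.
fromUtf8 :: B.ByteString -> Maybe Utf8Text
fromUtf8 bs = case TE.decodeUtf8' bs of
  Left _  -> Nothing
  Right _ -> Just (Utf8Text bs)

-- | Decode back to a String.
toString :: Utf8Text -> String
toString = T.unpack . TE.decodeUtf8 . utf8Bytes

-- | Number of characters (not bytes); this has to decode, so it is O(n).
lengthChars :: Utf8Text -> Int
lengthChars = T.length . TE.decodeUtf8 . utf8Bytes
```

The point of the wrapper is that utf8Bytes can be handed directly to a
library that wants utf8-encoded bytestrings, such as pcre-light, without
any re-encoding on the way.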

Greetings,
Joachim

¹ http://darcs.nomeata.de/arbtt/src/Data/MyText.hs

-- 
Joachim Breitner
  e-Mail: mail at joachim-breitner.de
  Homepage: http://www.joachim-breitner.de
  ICQ#: 74513189
  Jabber-ID: nomeata at joachim-breitner.de