[Haskell-cafe] No copy XML parser (rough idea only)

Neil Mitchell ndmitchell at gmail.com
Thu May 20 13:50:23 EDT 2010


Hi Joachim,

I have been playing around with this idea myself in TagSoup
(http://community.haskell.org/~ndm/tagsoup). The largest conceptual
problem I came across was that TagSoup decodes entities (e.g. &gt;
becomes >). However, I think that's a minor issue, and entity
resolution can be turned off in TagSoup to guarantee copy-free
parsing.
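
For reference, here is the kind of entry point I have in mind, as a
minimal sketch assuming the current tagsoup API (parseTags works over
any StringLike type, including strict ByteString); entity handling
would still have to be routed through parseTagsOptions:

    import Data.ByteString (ByteString)
    import Text.HTML.TagSoup (Tag, parseTags)

    -- Parse a strict ByteString; the Tag payloads are ByteStrings too,
    -- so whether they share the input buffer depends on how the parser
    -- constructs them.
    demo :: ByteString -> [Tag ByteString]
    demo = parseTags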

The practical problem is that writing an HTML parser is hard, and that
writing it so it works across String, strict ByteString and lazy
ByteString, and gets the copy-free behaviour in the right places, is
tricky. I am still working on program optimisation techniques that I
think will make this feasible (based around supercompilation, to expose
the underlying structure of the parser without having to write it in
too painful a manner). So with any luck your copy-free XML parser will
happen sooner or later.

Thanks, Neil

On Fri, May 14, 2010 at 8:20 PM, Joachim Breitner
<mail at joachim-breitner.de> wrote:
> Hi,
>
> On Friday, 14.05.2010, 15:31 -0300, Felipe Lessa wrote:
>> On Fri, May 14, 2010 at 08:57:42AM -0700, John Millikin wrote:
>> > Additionally, since the original bytestring is shared in your types,
>> > potentially very large buffers could be locked in memory due to
>> > references held by only a small portion of the document. Chopping a
>> > document up into events or nodes creates some overhead due to the
>> > extra pointers, but allows unneeded portions to be freed.
>>
>> However, if your bytestring comes from mmap'ed memory this
>> drawback wouldn't apply :D.
>
> Exactly. Of course such a library would not be a general-purpose tool,
> but in cases where you know you need most of the document most of the
> time, e.g. when computing statistics over it, this would be acceptable.
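>
> Roughly like this (a sketch, assuming the mmap package's
> System.IO.MMap; the resulting ByteString points into the mapping
> rather than the Haskell heap, and the file name is made up):
>
>     import System.IO.MMap (mmapFileByteString)
>     import qualified Data.ByteString as BS
>
>     main :: IO ()
>     main = do
>       -- Map the whole file (Nothing = no offset/length restriction).
>       doc <- mmapFileByteString "large.xml" Nothing
>       print (BS.length doc)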
>
> Also note that even after chopping into nodes, if you don’t make sure
> you drop the reference to the root in a timely manner, the same thing
> would happen.
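>
> To make that concrete (a sketch using plain ByteString slicing; the
> same applies to any node type that wraps a slice of the original
> buffer):
>
>     import qualified Data.ByteString as BS
>
>     -- A slice shares the parent's buffer: keeping it alive keeps the
>     -- whole mapped/loaded document alive.
>     nodeText :: BS.ByteString -> BS.ByteString
>     nodeText = BS.take 16 . BS.drop 1024
>
>     -- BS.copy gives the slice its own buffer, so the parent (the
>     -- "root") can be garbage collected.
>     nodeTextDetached :: BS.ByteString -> BS.ByteString
>     nodeTextDetached = BS.copy . nodeText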
>
> On Friday, 14.05.2010, 08:57 -0700, John Millikin wrote:
>> The primary problem I see with this is that XML content is
>> fundamentally text, not bytes. Using your types, two XML documents
>> with identical content but different encodings will have different
>> Haskell values (and thus be incorrect regarding Eq, Ord, etc).
>
> The instances could be adapted... but this will be expensive, of course.
>
> One could also convert a document that is not UTF-8 encoded as a whole,
> up front, and then work on the converted copy.
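>
> Roughly (a sketch, assuming a Latin-1 input and a version of the text
> package that provides Data.Text.Encoding.decodeLatin1; other encodings
> would need something like text-icu):
>
>     import qualified Data.ByteString as BS
>     import qualified Data.Text.Encoding as TE
>
>     -- Re-encode a whole Latin-1 document as UTF-8 once, up front, and
>     -- run the parser on the converted copy.
>     toUtf8 :: BS.ByteString -> BS.ByteString
>     toUtf8 = TE.encodeUtf8 . TE.decodeLatin1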
>
>> If you'd like memory-efficient text storage, using Bryan O'Sullivan's
>> "text" package[1] is probably the best option. It uses packed Word16
>> buffers to store text as UTF-16. Probably not as efficient as a type
>> backed by UTF-8, but it's much much better than String.
>
> Right. For arbtt, I tried to switch from String to text, and it actually
> got slower. The reason (I think) was that besides passing strings
> around, it mainly runs pcre-light on them – which wants utf8-encoded
> bytestrings.
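>
> For example (a sketch using pcre-light's Text.Regex.PCRE.Light; the
> utf8 compile option tells PCRE to treat pattern and subject as UTF-8,
> and the pattern here is made up):
>
>     import qualified Data.ByteString.Char8 as BS
>     import Text.Regex.PCRE.Light (compile, match, utf8)
>
>     -- Pattern and subject are both UTF-8 encoded ByteStrings, so no
>     -- conversion is needed if that is what we store anyway.
>     matches :: BS.ByteString -> Maybe [BS.ByteString]
>     matches subject = match (compile (BS.pack "fo+") [utf8]) subject []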
>
> I ended up creating a newtype¹ around utf8-encoded ByteStrings and the
> result was quite satisfying, both memory- and runtime-wise. I wish we
> had a package providing a standard type for this that could become
> similarly popular. There is at least one more package on Hackage that
> defines such a type:
> http://hackage.haskell.org/packages/archive/regex-tdfa-utf8/1.0/doc/html/Text-Regex-TDFA-UTF8.html
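>
> For illustration, the core of such a type is just this (a sketch with
> made-up names; the real MyText in arbtt has more instances and
> helpers):
>
>     import qualified Data.ByteString as BS
>     import qualified Data.Text as T
>     import qualified Data.Text.Encoding as TE
>
>     -- Invariant: the wrapped ByteString is always valid UTF-8.
>     newtype Utf8Text = Utf8Text { toUtf8Bytes :: BS.ByteString }
>       deriving (Eq, Ord)
>
>     fromText :: T.Text -> Utf8Text
>     fromText = Utf8Text . TE.encodeUtf8
>
>     toText :: Utf8Text -> T.Text
>     toText = TE.decodeUtf8 . toUtf8Bytes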
>
>
> Greetings,
> Joachim
>
> ¹ http://darcs.nomeata.de/arbtt/src/Data/MyText.hs
>
> --
> Joachim Breitner
>  e-Mail: mail at joachim-breitner.de
>  Homepage: http://www.joachim-breitner.de
>  ICQ#: 74513189
>  Jabber-ID: nomeata at joachim-breitner.de
>

