[Haskell-cafe] Re: Lazy XML handling

Fri May 21 22:27:23 EDT 2004

MR K P Shupke wrote:
>Can you give me an example where processing HAS to be in a different order?

No, but I can give an example where a list based encoding is less 
sufficient.

I'm coming from a completely different view here, namely that of using 
Haskell not to query or transform but to create XML. I'm currently in a 
project[1] trying to fix a working implementation of Haskell Server 
Pages[2], i.e. using Haskell as a server-side scripting language for dynamic 
webpages. Clearly the most used mode of operation in this setting is 
constructing an XML tree rather than querying or transforming an existing 
one.

With this in mind, the list based encoding would be less efficient than a 
treelike datatype. Building the XML tree is likely to be done bottom-up, and 
adding a parent to a subtree would be O(n) with a list, but O(1) with a tree 
datatype. Building a whole tree from the bottom up is O(n^2) for a list, but 
O(n) for the tree datatype.

MR K P Shupke wrote:
>The nested lists does not allow any extra flexibility in traversal, at the 
>cost of
>greater complexity. The one case it is simpler is if you want to look up a 
>single
>node by its 'tree address' - This is more complex with the list-rep, as you
>have to count nodes at a given level, skipping irrelevent nodes...

The fact that the list based representation is indeed more complex is in my 
opinion a clear indication that it should not be the standard. Compare this 
to Haskell strings where the standard is the simple but ineffective [Char], 
but where there is another more efficient representation, PackedString, if 
you really need it. Simplicity and ease of understanding is not to be taken 
lightly.
By using a tree datatype representation you also get pattern matching 
directly on the structure of the tree, a very useful feature to have.

MR K P Shupke wrote:
>The list rep is very good at data extraction, if you want to extract a 
>sub-tree
>of a given type you just ignore all tags until you find one of the correct
>type, then output all tags until the depth returns to the level of the 
>initial
>tag.

That might possibly be a programmatic advantage, though I don't see how it 
would be more difficult with a tree datatype. In any case it is no more 
efficient, both are O(n).
As far as I can tell there is no single area where the list representation 
would outperform a tree datatype, other than the special case of lazy 
streaming.

My suggestions are thus:
The standard representation should be a tree datatype, both because it is 
more general and efficiently supports a wide range of operation modes 
(construction, querying, transformation, rendering), and also because that 
is what lies closest to the intuition of XML.
However, the ability to stream large XML structures lazily is clearly useful 
and is difficult to achieve using a tree datatype. Why not add it as a 
specialisation?
The standard representation would be a datatype XML living in Data.XML (or 
similar), and the specialized would be a type XMLStream living in 
Data.XML.Stream. There should of course also be functions to convert between 
the types.

Comment away,

/Niklas

[1] http://www.dtek.chalmers.se/~d00nibro/hsp/
[2] http://www.research.microsoft.com/~emeijer/Papers/HSP.pdf

_________________________________________________________________
Protect your PC - get McAfee.com VirusScan Online 
http://clinic.mcafee.com/clinic/ibuy/campaign.asp?cid=3963