[Haskell-cafe] Programming style and XML processing in Haskell

Graham Klyne gk at ninebynine.org
Fri May 14 15:00:43 EDT 2004

At 17:45 13/05/04 +0100, MR K P SCHUPKE wrote:
>Just sticking in my two pence worth...
>I am not sure what application you intend this for, but I find most XML
>parsers completely useless. With my application programmer's hat on, I do
>not want to validate against a DTD, I want to extract as much information
>as possible from bad XML... what I would like is a correcting parser - one
>which outputs XML in compliance, but will accept any old rubbish and make
>a best guess attempt to fix it up (based on a set of configurable
>heuristic rules)...

I would think this is a rather specialized requirement.  I certainly don't 
want a "correcting" parser for my work.  But I can see that some 
applications might...

>Secondly I deal with very large documents, the tree form of which won't fit
>in memory, so I would see an XML parser doing the following...
>         parser :: String -> [XmlElements]
>         filter :: [XmlElements] -> [XmlElements]
>         reader :: [XmlElements] -> ... output data types ...
>         writer :: ... input data types ... -> [XmlElements]
>         render :: [XmlElements] -> String
>In order to keep track of the tree structure the tree-depth of each element
>is encoded within the XmlElement type... thus allowing the data to be streamed
>through the filters/readers etc. This means the parser can output the first
>element as soon as it encounters the second element (lazy list == stream in
>Haskell) rather than having to wait until the last element as would happen
>with a DOM tree (it is a tree not a graph as XML elements can only contain
>sub-elements)...

This seems reasonable, and I'd expect a reasonable implementation (of a 
filter) to stream via lazy evaluation where that matches the final usage 
pattern.  The outline I sketched (copied below) was intended to be built 
upon something like HaXml's filter idea, so that streaming processing would 
(in principle) be possible.
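
To make the streaming idea concrete, here is a minimal sketch of a 
depth-tagged event stream with one filter and one reader.  The names 
(XmlEvent, dropElement, textOf) are hypothetical, invented for illustration; 
they are not taken from HaXml or from the parser described in the quoted 
message.

```haskell
-- Hypothetical event type: every item carries its tree depth, as the
-- quoted message suggests.  None of these names come from a real library.
data XmlEvent
  = Open  Int String   -- depth, element name
  | Text  Int String   -- depth, character data
  | Close Int String   -- depth, element name
  deriving (Eq, Show)

-- A 'filter' in the quoted sense: drop every element called `name`,
-- together with its whole subtree.  Output is produced lazily, one event
-- at a time, so the input list never has to be fully in memory.
dropElement :: String -> [XmlEvent] -> [XmlEvent]
dropElement name = go
  where
    go [] = []
    go (Open d n : rest)
      | n == name = go (skipTo d rest)
    go (e : rest) = e : go rest

    -- Nested events all have depth > d, so the first Close at depth d
    -- is the one matching the Open we are skipping.
    skipTo d = drop 1 . dropWhile (not . closes d)

    closes d (Close d' _) = d == d'
    closes _ _            = False

-- A 'reader' in the quoted sense: extract something that is not XML,
-- here simply the character data in document order.
textOf :: [XmlEvent] -> [String]
textOf es = [s | Text _ s <- es]
```

Because both functions consume and produce lazy lists, something like 
take 3 (textOf (dropElement "skip" events)) only forces as much of the 
input as is needed, which is the property the quoted message is after.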

My requirement is not to generate yet more XML, but to extract something 
quite different from the XML, so I think I'd be looking for something like 
your 'reader', which could be part of the lowest element in my diagram.

>As I said the above is just my opinion, and as it happens I have written a
>parser that does the above... I guess that is why there are several
>parsers for XML available (different requirements) and there will probably
>be many more ...

I agree about the different requirements, but I think it would be good if 
this didn't mean different XML libraries;  I'm fishing for an arrangement 
that allows the different requirements to be satisfied from common (or 
overlapping) components.  I like your suggested 
parser/filter/reader/writer/render model, and I'll consider how that fits 
with the existing libraries (I really don't want to start from scratch 
here).  I guess a 'parser' could be a special case of 'writer', and 
'render' a special case of 'reader'.


(Reprise of last part of my previous message...)

What do I think an XML library for Haskell should look like?   The 
components I'd like to see would look something like this:

     XML parser :: String ---> (internal representation)
         |                       \
         |                        -----> IO function to perform full validation
         |                               and external DTD handling [optional]
     XML filter combinators --+--> entity substitution logic
         |                    +--> namespace handling
         |                    +--> XSLT processing [optional, for now **]
     DOM-like read-only interface for access to data at level comparable to
     XML infoset (used to avoid dependency between applications that use
     infoset data and details of the internal representation used.)
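
As a rough illustration of the lower two layers, the following sketch pairs 
list-returning filter combinators (in the general spirit of HaXml's filters) 
with a read-only type class that hides the internal representation.  Every 
name here (Node, Filter, andThen, InfoItem and its methods) is an assumption 
made for the example, not an existing library API.

```haskell
-- Hypothetical internal representation; a real library would also cover
-- comments, processing instructions, namespaces and so on.
data Node
  = Element String [(String, String)] [Node]  -- name, attributes, children
  | CharData String
  deriving (Eq, Show)

-- A filter maps one node to zero or more result nodes.
type Filter = Node -> [Node]

-- Sequential composition: feed every result of f into g.
andThen :: Filter -> Filter -> Filter
andThen f g n = concatMap g (f n)

-- Select the immediate children of an element.
children :: Filter
children (Element _ _ cs) = cs
children _                = []

-- Keep a node only if it is an element with the given name.
named :: String -> Filter
named t n@(Element t' _ _) | t == t' = [n]
named _ _                            = []

-- Read-only, infoset-like view.  Applications written against this class
-- do not depend on the Node type itself.
class InfoItem a where
  itemName       :: a -> Maybe String
  itemAttributes :: a -> [(String, String)]
  itemChildren   :: a -> [a]
  itemText       :: a -> String   -- character data held directly by the item

instance InfoItem Node where
  itemName (Element n _ _)        = Just n
  itemName (CharData _)           = Nothing
  itemAttributes (Element _ as _) = as
  itemAttributes (CharData _)     = []
  itemChildren (Element _ _ cs)   = cs
  itemChildren (CharData _)       = []
  itemText (CharData s)           = s
  itemText (Element _ _ _)        = ""

-- Example application code that uses only the read-only interface.
allText :: InfoItem a => a -> String
allText x = itemText x ++ concatMap allText (itemChildren x)
```

The point of the class is the one made in the diagram: a different internal 
representation only needs its own InfoItem instance, and code such as 
allText keeps working unchanged.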

Graham Klyne