[Haskell-cafe] Programming style and XML processing in Haskell

Thu May 13 18:13:49 EDT 2004

After an extended period of procrastination and nipping at the edges of the 
problem, I feel a need to tackle head on the requirement for a "usable" XML 
handling library in Haskell.  As far back as 1999, Malcolm Wallace and 
Colin Runciman observed that "Haskell is a very suitable language for XML 
processing" [1], yet there still does not seem to be a generically useful 
XML handling library, suitable (say) for inclusion as part of a standard 
Haskell compiler library.

Why do I say this?  I have looked at three separate XML libraries, and each 
has problems which I perceive make them unsuitable for the purposes I have 
in mind.  It may be that part of my problem here is a mistaken view of 
Haskell programming style, possible exposure of which is one of my reasons 
for posting this.

My immediate goal is to create an RDF/XML parser, the output from which is 
a data structure representing an RDF graph.  This involves parsing the XML 
to an XML-infoset-like form, then traversing this to extract information 
for the RDF graph.  I want to create a function like this:
     parseRDFXML :: String -> RDFGraph

My XML processing requirements:
(1) basic XML parsing
(2) predefined and character entity handling (&lt; &amp; &#n; etc)
(3) general entity handling (DTD entity definitions and substitutions, per 
internal DTD subset)
(4) easy access to values extracted from XML data (i.e. for XML-to-non-XML 
processing)
(5) XML namespace handling
(6) library usable outside the IO monad (i.e. by functions that return 
non-IO values)

Non-requirements, but maybe nice to have:
(7) parameter entities
(8) External entities
(9) XML/DTD validation

(1)-(4) correspond roughly to the level of support required of XML parsers 
for handling "standalone" documents.

Almost all modern usage of XML that I'm aware of depends to a lesser or 
greater degree on (5).

(8) and (9) are, I think, in conflict with having library functions that 
can be used outside the IO monad, since they require that the parser be 
able to access external data.  (7) is really a helper for DTD-based 
validation, and my own view is that validity checking is better performed 
using XML schema.  Some of the facilities provided by (8) are now being 
addressed by alternative activities that build upon basic XML (XInclude, 
binary attachments for SOAP, etc.).

Requirement (6) arises for me because I have adopted a style of programming 
in Haskell that consists mostly of pure functions, without recourse to 
monads.  I use parser monads locally as required, and I use IO and state 
monads at the upper levels of my programs to deal with and record the 
program's interaction with the outside world.  It seems to me that this 
approach leads to functions that are easier to pick up and use.  I've found 
that, when using third party libraries, stand-alone functions present an 
easier learning curve compared with libraries that are based around a 
(sometimes complex) monadic state.   Maybe I'm missing something here?
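
To make the style concrete, here is a minimal sketch of a pure core with IO 
confined to the outer edge.  Every name in it (Triple, RDFGraph, parseGraph, 
loadGraph) is a hypothetical stand-in, and the "parser" reads a toy 
one-triple-per-line format rather than RDF/XML; only the shape of the types 
is the point.

```haskell
-- Hypothetical stand-in for a real graph type: a list of
-- (subject, predicate, object) triples.
type Triple = (String, String, String)

newtype RDFGraph = RDFGraph [Triple]
  deriving (Eq, Show)

-- Pure core: no IO in the type, so any other pure function can call it.
-- Toy input format (an assumption for illustration): one triple per line,
-- three whitespace-separated words; empty lines are skipped.
parseGraph :: String -> Either String RDFGraph
parseGraph = fmap RDFGraph . mapM parseLine . filter (not . null) . lines
  where
    parseLine l = case words l of
      [s, p, o] -> Right (s, p, o)
      _         -> Left ("malformed line: " ++ l)

-- IO appears only at the outer edge, where the file is actually read.
loadGraph :: FilePath -> IO (Either String RDFGraph)
loadGraph path = fmap parseGraph (readFile path)
```

A caller that already holds the document text never needs to touch IO; only 
loadGraph does, and it is a one-line wrapper.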

...

Turning to the XML libraries, I've looked at three:
(A) Joe English's HXML parser [2]
(B) HaXml [3]
(C) Haskell Xml Toolbox [4]

(A) HXML is very easy to understand and use, but it does very little more 
than basic XML parsing.  No level of DTD handling is provided, as far as 
I've been able to determine.

(B) HaXml does a little more of what I want, to the extent that it can 
parse DTDs, and even perform some basic validity checking.  I can't find 
anywhere in the code that seems to address substitution of entities defined 
in the DTD, and I'm not sure if it can parse a DTD and XML from the same 
XML file.  There are references in the code to external DTD subsets, but I 
can't see any attempt to implement this.  I have found that HaXml's 
error handling is rather severe, in that there is a wide range of input data 
errors that cause the library to call 'error' rather than return a 
diagnostic value.
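
As an illustration of what "return a diagnostic value" could look like, the 
following sketch expands predefined and decimal character entity references 
and reports bad input through Either instead of calling 'error'.  The 
functions resolveEntity and expandEntities are hypothetical and are not part 
of HaXml; hexadecimal references and DTD-defined entities are deliberately 
left out.

```haskell
import Data.Char (chr, isDigit)

-- Hypothetical helper, not HaXml code.
-- Takes the text between '&' and ';', e.g. "lt" or "#65".
resolveEntity :: String -> Either String Char
resolveEntity "lt"   = Right '<'
resolveEntity "gt"   = Right '>'
resolveEntity "amp"  = Right '&'
resolveEntity "apos" = Right '\''
resolveEntity "quot" = Right '"'
resolveEntity ('#':ds)
  -- Decimal character reference; the length check keeps 'read' from
  -- overflowing, and a failed guard falls through to the last equation.
  | not (null ds), length ds <= 7, all isDigit ds, n <= 0x10FFFF = Right (chr n)
  where n = read ds :: Int
resolveEntity name = Left ("unknown entity: &" ++ name ++ ";")

-- Expand every reference in a string, stopping at the first problem
-- with a diagnostic rather than aborting the program.
expandEntities :: String -> Either String String
expandEntities [] = Right []
expandEntities ('&':rest) =
  case break (== ';') rest of
    (name, ';':more) -> do
      c  <- resolveEntity name
      cs <- expandEntities more
      Right (c : cs)
    _ -> Left "unterminated entity reference"
expandEntities (c:rest) = fmap (c :) (expandEntities rest)
```

The caller then decides what a failure means: report it, skip the document, 
or fall back to some default.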

(C) Haskell Xml Toolbox is the most functional package (being the only 
package with XML namespace support) and also the most difficult to 
use.  Unfortunately, it seems that much of the DTD functionality (needed 
for expanding general entities) is performed in the I/O monad, as it is part 
of the code that performs validation, which, as noted above, needs access to 
external resources.

My biggest problem with this package is that it seems to be very difficult 
and unwieldy to use as part of another library:  much of the code seems to 
be oriented toward creating complete programs for XML-to-XML 
transformations of various kinds.

I've a view that XML namespace support should be quite easy to graft onto 
either of the other packages, given an extension to the data type used to 
describe nodes and elements.  Over the past couple of months, I've been 
wavering between pushing ahead with (B) or (C).  Both have problems, and 
either would require significant effort on my part.

If I use (C), it involves the least amount of new code, but I think I would 
find myself ripping out chunks of code to create functions that I can use 
outside the IO monad, which would effectively fork the codebase.

If I use (B), I need to address the error handling problem, though I think 
I know roughly how to do that (I already made a start;  details below).   A 
previous problem I had was that the HaXml code needs CPP preprocessing, 
which was problematic for me, but since then a simple CPP-equivalent in 
Haskell has been implemented so I think I can work around that problem.  I 
think I'd need to write new code to deal with entity substitution and 
namespaces, but I think both of those could be implemented as filters that 
layer on top of the basic package.

So the pendulum swings again, and I now think that HaXml is looking like 
the most promising base for further development.

...

What do I think an XML library for Haskell should look like?   The 
components I'd like to see would look something like this:

     XML parser :: String ---> (internal representation)
         |                       \
         |                        -----> IO function to perform full validation
         |                               and external DTD handling [optional]
         v
     XML filter combinators --+--> entity substitution logic
         |                    +--> namespace handling
         |                    +--> XSLT processing [optional, for now **]
         v
     DOM-like read-only interface for access to data at level comparable to
     XML infoset (used to avoid dependency between applications that use
     infoset data and details of the internal representation used.)

Does this seem reasonable?

[**] my thought is that an XSLT document could be "compiled" into an XML 
filter function.
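
The layering in the diagram might be sketched roughly as follows.  This is a 
toy under stated assumptions, not a proposal for actual types: Node, Filter, 
andThen, named, children, resolveNames and textOf are all hypothetical names, 
the filter type only loosely follows the combinator style described in [1], 
and the namespace pass ignores attribute names and the xmlns="" 
undeclaration case.

```haskell
-- Toy internal representation (hypothetical; not any library's real types).
data Node
  = Elem String [(String, String)] [Node]   -- name, attributes, children
  | Text String
  deriving (Eq, Show)

-- A filter maps one node to zero or more result nodes.
type Filter = Node -> [Node]

-- Apply the second filter to every result of the first.
andThen :: Filter -> Filter -> Filter
andThen f g = concatMap g . f

-- Keep a node only if it is an element with the given name.
named :: String -> Filter
named n e@(Elem m _ _) | n == m = [e]
named _ _ = []

-- The immediate children of an element.
children :: Filter
children (Elem _ _ cs) = cs
children _             = []

-- Namespace handling as a separate pass over the parsed tree: element
-- names become "{uri}local" using the xmlns declarations in scope.
-- Names whose prefix is not bound are left unchanged.
type NsEnv = [(String, String)]   -- prefix -> URI; "" is the default namespace

resolveNames :: NsEnv -> Node -> Node
resolveNames _   t@(Text _) = t
resolveNames env (Elem name attrs cs) =
    Elem (expand name) attrs (map (resolveNames env') cs)
  where
    -- Declarations on this element shadow the inherited environment.
    env' = [ (drop 6 k, v) | (k, v) <- attrs, take 6 k == "xmlns:" ]
        ++ [ ("", v)       | ("xmlns", v) <- attrs ]
        ++ env
    expand n = case break (== ':') n of
      (p, ':' : local) -> maybe n (clark local) (lookup p env')
      (local, _)       -> maybe n (clark local) (lookup "" env')
    clark local uri = "{" ++ uri ++ "}" ++ local

-- Read-only accessor: all character data below a node.
textOf :: Node -> String
textOf (Text s)      = s
textOf (Elem _ _ cs) = concatMap textOf cs
```

An application would then run something like 
map textOf ((children `andThen` named "{uri}item") (resolveNames [] doc)) 
and never pattern-match on Node directly, which is what keeps it independent 
of the internal representation.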

#g
--

[1] http://www.cs.york.ac.uk/fp/HaXml/icfp99.html#furtherwork

[2] http://www.flightlab.com/~joe/hxml/

[3] http://www.cs.york.ac.uk/fp/HaXml/

[4] http://www.fh-wedel.de/~si/HXmlToolbox/

....

Work I've already done on HaXml:

I made a start on a unit test program, and some modifications to the HMW 
combinator library to allow parse errors to be handled by the calling 
program.  The initial test data has been stolen from the Hxml Toolbox 
software kit.  The 4 test cases all run without errors under Hugs.  Feel 
free to grab anything you think may be useful.

The revisions are here:
   http://www.ninebynine.org/Software/HaskellUtils/HaXml-1.11/

The test program and data files are here:
   http://www.ninebynine.org/Software/HaskellUtils/HaXml-1.11/test/
(The test program has a commented-out feature to generate formatted 
versions of the input files which can be renamed for use as comparison test 
data.)

The modified source code includes:

+ http://www.ninebynine.org/Software/HaskellUtils/HaXml-1.11/src/Text/ParserCombinators/HuttonMeijerWallace.hs
modified to include an option to return a diagnostic message or parser 
result, via an (Either String a) value.  The original interface is (mostly) 
preserved, and new functions added to support the extended return values 
(e.g.  papplydiag).

+ http://www.ninebynine.org/Software/HaskellUtils/HaXml-1.11/src/Text/XML/HaXml/Parse.hs
modified to work with the revised parser structure.  A new function, 
xmlParseDiag, added to return an (Either String a) value.  Also added an 
eof parser and documentOnly functions to perform some of the function of 
sanitycheck.  (I also commented out the #if stuff so I could test under Hugs.)
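
For readers unfamiliar with the idea, a parser whose result is an 
(Either String a) value, together with an eof parser, can be sketched as 
below.  This is a self-contained toy and not the modified 
HuttonMeijerWallace code: Parser, char, eof and parseAll are hypothetical 
names, and the real library's types and function signatures may well differ.

```haskell
-- Toy parser type: either a diagnostic message, or a result plus the
-- remaining input.  (Hypothetical; not the modified library's definition.)
newtype Parser a = Parser { runParser :: String -> Either String (a, String) }

instance Functor Parser where
  fmap f p = Parser $ \s -> do
    (a, rest) <- runParser p s
    Right (f a, rest)

instance Applicative Parser where
  pure a = Parser $ \s -> Right (a, s)
  pf <*> pa = Parser $ \s -> do
    (f, rest)  <- runParser pf s
    (a, rest') <- runParser pa rest
    Right (f a, rest')

instance Monad Parser where
  p >>= f = Parser $ \s -> do
    (a, rest) <- runParser p s
    runParser (f a) rest

-- Match one specific character, or fail with a diagnostic.
char :: Char -> Parser Char
char c = Parser $ \s -> case s of
  (x:xs) | x == c -> Right (c, xs)
  (x:_)           -> Left ("expected " ++ show c ++ " but found " ++ show x)
  []              -> Left ("expected " ++ show c ++ " but found end of input")

-- Succeed only when all input has been consumed.
eof :: Parser ()
eof = Parser $ \s -> case s of
  [] -> Right ((), [])
  _  -> Left ("unexpected trailing input: " ++ take 10 s)

-- Run a parser that must consume the whole input; the caller receives
-- the diagnostic as a value instead of the program aborting.
parseAll :: Parser a -> String -> Either String a
parseAll p = fmap fst . runParser (p <* eof)
```

With this shape, trailing junk after the document becomes an ordinary Left 
value rather than something a separate sanity check has to catch.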

(Malcolm pointed out to me that the lexer also throws some errors, but I 
think that could be addressed by returning an error token and leaving the 
parser to deal with the resulting syntax error.)
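
A minimal sketch of the error-token idea, assuming a drastically simplified 
token set: Token, lexer and parseTag are hypothetical names and bear no 
relation to HaXml's actual lexer.

```haskell
import Data.Char (isAlpha, isSpace)

-- Toy token type with an explicit error case (hypothetical).
data Token
  = TokOpen            -- '<'
  | TokClose           -- '>'
  | TokName String
  | TokError String    -- lexical error, carried as data
  deriving (Eq, Show)

-- The lexer never calls 'error': an unrecognised character becomes a
-- TokError and lexing stops there.
lexer :: String -> [Token]
lexer [] = []
lexer ('<':cs) = TokOpen  : lexer cs
lexer ('>':cs) = TokClose : lexer cs
lexer s@(c:cs)
  | isSpace c = lexer cs
  | isAlpha c = let (name, rest) = span isAlpha s
                in TokName name : lexer rest
  | otherwise = [TokError ("unexpected character " ++ show c)]

-- The parser treats an error token like any other unexpected token and
-- reports it as a diagnostic value.
parseTag :: [Token] -> Either String String
parseTag [TokOpen, TokName n, TokClose] = Right n
parseTag ts = case [msg | TokError msg <- ts] of
  (msg:_) -> Left ("lexical error: " ++ msg)
  []      -> Left "syntax error: expected <name>"
```

The lexer stays a plain String -> [Token] function, and all error reporting 
ends up in one place, the parser's return value.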

------------
Graham Klyne
For email:
http://www.ninebynine.org/#Contact