[Haskell-cafe] Programming style and XML processing in Haskell
Graham Klyne
gk at ninebynine.org
Thu May 13 18:13:49 EDT 2004
After an extended period of procrastination and nipping at the edges of the
problem, I feel a need to tackle head on the requirement for a "usable" XML
handling library in Haskell. As far back as 1999, Malcolm Wallace and
Colin Runciman observed that "Haskell is a very suitable language for XML
processing" [1], yet there still does not seem to be a generically useful
XML handling library, suitable (say) for inclusion as part of a standard
Haskell compiler library.
Why do I say this? I have looked at three separate XML libraries, and each
has problems that, as I see it, make it unsuitable for the purposes I have
in mind. It may be that part of my problem here is a mistaken view of
Haskell programming style, possible exposure of which is one of my reasons
for posting this.
My immediate goal is to create an RDF/XML parser, the output from which is
a data structure representing an RDF graph. This involves parsing the XML
to an XML-infoset-like form, then traversing this to extract information
for the RDF graph. I want to create a function like this:
parseRDFXML :: String -> RDFGraph
My XML processing requirements:
(1) basic XML parsing
(2) predefined and character entity handling (&lt; &amp; &#n; etc.)
(3) general entity handling (DTD entity definitions and substitutions, per
internal DTD subset)
(4) easy access to values extracted from XML data (i.e. for XML-to-non-XML
processing)
(5) XML namespace handling
(6) library usable outside the IO monad (i.e. by functions that return
non-IO values)
Non-requirements, but maybe nice to have:
(7) parameter entities
(8) External entities
(9) XML/DTD validation
(1)-(4) correspond roughly to the level of support required of XML parsers
for handling "standalone" documents.
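As a rough illustration of requirement (2), and of what "usable outside the IO monad" means for it, here is a minimal sketch of a pure function that expands the five predefined entities and numeric character references in a string. The names expandRefs, lookupRef and codePoint are hypothetical and come from none of the libraries discussed here. The sketch does not check that a code point is a legal XML character, and it knows nothing about DTD-defined entities (requirement (3)).

```haskell
import Data.Char (chr, digitToInt, isDigit, isHexDigit)

-- Hypothetical helper: expand predefined entities and numeric character
-- references. Unknown or malformed references are reported as Left.
expandRefs :: String -> Either String String
expandRefs [] = Right []
expandRefs ('&' : rest) =
  case break (== ';') rest of
    (name, ';' : rest') -> do
      c  <- lookupRef name
      cs <- expandRefs rest'
      return (c : cs)
    _ -> Left "unterminated reference"
expandRefs (c : cs) = (c :) <$> expandRefs cs

-- Resolve the text between '&' and ';' to a single character.
lookupRef :: String -> Either String Char
lookupRef "lt"   = Right '<'
lookupRef "gt"   = Right '>'
lookupRef "amp"  = Right '&'
lookupRef "apos" = Right '\''
lookupRef "quot" = Right '"'
lookupRef ('#' : 'x' : hex)
  | not (null hex), all isHexDigit hex = codePoint 16 hex
lookupRef ('#' : dec)
  | not (null dec), all isDigit dec = codePoint 10 dec
lookupRef name = Left ("unknown reference: &" ++ name ++ ";")

-- Convert a digit string in the given base to a character.
-- Integer is used so that absurdly long digit strings cannot overflow.
codePoint :: Integer -> String -> Either String Char
codePoint base digits
  | n <= 0x10FFFF = Right (chr (fromInteger n))
  | otherwise     = Left "character reference out of range"
  where
    n = foldl (\acc d -> acc * base + toInteger (digitToInt d)) 0 digits
```

Because the result is an Either value rather than an IO action, a caller such as an RDF/XML parser can use it from ordinary pure code and decide for itself how to report a failure.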
Almost all modern usage of XML that I'm aware of depends to a lesser or
greater degree on (5).
(8) and (9) are, I think, in conflict with having library functions that
can be used outside the IO monad, since they require that the parser be
able to access external data. (7) is really a helper for DTD-based
validation, and my own view is that validity checking is better performed
using XML schema. Some of the facilities provided by (8) are now being
addressed by alternative activities that build upon basic XML (XInclude,
binary attachments for SOAP, etc.).
Requirement (6) arises for me because I have adopted a style of programming
in Haskell that consists mostly of pure functions, without recourse to
monads. I use parser monads locally as required, and I use IO and state
monads at the upper levels of my programs to deal with and record the
program's interaction with the outside world. It seems to me that this
approach leads to functions that are easier to pick up and use. I've found
that, when using third party libraries, stand-alone functions present an
easier learning curve compared with libraries that are based around a
(sometimes complex) monadic state. Maybe I'm missing something here?
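To make the style concrete, here is a minimal sketch of the split described above: a pure core function, plus a thin IO wrapper used only at the top level. The function names are hypothetical, and the "parser" is a deliberately naive stand-in (it would be fooled by comments and CDATA sections); the point is only the shape of the types.

```haskell
import Data.Char (isAlphaNum)

-- Pure core: collect the names of start tags in document order.
-- No IO appears in the type, so any other pure function can call it.
startTagNames :: String -> [String]
startTagNames ('<' : c : rest)
  | isNameChar c =
      let (name, rest') = span isNameChar (c : rest)
      in  name : startTagNames rest'
startTagNames (_ : rest) = startTagNames rest
startTagNames []         = []

-- Rough approximation of the characters allowed in an XML name.
isNameChar :: Char -> Bool
isNameChar c = isAlphaNum c || c `elem` "_-.:"

-- IO is confined to a thin wrapper at the upper level of the program.
startTagNamesFromFile :: FilePath -> IO [String]
startTagNamesFromFile path = startTagNames <$> readFile path
```

A library built around the first kind of function can be tested and reused without any monadic context; the wrapper is all that the outermost layer of a program needs to add.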
...
Turning to the XML libraries, I've looked at three:
(A) Joe English's HXML parser [2]
(B) HaXml [3]
(C) Haskell Xml Toolbox [4]
(A) HXML is very easy to understand and use, but it does very little more
than basic XML parsing. No level of DTD handling is provided, as far as
I've been able to determine.
(B) HaXml does a little more of what I want, to the extent that it can
parse DTDs, and even perform some basic validity checking. I can't find
anywhere in the code that seems to address substitution of entities defined
in the DTD, and I'm not sure if it can parse a DTD and XML from the same
XML file. There are references in the code to external DTD subsets, but I
can't see any attempt to implement this. I have found that HaXml's
error handling is rather severe, in that there is a wide range of input data
errors that cause the library to call 'error' rather than return a diagnostic
value.
(C) Haskell Xml Toolbox is the most functional package (being the only
package with XML namespace support) and also the most difficult to
use. Unfortunately, it seems that much of the DTD functionality (needed
for expanding general entities) is performed in the IO monad, as it is part
of the code that performs validation, which, as noted above, needs access to
external resources.
My biggest problem with this package is that it seems to be very difficult
and unwieldy to use as part of another library: much of the code seems to
be oriented toward creating complete programs for XML-to-XML
transformations of various kinds.
I've a view that XML namespace support should be quite easy to graft onto
either of the other packages, given an extension to the data type used to
describe nodes and elements. Over the past couple of months, I've been
wavering between pushing ahead with (B) or (C). Both have problems, and
either would require significant effort on my part.
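To sketch what "an extension to the data type used to describe nodes and elements" might look like, here is a small self-contained example. All the types and function names (Elem, QName, NsElem, resolve and so on) are hypothetical and are not taken from HXML or HaXml. It assumes the parser has already produced a tree with raw names and attributes, and it resolves prefixes against the xmlns declarations in scope. Text nodes and the predeclared xml prefix are left out; a caller could seed the initial environment with the latter.

```haskell
-- Element tree as a namespace-unaware parser might return it:
-- raw name, raw attributes, child elements (text nodes omitted).
data Elem = Elem String [(String, String)] [Elem]
  deriving (Eq, Show)

-- The extension: a name carries an optional namespace URI and a local part.
data QName = QName (Maybe String) String
  deriving (Eq, Show)

data NsElem = NsElem QName [(QName, String)] [NsElem]
  deriving (Eq, Show)

-- Prefix-to-URI bindings in scope; the prefix "" is the default namespace.
type NsEnv = [(String, String)]

-- Resolve names throughout a tree, dropping the xmlns attributes themselves.
resolve :: NsEnv -> Elem -> Either String NsElem
resolve env (Elem name attrs kids) = do
  let decls = [ (drop 6 k, v) | (k, v) <- attrs, isDecl k ]
      plain = [ (k, v) | (k, v) <- attrs, not (isDecl k) ]
      env'  = decls ++ env          -- inner declarations shadow outer ones
  qn     <- qualify True env' name
  attrs' <- mapM (qualifyAttr env') plain
  kids'  <- mapM (resolve env') kids
  return (NsElem qn attrs' kids')

-- "xmlns" declares the default namespace, "xmlns:p" declares prefix p.
isDecl :: String -> Bool
isDecl k = k == "xmlns" || take 6 k == "xmlns:"

-- Unprefixed attributes never take the default namespace.
qualifyAttr :: NsEnv -> (String, String) -> Either String (QName, String)
qualifyAttr env (k, v) = do
  q <- qualify False env k
  return (q, v)

qualify :: Bool -> NsEnv -> String -> Either String QName
qualify useDefault env name =
  case break (== ':') name of
    (local, "") ->
      Right (QName (if useDefault then defaultNs else Nothing) local)
    (prefix, _ : local)
      | null prefix || null local ->
          Left ("malformed qualified name: " ++ name)
      | otherwise ->
          case lookup prefix env of
            Just uri -> Right (QName (Just uri) local)
            Nothing  -> Left ("undeclared namespace prefix: " ++ prefix)
  where
    -- xmlns="" un-declares the default namespace.
    defaultNs = case lookup "" env of
      Just uri | not (null uri) -> Just uri
      _                         -> Nothing
```

Since resolve is an ordinary function from one tree type to another, it could sit on top of either parser without changes to the parser itself.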
If I use (C), it involves the least amount of new code, but I think I would
find myself ripping out chunks of code to create functions that I can use
outside the IO monad, which would effectively fork the codebase.
If I use (B), I need to address the error handling problem, though I think
I know roughly how to do that (I already made a start; details below). A
previous problem I had was that the HaXml code needs CPP preprocessing,
which was problematic for me, but since then a simple CPP-equivalent in
Haskell has been implemented so I think I can work around that problem. I
think I'd need to write new code to deal with entity substitution and
namespaces, but I think both of those could be implemented as filters that
layer on top of the basic package.
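A minimal sketch of the "filters layered on top" idea follows. It assumes a simplified content type in which the parser leaves general entity references as explicit Ref nodes, and a filter type of the general shape Content -> [Content]. The types and names here (Content, Filter, substEntities, mergeText, andThen) are defined locally for illustration and are not HaXml's own definitions.

```haskell
-- Simplified document content (attributes omitted for brevity).
data Content
  = Elem String [Content]   -- element name and children
  | Text String             -- character data
  | Ref String              -- unexpanded general entity reference, &name;
  deriving (Eq, Show)

-- A filter maps one piece of content to zero or more pieces.
type Filter = Content -> [Content]

-- Entity definitions as they might be collected from an internal DTD subset.
type Entities = [(String, [Content])]

-- Run one filter, then another over each of its results.
andThen :: Filter -> Filter -> Filter
andThen f g = concatMap g . f

-- Replace known entity references by their (recursively expanded)
-- definitions. Unknown references, and references that would recurse
-- into themselves, are left in place.
substEntities :: Entities -> Filter
substEntities ents = go []
  where
    go seen (Elem n cs) = [Elem n (concatMap (go seen) cs)]
    go seen (Ref name)
      | name `notElem` seen, Just repl <- lookup name ents
          = concatMap (go (name : seen)) repl
    go _ c = [c]

-- Merge adjacent text nodes, e.g. after entity substitution.
mergeText :: Filter
mergeText (Elem n cs) = [Elem n (merge (concatMap mergeText cs))]
  where
    merge (Text a : Text b : rest) = merge (Text (a ++ b) : rest)
    merge (c : rest)               = c : merge rest
    merge []                       = []
mergeText c = [c]
```

With these definitions, substEntities ents `andThen` mergeText is itself a Filter, so a namespace-handling filter could be added to the chain in the same way.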
So the pendulum swings again, and I now think that HaXml is looking like
the most promising base for further development.
...
What do I think an XML library for Haskell should look like? The
components I'd like to see would look something like this:
XML parser :: String ---> (internal representation)
               |  \
               |   -----> IO function to perform full validation
               |          and external DTD handling [optional]
               v
XML filter combinators --+--> entity substitution logic
               |         +--> namespace handling
               |         +--> XSLT processing [optional, for now **]
               v
DOM-like read-only interface for access to data at level comparable to
XML infoset (used to avoid dependency between applications that use
infoset data and details of the internal representation used.)
Does this seem reasonable?
[**] my thought is that an XSLT document could be "compiled" into an XML
filter function.
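One possible shape for the DOM-like read-only interface is a type class that exposes infoset-level accessors, so that application code is written against the class rather than against the parser's concrete tree type. This is only a sketch: the class InfoItem, its methods, and the sample Node type are all hypothetical.

```haskell
-- Read-only, infoset-like view of a node.
class InfoItem n where
  itemName     :: n -> Maybe String        -- element name, if an element
  itemAttrs    :: n -> [(String, String)]  -- attributes, if an element
  itemChildren :: n -> [n]                 -- child items in document order
  itemText     :: n -> Maybe String        -- character data, if a text item

-- Application code depends only on the class.
textContent :: InfoItem n => n -> String
textContent n = maybe (concatMap textContent (itemChildren n)) id (itemText n)

attrValue :: InfoItem n => String -> n -> Maybe String
attrValue key = lookup key . itemAttrs

-- One possible internal representation, hidden behind the class.
data Node
  = Element String [(String, String)] [Node]
  | CharData String

instance InfoItem Node where
  itemName (Element n _ _) = Just n
  itemName (CharData _)    = Nothing

  itemAttrs (Element _ as _) = as
  itemAttrs (CharData _)     = []

  itemChildren (Element _ _ cs) = cs
  itemChildren (CharData _)     = []

  itemText (CharData s) = Just s
  itemText _            = Nothing
```

If the internal representation later changes (say, to carry namespace-qualified names), only the instance needs rewriting; functions such as textContent are unaffected.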
#g
--
[1] http://www.cs.york.ac.uk/fp/HaXml/icfp99.html#furtherwork
[2] http://www.flightlab.com/~joe/hxml/
[3] http://www.cs.york.ac.uk/fp/HaXml/
[4] http://www.fh-wedel.de/~si/HXmlToolbox/
....
Work I've already done on HaXml:
I made a start on a unit test program, and some modifications to the HMW
combinator library to allow parse errors to be handled by the calling
program. The initial test data has been stolen from the Haskell Xml Toolbox
software kit. The 4 test cases all run without errors under Hugs. Feel
free to grab anything you think may be useful.
The revisions are here:
http://www.ninebynine.org/Software/HaskellUtils/HaXml-1.11/
The test program and data files are here:
http://www.ninebynine.org/Software/HaskellUtils/HaXml-1.11/test/
(The test program has a commented-out feature to generate formatted
versions of the input files which can be renamed for use as comparison test
data.)
The modified source code includes:
+ http://www.ninebynine.org/Software/HaskellUtils/HaXml-1.11/src/Text/ParserCombinators/HuttonMeijerWallace.hs
modified to include an option to return a diagnostic message or parser
result, via an (Either String a) value. The original interface is (mostly)
preserved, and new functions added to support the extended return values
(e.g. papplydiag).
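For readers unfamiliar with the idea, here is a minimal sketch of a parser type whose result is an (Either String a) value instead of a call to 'error'. It is not the modified HuttonMeijerWallace code: the real library is a list-of-successes parser with its own state handling, and the signature of papplydiag is not reproduced here. The names Parser, satisfy, takeWhile1, startTag and applyDiag are hypothetical.

```haskell
import Data.Char (isAlpha)

-- A parser either fails with a diagnostic or yields a value and the
-- remaining input.
newtype Parser a = Parser { runParser :: String -> Either String (a, String) }

instance Functor Parser where
  fmap f (Parser p) = Parser $ \s -> case p s of
    Left e        -> Left e
    Right (a, s') -> Right (f a, s')

instance Applicative Parser where
  pure a = Parser $ \s -> Right (a, s)
  Parser pf <*> Parser pa = Parser $ \s -> case pf s of
    Left e        -> Left e
    Right (f, s') -> case pa s' of
      Left e         -> Left e
      Right (a, s'') -> Right (f a, s'')

instance Monad Parser where
  Parser p >>= f = Parser $ \s -> case p s of
    Left e        -> Left e
    Right (a, s') -> runParser (f a) s'

-- Accept one character satisfying the predicate; 'what' names the
-- expectation for use in the diagnostic.
satisfy :: String -> (Char -> Bool) -> Parser Char
satisfy what ok = Parser $ \s -> case s of
  (c : cs) | ok c -> Right (c, cs)
  (c : _)         -> Left ("expected " ++ what ++ " but found " ++ show c)
  []              -> Left ("expected " ++ what ++ " but reached end of input")

-- One or more characters satisfying the predicate.
takeWhile1 :: String -> (Char -> Bool) -> Parser String
takeWhile1 what ok = do
  c <- satisfy what ok
  Parser $ \s -> let (cs, rest) = span ok s in Right (c : cs, rest)

-- Toy grammar: a start tag with a purely alphabetic name.
startTag :: Parser String
startTag = do
  _    <- satisfy "'<'" (== '<')
  name <- takeWhile1 "a tag name" isAlpha
  _    <- satisfy "'>'" (== '>')
  return name

-- Run a parser, returning a diagnostic message or the parsed value.
applyDiag :: Parser a -> String -> Either String a
applyDiag p s = fst <$> runParser p s
```

The calling program can then pattern-match on the result and decide what to do with a malformed document, which is not possible when the library terminates via 'error'.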
+ http://www.ninebynine.org/Software/HaskellUtils/HaXml-1.11/src/Text/XML/HaXml/Parse.hs
modified to work with the revised parser structure. A new function,
xmlParseDiag, added to return an (Either String a) value. Also added an
eof parser and documentOnly functions to perform some of the function of
sanitycheck. (I also commented out the #if stuff so I could test under Hugs.)
(Malcolm pointed out to me that the lexer also throws some errors, but I
think that could be addressed by returning an error token and leaving the
parser to deal with the resulting syntax error.)
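The error-token idea can be sketched as follows. This is a toy lexer for a tiny tag syntax, not HaXml's lexer; the Token type, lexer and checkTokens are hypothetical names. Instead of calling 'error' on an unexpected character, the lexer emits a TokError token and stops, and the consumer turns that token into an ordinary diagnostic.

```haskell
import Data.Char (isAlpha, isAlphaNum, isSpace)

data Token
  = TokOpen            -- '<'
  | TokClose           -- '>'
  | TokSlash           -- '/'
  | TokName String
  | TokError String    -- lexical error, carried in the token stream
  deriving (Eq, Show)

-- The lexer never calls 'error': a bad character becomes a final TokError.
lexer :: String -> [Token]
lexer [] = []
lexer ('<' : cs) = TokOpen  : lexer cs
lexer ('>' : cs) = TokClose : lexer cs
lexer ('/' : cs) = TokSlash : lexer cs
lexer s@(c : cs)
  | isSpace c = lexer cs
  | isAlpha c = let (name, rest) = span isAlphaNum s
                in  TokName name : lexer rest
  | otherwise = [TokError ("unexpected character " ++ show c)]

-- Stand-in for the parser side: an error token is reported as a
-- diagnostic value rather than an exception.
checkTokens :: [Token] -> Either String [Token]
checkTokens ts = case [msg | TokError msg <- ts] of
  (msg : _) -> Left msg
  []        -> Right ts
```

In a real parser no separate check would be needed: TokError simply matches none of the grammar's expected tokens, so it surfaces through the parser's normal syntax-error path.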
------------
Graham Klyne
For email:
http://www.ninebynine.org/#Contact