HaXml and HXml toolbox; namespace support

Fri Mar 19 12:10:19 EST 2004

Graham Klyne <gk at ninebynine.org> writes:

> HXml Toolbox
> ------------
> - difficult to find way around documentation;  no obvious high-level 
> description, other than Martin Schmidt's thesis which is out-of-date with 
> respect to the current software.

I fear that HaXml also suffers from inadequate documentation.

> - not developed with Hugs/Windows as an intended target

HaXml does have the advantage of being tested with all three compilers:
ghc, nhc98, and Hugs.  As you have already discovered, support for
Windows is limited, but together we have now developed a 'hack' to
get it going.

> ? efficiency:  some problems parsing large XML files with Hugs 98 are noted.

HaXml /may/ also suffer from space problems when parsing large XML
files.  Joe English's hxml parser is lazier, and can be used as a
drop-in replacement for HaXml's parser, if this turns out to be a problem.
    http://www.flightlab.com/~joe/hxml/

> HaXml
> -----
> + Already part of the common hierarchical library

... and will be distributed as part of the next release of Hugs.

> + XML handling is cleanly separated from other functions
> + separate, hand-coded lexer which I assume will give better performance

With Haskell, never assume anything re expected performance.  I too
would hope the hand-coded lexer gives good performance, but if it
matters, measure it.  There are plenty of profiling tools available.

> + appears to be actively supported

... on a best-effort basis.  I haven't got much time to develop
HaXml actively myself, but am happy to make bugfixes and merge in
new features contributed by others.

> - no namespace support

HaXml ignores namespaces, yes.  The namespace is simply incorporated
into the full name of the element or attribute.  It should be
relatively easy to design filters for querying/transforming namespaces.
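
As a rough illustration of what such a filter would need, here is a
minimal sketch that splits a raw name like "xsl:template" into prefix
and local part, then resolves the prefix against a list of in-scope
declarations.  The names QName, splitQName and resolveNamespace are
hypothetical and not part of HaXml, and the (prefix, URI) list is
assumed to have been collected from xmlns attributes by the caller.

```haskell
-- Hypothetical helpers, not part of HaXml.

-- A raw name split at its first colon.
data QName = QName { qPrefix :: Maybe String, qLocal :: String }
  deriving (Eq, Show)

-- Split "prefix:local" into its two parts.  A name with no colon, or
-- with an empty prefix or local part, is kept whole with no prefix.
splitQName :: String -> QName
splitQName raw =
  case break (== ':') raw of
    (pre, ':' : local)
      | not (null pre) && not (null local) -> QName (Just pre) local
    _ -> QName Nothing raw

-- Resolve an element name against in-scope declarations, given as
-- (prefix, URI) pairs.  The default namespace is stored under the
-- empty prefix "" (an assumption of this sketch).
resolveNamespace :: [(String, String)] -> QName -> Maybe String
resolveNamespace env (QName (Just p) _) = lookup p env
resolveNamespace env (QName Nothing _)  = lookup "" env
```

A real filter would also have to track xmlns declarations as it
descends the tree, since they are scoped to the element that carries
them, and would have to treat unprefixed attributes differently from
unprefixed elements.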

> ? DTD Entity handling ?

Parameter entity references (PERefs) are expanded in-line during
parsing of the DTD.  Because they are a macro facility and can occur
at almost any point in the DTD structure, it is difficult to write
a static datatype structure that includes PERefs fully -- so the
Haskell datatypes representing the DTD do not include them at all.
Thus, you cannot /generate/ PERefs in a DTD with HaXml, only read them.

Definitions for general entity references (GERefs) are gathered into a
lookup table at parse time, and stored inside the top-level document
data structure:

    data Document = Document Prolog (SymTab EntityDef) Element

None of the other HaXml functions do anything further with them,
but they are kept there so that they can be used conveniently.
(The definitions also remain in their original location within the
DTD - they are not macros and do not need to be expanded away.)
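
To show how a caller might use such a table, here is a minimal
sketch of looking up an entity reference in it.  Everything in it is a
stand-in: SymTab is modelled with Data.Map, EntityDef is reduced to
plain replacement text, and Content and expandRef are hypothetical
names, so HaXml's real types will differ.

```haskell
import qualified Data.Map as Map

-- Stand-ins for HaXml's types (assumptions of this sketch).
type SymTab a = Map.Map String a

newtype EntityDef = EntityDef String   -- replacement text only
  deriving (Eq, Show)

-- A much-reduced content type: literal text or a reference "&name;".
data Content = Text String | EntityRef String
  deriving (Eq, Show)

-- Replace a general entity reference by its definition from the table.
-- An undefined entity is reported to the caller rather than ignored.
expandRef :: SymTab EntityDef -> Content -> Either String Content
expandRef _   c@(Text _)    = Right c
expandRef tab (EntityRef n) =
  case Map.lookup n tab of
    Just (EntityDef s) -> Right (Text s)
    Nothing            -> Left ("undefined entity: &" ++ n ++ ";")
```

Applying expandRef to every piece of content, for instance with
mapM, gives either the fully expanded list or the first undefined
entity.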

> - errors returned to caller.  As far as I can tell, errors are raised using 
> the 'error' function... [which I see results in program termination when 
> evaluated].  Ouch!  (Why not 'fail' instead of 'error'?)

Good point.  Should be pretty easy to fix.
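
For illustration only, here is a minimal sketch of reporting failure
through the monad instead of calling 'error'.  The function
attributeValue is hypothetical and not taken from HaXml.  In current
GHC, 'fail' belongs to the MonadFail class rather than to Monad, which
is what the signature below assumes.

```haskell
-- Hypothetical helper, not part of HaXml: look up an attribute and
-- let the caller's monad decide what failure means.
attributeValue :: MonadFail m => String -> [(String, String)] -> m String
attributeValue name attrs =
  case lookup name attrs of
    Just v  -> return v
    Nothing -> fail ("missing attribute: " ++ name)
```

Used at type Maybe, a missing attribute gives Nothing; used in IO,
it raises an exception that the caller can catch.  Either way the
program is not terminated from inside the library.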

> - source code needs CPP preprocessing

Entirely for cross-compiler compatibility.

> * no external DTD support [this is not a problem for me, and I'd certainly 
> prefer it to be optional, or at least separated from the XML parsing, to 
> avoid dependency on an HTTP library].

It is perfectly possible to parse an external DTD separately from
the content.  The only question is how to find the DTD.  Someone once
worked on using the local Catalogue to get hold of the external DTD,
given its SYSTEM reference, but I don't recall whether it was fed back
to me, or if it was, why I didn't merge it - probably configuration
issues about discovering the location of the Catalogue.

> A weakness of both packages seems to be the handling of syntax errors in 
> the input.
> 
> HaXml uses HuttonMeijerWallace combinators - could these be extended in the 
> style of Parsec to return an error description, thus making it possible to 
> provide an interface that allows the calling program to handle any errors?

Yes, certainly.  As I recall, the original Hutton/Meijer papers on monadic
parser combinators developed the scheme starting with 'Parser a',
parameterised simply on the return type, through successively more complex
types parameterised on token type, running state, and finally error type,
ending up with 'Parser s t e a'.  Your suggestion of returning an Either
type might be a quick-and-easy compromise.
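
A minimal sketch of the Either compromise follows, assuming a
deterministic parser rather than the list-of-successes style of the
original combinators, and leaving out the running-state parameter.
The names Parser, satisfy and char here are illustrative and are not
HaXml's or Parsec's definitions.

```haskell
-- A parser over tokens of type t that either fails with an error of
-- type e or returns a result together with the unconsumed input.
newtype Parser t e a = Parser { runParser :: [t] -> Either e (a, [t]) }

instance Functor (Parser t e) where
  fmap f (Parser p) = Parser $ \ts ->
    case p ts of
      Left e          -> Left e
      Right (a, rest) -> Right (f a, rest)

instance Applicative (Parser t e) where
  pure a = Parser $ \ts -> Right (a, ts)
  Parser pf <*> Parser pa = Parser $ \ts ->
    case pf ts of
      Left e          -> Left e
      Right (f, rest) ->
        case pa rest of
          Left e           -> Left e
          Right (a, rest') -> Right (f a, rest')

instance Monad (Parser t e) where
  Parser p >>= f = Parser $ \ts ->
    case p ts of
      Left e          -> Left e          -- the first error stops the parse
      Right (a, rest) -> runParser (f a) rest

-- Accept one token satisfying the predicate.  On failure, build the
-- error from the offending token (Nothing means end of input).
satisfy :: (Maybe t -> e) -> (t -> Bool) -> Parser t e t
satisfy mkErr ok = Parser $ \ts ->
  case ts of
    (t : rest) | ok t -> Right (t, rest)
    (t : _)           -> Left (mkErr (Just t))
    []                -> Left (mkErr Nothing)

-- Accept one specific character, with a descriptive error message.
char :: Char -> Parser Char String Char
char c = satisfy describe (== c)
  where
    describe got =
      "expected " ++ show c ++ ", got " ++ maybe "end of input" show got
```

With this, runParser (char '<') "a" yields a Left carrying a
message the calling program can inspect, instead of an uncatchable
'error'.  The cost is the loss of backtracking over alternatives,
which a fuller design would need to restore.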

> Or, even, just use Parsec?

You are welcome to rewrite HaXml's parser in Parsec if you wish.
It might even become more space-efficient.

Regards,
    Malcolm