[Haskell-cafe] XML parser recommendation?

Tue Oct 23 07:16:02 EDT 2007

"Yitzchak Gale" <gale at sefer.org> wrote:

> Henning Thielemann wrote:
> > HXT uses Parsec, which is strict.
> 
> Is it strict to the extent that it cannot produce any
> output at all until it has read the entire XML document?
> That would make HXT (and Parsec, for that matter)
> useless for a large percentage of tasks.

Yes, and yes.

By contrast, the Utrecht parser combinator library gives "online"
results, meaning that it delivers as much as it can without ambiguity.
It is a bit like laziness, but it analyses the grammar to determine when
it is safe to commit to a value, essentially once no error has been seen
in a prefix of the input.

And the polyparse library has several variations of properly lazy
parsers, which only return results on demand (but there might be parse
errors hidden inside the returned values, as exceptions).  The user
(grammar-writer) decides where the results should be lazy or strict.
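
To make the strict/lazy distinction concrete, here is a minimal sketch
using only the base library.  It does not use the polyparse API; the
names parseStrict, parseLazy and demand are made up for illustration.
The strict version must see every token before it can say Right, while
the lazy version hands back items on demand and hides any parse error
inside the item as an exception.

```haskell
import Control.Exception (ErrorCall, evaluate, try)
import Data.Maybe (fromMaybe)
import Text.Read (readMaybe)

-- Strict flavour (hypothetical name): the result is Left or Right, so
-- every token has to be checked before any number is delivered.
parseStrict :: String -> Either String [Int]
parseStrict = traverse item . words
  where
    item w = maybe (Left ("bad token: " ++ w)) Right (readMaybe w)

-- Lazy flavour (hypothetical name): the list is produced on demand.
-- A bad token becomes an exception hidden inside that one element.
parseLazy :: String -> [Int]
parseLazy = map item . words
  where
    item w = fromMaybe (error ("bad token: " ++ w)) (readMaybe w)

-- Force a single element and surface the hidden parse error, if any.
demand :: Int -> [Int] -> IO (Either ErrorCall Int)
demand i xs = try (evaluate (xs !! i))
```

With this sketch, take 2 (parseLazy "1 2 oops 4") delivers the first two
numbers without ever looking at "oops", whereas parseStrict on the same
input reports the bad token and delivers nothing.  The lazy version even
works on unbounded input, e.g. take 3 (parseLazy (cycle "7 ")).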

HaXml now uses the polyparse library, and you can choose whether you
want well-formedness checking with the original strict parser, or lazy
space-efficient on-demand parsing.  Initial performance results show
that parsing XML lazily is consistently more than twice as fast as
strict parsing, and peaks at half the memory or less.  In some usage
patterns, it can reduce the cost of processing from linear in the size
of the document to a constant (the distance into the document at which
a particular element is found).
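
The "distance into the document" point can be sketched with a toy lazy
tokenizer.  This is not HaXml code: the Token type and the functions
tokens and firstText are hypothetical, and the tokenizer ignores
attributes, comments and everything else a real XML parser handles.

```haskell
-- Hypothetical toy token type, for illustration only.
data Token = Open String | Close String | Text String
  deriving (Eq, Show)

-- Lazy tokenizer: each token is emitted as soon as its own characters
-- have been read; the rest of the input stays unevaluated.
tokens :: String -> [Token]
tokens [] = []
tokens ('<' : '/' : rest) =
  let (name, rest') = break (== '>') rest
  in Close name : tokens (drop 1 rest')
tokens ('<' : rest) =
  let (name, rest') = break (== '>') rest
  in Open name : tokens (drop 1 rest')
tokens s =
  let (txt, rest) = break (== '<') s
  in Text txt : tokens rest

-- Text immediately following the first opening tag with this name.
firstText :: String -> [Token] -> Maybe String
firstText name ts =
  case dropWhile (/= Open name) ts of
    (_ : Text t : _) -> Just t
    _                -> Nothing
```

For example, firstText "title" applied to the tokens of
"<doc><title>Hello</title>" followed by an arbitrarily long (even
infinite) tail returns Just "Hello" after reading only the first few
characters.  The price is that nothing after that point is checked for
well-formedness, and a search for an element that is absent still has
to read the whole document.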

I have just made fresh releases of development versions of these
libraries, for those who would like to experiment.

    http://www.cs.york.ac.uk/fp/polyparse
    http://www.cs.york.ac.uk/fp/HaXml-devel

They are also available on hackage.haskell.org.

Regards,
    Malcolm