[Haskell-cafe] Incremental XML parsing with namespaces?

John Millikin jmillikin at gmail.com
Mon Jun 8 14:39:25 EDT 2009


I'm trying to convert an XML document, incrementally, into a sequence
of XML events. A simple example XML document:

<doc xmlns="org:myproject:mainns" xmlns:x="org:myproject:otherns">
   <title>Doc title</title>
   <x:ref>abc1234</x:ref>
   <html xmlns="http://www.w3.org/1999/xhtml"><body>Hello world!</body></html>
</doc>

The document can be very large, and arrives in chunks over a socket,
so I need to be able to "feed" the text data into a parser and receive
a list of XML events per chunk. Chunks can be separated in time by
intervals of several minutes to an hour, so waiting for the entire
document to arrive before processing it is not an option. The type
signatures would be something like:

type Namespace = String
type LocalName = String

data Attribute = Attribute Namespace LocalName String

data XMLEvent =
   EventElementBegin Namespace LocalName [Attribute] |
   EventElementEnd Namespace LocalName |
   EventContent String |
   EventError String

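-- Feed one chunk of text; get back the updated parser state and the
-- events completed by that chunk.
parse :: Parser -> String -> (Parser, [XMLEvent])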
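
As a minimal sketch of the state-threading shape this signature implies
(not a real XML parser), the toy tokenizer below keeps any incomplete tag
in the parser state until a later chunk completes it. The names Token,
newParser, feed, and feedAll are hypothetical, and Token stands in for
XMLEvent; namespaces, attributes, comments, and CDATA are all ignored.

```haskell
import Data.List (mapAccumL)

-- Hypothetical toy state: text received but not yet turned into tokens.
newtype Parser = Parser String

-- Stand-in for XMLEvent in this sketch.
data Token
  = TokTag String   -- raw text between '<' and '>'
  | TokText String  -- character data
  deriving (Eq, Show)

newParser :: Parser
newParser = Parser ""

-- Same shape as the `parse` signature: old state and a chunk go in,
-- new state and the tokens completed by this chunk come out.
feed :: Parser -> String -> (Parser, [Token])
feed (Parser pending) chunk = go (pending ++ chunk)
  where
    go "" = (Parser "", [])
    go ('<' : rest) =
      case break (== '>') rest of
        (body, '>' : more) -> emit (TokTag body) (go more)
        -- No closing '>' yet: keep the partial tag for the next chunk.
        _                  -> (Parser ('<' : rest), [])
    go text =
      let (txt, more) = break (== '<') text
      in emit (TokText txt) (go more)

    emit tok (p, toks) = (p, tok : toks)

-- Thread the parser state through a list of chunks, one token list per chunk.
feedAll :: Parser -> [String] -> (Parser, [[Token]])
feedAll = mapAccumL feed
```

Character data that spans a chunk boundary is emitted as two TokText
tokens, which a consumer can concatenate if it needs the whole string.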

I've looked at HaXml, HXT, and hexpat, and unless I'm missing
something, none of them can achieve this:

+ HaXml and hexpat seem to disregard namespaces entirely -- that is,
the root element is parsed to "doc" instead of
("org:myproject:mainns", "doc"), and the second child is "x:ref"
instead of ("org:myproject:otherns", "ref"). Obviously, this makes
parsing mixed-namespace documents effectively impossible. I found an
email from 2004[1] that mentions a "filter" for namespace support in
HaXml, but I could find no further information and no working code.

+ HXT looks promising, because I see explicit mention in the
documentation of recording and propagating namespaces. However, I
can't figure out if there's an incremental mode. A page on the wiki[2]
suggests that SAX is supported in the "html tag soup" parser, but I
want incremental parsing of *valid* documents. If incremental parsing
is supported by the standard "arrow" interface, I don't see any
obvious way to pull events out into a list -- I'm a Haskell newbie,
and still haven't quite figured out monads yet, let alone Arrows.
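
For reference, the namespace resolution asked for in the first point
(turning "x:ref" into ("org:myproject:otherns", "ref")) can be sketched
independently of any library, assuming the parser exposes raw element
names and raw (name, value) attribute pairs. The names Scope,
declarations, and resolve are hypothetical; the sketch handles element
names only and does not cover attribute names or the predefined "xml"
prefix.

```haskell
import Data.Maybe (fromMaybe)

type Namespace = String
type LocalName = String

-- Declarations made on one element: prefix -> namespace URI.
-- The empty prefix "" stands for the default namespace.
type Scope = [(String, Namespace)]

-- Collect the xmlns declarations from an element's raw attributes.
declarations :: [(String, String)] -> Scope
declarations attrs =
  [ (prefix, uri)
  | (name, uri) <- attrs
  , Just prefix <- [declaredPrefix name]
  ]
  where
    declaredPrefix "xmlns" = Just ""
    declaredPrefix ('x':'m':'l':'n':'s':':':p) = Just p
    declaredPrefix _ = Nothing

-- Resolve a raw element name against the scopes of all open elements,
-- innermost first. Returns Nothing for an undeclared prefix.
resolve :: [Scope] -> String -> Maybe (Namespace, LocalName)
resolve scopes qname =
  case break (== ':') qname of
    (prefix, ':' : local) -> fmap (\ns -> (ns, local)) (lookupPrefix prefix)
    -- Unprefixed: use the default namespace, or "" if none is declared.
    (local, _)            -> Just (fromMaybe "" (lookupPrefix ""), local)
  where
    -- lookup returns the first match, so inner scopes shadow outer ones.
    lookupPrefix p = lookup p (concat scopes)
```

With the example document, the scope pushed for <doc> resolves "doc" and
"x:ref" to the two org:myproject namespaces, and the scope pushed for
<html> shadows the default namespace for "body". A Nothing result could
be reported as an EventError.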

Are there any libraries that support namespace-aware incremental parsing?

[1] http://www.haskell.org/pipermail/haskell-cafe/2004-June/006252.html
[2] http://www.haskell.org/haskellwiki/HXT/Conversion_of_Haskell_data_from/to_XML