[Haskell-cafe] Incremental XML parsing with namespaces?

Krzysztof Skrzętnicki gtener at gmail.com
Tue Jun 9 04:21:24 EDT 2009


On Mon, Jun 8, 2009 at 20:39, John Millikin<jmillikin at gmail.com> wrote:
> I'm trying to convert an XML document, incrementally, into a sequence
> of XML events. A simple example XML document:
>
> <doc xmlns="org:myproject:mainns" xmlns:x="org:myproject:otherns">
>    <title>Doc title</title>
>    <x:ref>abc1234</x:ref>
>    <html xmlns="http://www.w3.org/1999/xhtml"><body>Hello world!</body></html>
> </doc>
>
> The document can be very large, and arrives in chunks over a socket,
> so I need to be able to "feed" the text data into a parser and receive
> a list of XML events per chunk. Chunks can be separated in time by
> intervals of several minutes to an hour, so pausing processing for the
> arrival of the entire document is not an option. The type signatures
> would be something like:
>
> type Namespace = String
> type LocalName = String
>
> data Attribute = Attribute Namespace LocalName String
>
> data XMLEvent =
>    EventElementBegin Namespace LocalName [Attribute] |
>    EventElementEnd Namespace LocalName |
>    EventContent String |
>   EventError String
>
> parse :: Parser -> String -> (Parser, [XMLEvent])
>
> I've looked at HaXml, HXT, and hexpat, and unless I'm missing
> something, none of them can achieve this:
>
> + HaXml and hexpat seem to disregard namespaces entirely -- that is,
> the root element is parsed to "doc" instead of
> ("org:myproject:mainns", "doc"), and the second child is "x:ref"
> instead of ("org:myproject:otherns", "ref"). Obviously, this makes
> parsing mixed-namespace documents effectively impossible. I found an
> email from 2004[1] that mentions a "filter" for namespace support in
> HaXml, but no further information and no working code.

I would recommend hexpat to do the job. Contrary to what you are
saying, hexpat does offer namespace handling:
http://hackage.haskell.org/packages/archive/hexpat/0.8/doc/html/Text-XML-Expat-Namespaced.html
Perhaps you need more than that?

Personally I found hexpat to be fast, space efficient and easy to use.

Here is the representation I got for your example. Please note the
namespaces in right places.
* > (toNamespaced ( toQualified t'))
Element {eName = NName {nnNamespace = Just "org:myproject:mainns",
nnLocalPart = "doc"}, eAttrs = [(NName {nnNamespace = Just
"http://www.w3.org/2000/xmlns/", nnLocalPart =
"x"},"org:myproject:otherns"),(NName {nnNamespace = Just
"org:myproject:mainns", nnLocalPart =
"xmlns"},"org:myproject:mainns")], eChildren = [Text "\n",Text "
",Element {eName = NName {nnNamespace = Just "org:myproject:mainns",
nnLocalPart = "title"}, eAttrs = [], eChildren = [Text "Doc
title"]},Text "\n",Text "    ",Element {eName = NName {nnNamespace =
Just "org:myproject:otherns", nnLocalPart = "ref"}, eAttrs = [],
eChildren = [Text "abc1234"]},Text "\n",Text "    ",Element {eName =
NName {nnNamespace = Just "http://www.w3.org/1999/xhtml", nnLocalPart
= "html"}, eAttrs = [(NName {nnNamespace = Just
"http://www.w3.org/1999/xhtml", nnLocalPart =
"xmlns"},"http://www.w3.org/1999/xhtml")], eChildren = [Element {eName
= NName {nnNamespace = Just "http://www.w3.org/1999/xhtml",
nnLocalPart = "body"}, eAttrs = [], eChildren = [Text "Hello
world!"]}]},Text "\n"]}

Best regards

Krzysztof Skrzętnicki


More information about the Haskell-Cafe mailing list