[Haskell-cafe] Incremental XML parsing with namespaces?

Iavor Diatchki iavor.diatchki at gmail.com
Tue Jun 9 18:12:04 EDT 2009


Hi,
you may also want to look at:
http://hackage.haskell.org/cgi-bin/hackage-scripts/package/xml
It knows about namespaces, and its parser is lazy.
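
A minimal sketch of how that package might be used for the problem
quoted below, assuming its Text.XML.Light module (parseXML, the
Element/Content types, and the qName/qURI fields of QName). The
XMLEvent and Attribute types are copied from the quoted message; the
helper names nsOf, toAttribute and events are hypothetical, and mapping
a missing namespace to the empty string is an assumption of the sketch.

```haskell
import Data.Maybe (fromMaybe)
import Text.XML.Light

type Namespace = String
type LocalName = String

data Attribute = Attribute Namespace LocalName String
  deriving (Eq, Show)

data XMLEvent
  = EventElementBegin Namespace LocalName [Attribute]
  | EventElementEnd Namespace LocalName
  | EventContent String
  | EventError String
  deriving (Eq, Show)

-- Namespace URI of a qualified name; "" when none is in scope
-- (an assumption of this sketch, not something the library mandates).
nsOf :: QName -> Namespace
nsOf = fromMaybe "" . qURI

-- Convert a library attribute into the Attribute type from the question.
-- Namespace declarations (xmlns, xmlns:x) may show up here as ordinary
-- attributes; this sketch does not filter them out.
toAttribute :: Attr -> Attribute
toAttribute (Attr key val) = Attribute (nsOf key) (qName key) val

-- Flatten one parsed node into a begin/content/end event sequence.
events :: Content -> [XMLEvent]
events (Elem e) =
  [EventElementBegin ns name (map toAttribute (elAttribs e))]
    ++ concatMap events (elContent e)
    ++ [EventElementEnd ns name]
  where
    ns   = nsOf (elName e)
    name = qName (elName e)
events (Text t) = [EventContent (cdData t)]
-- Unresolved character/entity references are passed through as text.
events (CRef r) = [EventContent ('&' : r ++ ";")]

-- The example document from the quoted message.
sample :: String
sample = unlines
  [ "<doc xmlns=\"org:myproject:mainns\" xmlns:x=\"org:myproject:otherns\">"
  , "   <title>Doc title</title>"
  , "   <x:ref>abc1234</x:ref>"
  , "   <html xmlns=\"http://www.w3.org/1999/xhtml\"><body>Hello world!</body></html>"
  , "</doc>"
  ]

main :: IO ()
main = mapM_ print (concatMap events (parseXML sample))
```

This produces events from an already-built tree, so it is not the
parse :: Parser -> String -> (Parser, [XMLEvent]) interface asked for.
Whether events actually appear before the whole input has arrived
depends on how lazy the parser is on a lazily read String (for example
one obtained from hGetContents), which this sketch does not verify.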
-Iavor


On Mon, Jun 8, 2009 at 11:39 AM, John Millikin <jmillikin at gmail.com> wrote:
> I'm trying to convert an XML document, incrementally, into a sequence
> of XML events. A simple example XML document:
>
> <doc xmlns="org:myproject:mainns" xmlns:x="org:myproject:otherns">
>    <title>Doc title</title>
>    <x:ref>abc1234</x:ref>
>    <html xmlns="http://www.w3.org/1999/xhtml"><body>Hello world!</body></html>
> </doc>
>
> The document can be very large, and arrives in chunks over a socket,
> so I need to be able to "feed" the text data into a parser and receive
> a list of XML events per chunk. Chunks can be separated in time by
> intervals of several minutes to an hour, so pausing processing for the
> arrival of the entire document is not an option. The type signatures
> would be something like:
>
> type Namespace = String
> type LocalName = String
>
> data Attribute = Attribute Namespace LocalName String
>
> data XMLEvent =
>    EventElementBegin Namespace LocalName [Attribute] |
>    EventElementEnd Namespace LocalName |
>    EventContent String |
>    EventError String
>
> parse :: Parser -> String -> (Parser, [XMLEvent])
>
> I've looked at HaXml, HXT, and hexpat, and unless I'm missing
> something, none of them can achieve this:
>
> + HaXml and hexpat seem to disregard namespaces entirely -- that is,
> the root element is parsed to "doc" instead of
> ("org:myproject:mainns", "doc"), and the second child is "x:ref"
> instead of ("org:myproject:otherns", "ref"). Obviously, this makes
> parsing mixed-namespace documents effectively impossible. I found an
> email from 2004[1] that mentions a "filter" for namespace support in
> HaXml, but no further information and no working code.
>
> + HXT looks promising, because I see explicit mention in the
> documentation of recording and propagating namespaces. However, I
> can't figure out if there's an incremental mode. A page on the wiki[2]
> suggests that SAX is supported in the "html tag soup" parser, but I
> want incremental parsing of *valid* documents. If incremental parsing
> is supported by the standard "arrow" interface, I don't see any
> obvious way to pull events out into a list -- I'm a Haskell newbie,
> and still haven't quite figured out monads yet, let alone Arrows.
>
> Are there any libraries that support namespace-aware incremental parsing?
>
> [1] http://www.haskell.org/pipermail/haskell-cafe/2004-June/006252.html
> [2] http://www.haskell.org/haskellwiki/HXT/Conversion_of_Haskell_data_from/to_XML