[Haskell-cafe] Incremental XML parsing with namespaces?
John Millikin
jmillikin at gmail.com
Mon Jun 8 23:00:03 EDT 2009
On Mon, Jun 8, 2009 at 3:39 PM, Henning Thielemann <lemming at henning-thielemann.de> wrote:
> I think you could use the parser as it is and do the name parsing later.
> Due to lazy evaluation both parsers would run in an interleaved way.
>
I've been trying to figure out how to get this to work with lazy
evaluation, but haven't made much headway. Tips? The only way I can
think of to get incremental parsing working is to maintain explicit
state, but I also can't figure out how to achieve this with the
parsers I've tested (HaXml, HXT, hexpat).
Here's a working example of what I'm trying to do, in Python. It reads
XML from stdin, prints events as they are parsed, and will terminate
when the document ends:
##########################
from xml.sax import handler, saxutils, expatreader

class ContentHandler (handler.ContentHandler):
    def __init__ (self):
        self.events = []
        self.level = 0

    def startElementNS (self, ns_name, lname, attrs):
        self.events.append (("BEGIN", ns_name, lname, dict (attrs)))
        self.level += 1

    def endElementNS (self, ns_name, lname):
        self.events.append (("END", ns_name, lname))
        self.level -= 1

    def characters (self, content):
        self.events.append (("TEXT", content))

def main ():
    parser = expatreader.ExpatParser ()
    content = ContentHandler ()
    parser.setFeature (handler.feature_namespaces, True)
    parser.setContentHandler (content)

    # Keep reading until the root element has been opened and closed again.
    got_events = False
    while content.level > 0 or (not got_events):
        text = raw_input ("Enter XML:\n")
        parser.feed (text)
        print content.events
        content.events = []
        got_events = True

if __name__ == "__main__": main()
###############################
$ python incremental.py
Enter XML:
<test xmlns="urn:test"><test2><test3>
[('BEGIN', (u'urn:test', u'test'), u'test', {}), ('BEGIN',
(u'urn:test', u'test2'), u'test2', {}), ('BEGIN', (u'urn:test',
u'test3'), u'test3', {})]
Enter XML:
</test3></test2><test2 a="b"/>text content goes here
[('END', (u'urn:test', u'test3'), None), ('END', (u'urn:test',
u'test2'), None), ('BEGIN', (u'urn:test', u'test2'), u'test2', {(None,
u'a'): u'b'}), ('END', (u'urn:test', u'test2'), None), ('TEXT', u'text
content goes here')]
Enter XML:
</test>
[('END', (u'urn:test', u'test'), None)]
#############################
As demonstrated, the parser retains state (namespaces, nesting)
between text inputs. Are there any XML parsers for Haskell that
support this incremental behavior?