[Haskell-cafe] Re: [Haskell] Re: Current XML libraries status
Uwe Schmidt
si at fh-wedel.de
Mon Oct 27 10:24:51 EDT 2008
Hello David,
> I tried to use HXT's readDocument with its tagsoup option for my
> application. I couldn't find a way to construct the operation that
> didn't run out of memory. I'll attach some code using HaXml's
> saxParse so you can see what I want. Is that easy to do in HXT?
> I simply want the text of <PMID> and <AbstractText> elements.
here's an example, that reads the input
in a lazy way. I ran this in ghci with a
file containing 2^20 XML Elements. The file
was about 18Mb in size.
A normal parse with the standard parsec parser
ran out of memory on my 1Gb box. This
one used within ghci about 200Mb max.
------------------------------------
module Main where
import Text.XML.HXT.Arrow
import System
main
= do
mapM_ main' names
main
= do
(name:_) <- getArgs
runX ( readDoc name
>>>
fromLA (deep (hasName "PIMD" -- select the nodes
<+>
hasName "AbstractText"
)
>>>
getChildren -- get the text
>>>
getText
)
>>>
arrIO putStrLn
)
putStrLn "main finished"
readDoc
= readDocument [ (a_tagsoup, v_1)
, (a_parse_xml, v_1)
, (a_remove_whitespace, v_1)
, (a_encoding, isoLatin1)
, (a_issue_warnings, v_0)
, (a_trace, "1")
]
---------------------
Cheers,
Uwe Schmidt
--
Uwe Schmidt
Web: http://www.fh-wedel.de/~si/
More information about the Haskell-Cafe
mailing list