haskell xml parsing for larger files?

Christian Maeder Christian.Maeder at dfki.de
Thu Feb 20 14:56:59 UTC 2014


I've just tried:

   import Text.HTML.TagSoup
   import Text.HTML.TagSoup.Tree

   main :: IO ()
   main = getContents >>=
     putStr . renderTags . flattenTree . tagTree . parseTags

which also ends with the getMBlock error.
Only "renderTags . parseTags" works fine (like the hexpat SAX parser).
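
For comparison, here is a minimal sketch of what processing the tag
stream without a tree could look like. It folds over the list from
parseTags and keeps only a small strict summary. The names Stats, step
and summarize are made up for illustration, and I have not measured
whether this stays within bounds on the files in question.

```haskell
import Data.List (foldl')
import Text.HTML.TagSoup (Tag (..), parseTags)

-- | Running summary: elements seen, current depth, maximum depth.
--   (Hypothetical type, not part of tagsoup.)
data Stats = Stats !Int !Int !Int
  deriving (Eq, Show)

-- | Update the summary for one tag. Declarations such as <?xml ...?>
--   and <!DOCTYPE ...> are skipped, since they have no closing tag.
step :: Stats -> Tag String -> Stats
step s@(Stats n d m) tag = case tag of
  TagOpen name _
    | isDeclaration name -> s
    | otherwise          -> let d' = d + 1
                            in Stats (n + 1) d' (max m d')
  TagClose _             -> Stats n (max 0 (d - 1)) m
  _                      -> s
  where
    isDeclaration ('?' : _) = True
    isDeclaration ('!' : _) = True
    isDeclaration _         = False

-- | Strict left fold over the tag list, so no tree is ever built.
summarize :: String -> Stats
summarize = foldl' step (Stats 0 0 0) . parseTags
```

This could be run as "getContents >>= print . summarize". The fold
consumes each tag as it is produced, in the same way that
"renderTags . parseTags" does.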

Why should tagsoup be better suited for building trees from large files?

C.

On 20.02.2014 15:30, Chris Smith wrote:
> Have you looked at tagsoup?
>
> On Feb 20, 2014 3:30 AM, "Christian Maeder"
> <Christian.Maeder at dfki.de> wrote:
>
>     Hi,
>
>     I've got some difficulties parsing "large" xml files (> 100MB).
>     A plain SAX parser, as provided by hexpat, is fine. However,
>     constructing a tree consumes too much memory on a 32-bit machine.
>
>     see http://trac.informatik.uni-bremen.de:8080/hets/ticket/1248
>
>     I suspect that sharing strings when constructing trees might greatly
>     reduce memory requirements. What are suitable libraries for string
>     pools?
>
>     Before trying to implement something myself, I'd like to ask who
>     else has tried to process large xml files (and met similar memory
>     problems)?
>
>     I have not yet investigated xml-conduit and hxt for our purpose.
>     (These look scary.)
>
>     In fact, I've basically used the content trees from "The (simple)
>     xml package" and switching to another tree type is no fun,
>     particularly if it does not gain much.
>
>     Thanks Christian
>
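
On the string-pool question in the quoted message, a minimal sketch of
the idea, using only Data.Map from the containers package. The names
Pool, intern and internAll are made up for illustration and do not
come from any existing library. Each string is looked up in a map and
replaced by the copy already stored there, so repeated element and
attribute names can share one heap object.

```haskell
import Data.List (mapAccumL)
import qualified Data.Map.Strict as Map

-- | Hypothetical pool type: maps each string to its one shared copy.
type Pool = Map.Map String String

-- | Return the pooled copy of a string, adding it on first sight.
--   Callers keep the returned string and drop their own copy.
intern :: Pool -> String -> (Pool, String)
intern pool s = case Map.lookup s pool of
  Just shared -> (pool, shared)
  Nothing     -> (Map.insert s s pool, s)

-- | Intern a list of strings, threading the pool left to right.
internAll :: Pool -> [String] -> (Pool, [String])
internAll = mapAccumL intern
```

For example, "internAll Map.empty names" returns the final pool
together with the list in which equal names refer to the same stored
string. Since String is a list of Char, each pooled entry is still
large, so how much this saves in practice would have to be measured.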
