haskell xml parsing for larger files?

Chris Smith cdsmith at gmail.com
Thu Feb 20 15:01:50 UTC 2014


Ah, I'd misunderstood your question, and thought you were looking for a
SAX-like alternative.
On Feb 20, 2014 6:57 AM, "Christian Maeder" <Christian.Maeder at dfki.de>
wrote:

> I've just tried:
>
>   import Text.HTML.TagSoup
>   import Text.HTML.TagSoup.Tree
>
>   main :: IO ()
>   main = getContents >>= putStr . renderTags . flattenTree . tagTree . parseTags
>
> which also ends with the getMBlock error.
> Only "renderTags . parseTags" works fine (like the hexpat SAX parser).
>
> Why should tagsoup be better suited for building trees from large files?
>
> C.
>
> On 20.02.2014 15:30, Chris Smith wrote:
>
>> Have you looked at tagsoup?
>>
>> On Feb 20, 2014 3:30 AM, "Christian Maeder" <Christian.Maeder at dfki.de> wrote:
>>
>>     Hi,
>>
>>     I'm having difficulty parsing "large" XML files (> 100 MB).
>>     A plain SAX parser, as provided by hexpat, is fine. However,
>>     constructing a tree consumes too much memory on a 32-bit machine.
>>
>>     See http://trac.informatik.uni-bremen.de:8080/hets/ticket/1248
>>
>>     I suspect that sharing strings when constructing trees might greatly
>>     reduce memory requirements. What are suitable libraries for string
>>     pools?
>>
>>     Before trying to implement something myself, I'd like to ask who
>>     else has tried to process large XML files (and met similar memory
>>     problems)?
>>
>>     I have not yet investigated xml-conduit and hxt for our purpose.
>>     (These look scary.)
>>
>>     In fact, I've basically used the content trees from "The (simple)
>>     xml package", and switching to another tree type is no fun,
>>     especially if it does not gain much.
>>
>>     Thanks Christian
>>
>>
>
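
The string-pool idea raised in the original question can be sketched in a
few lines. This is only a minimal illustration, assuming plain String
values and Data.Map from the containers package; the names Pool, intern and
internAll are hypothetical and do not come from hexpat, tagsoup or the xml
package.

```haskell
import qualified Data.Map.Strict as Map
import Data.List (mapAccumL)

-- | Hypothetical pool type: maps each distinct string to the one
-- canonical copy that all users should share.
type Pool = Map.Map String String

-- | Return the canonical copy of a string. A string that has not been
-- seen before is stored in the pool and becomes the canonical copy.
intern :: Pool -> String -> (Pool, String)
intern pool s = case Map.lookup s pool of
  Just shared -> (pool, shared)               -- reuse the stored copy
  Nothing     -> (Map.insert s s pool, s)     -- first occurrence

-- | Intern every string in a list, threading the pool from left to right.
internAll :: Pool -> [String] -> (Pool, [String])
internAll = mapAccumL intern
```

Running internAll over the element and attribute names seen while building
a tree would make repeated names point to one heap object instead of many
equal copies. Whether that is enough for a 100 MB document on a 32-bit
machine is untested here; a more compact string type such as Text or
ByteString may matter at least as much.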