[Haskell-cafe] memory needed for SAX parsing XML
dagit at codersbase.com
Mon Apr 19 12:04:59 EDT 2010
On Mon, Apr 19, 2010 at 3:01 AM, Daniil Elovkov <
daniil.elovkov at googlemail.com> wrote:
> Hello haskellers!
> I'm trying to process an xml file with as little footprint as possible. SAX
> is alright for my case, and I think that's the lightest way possible. So,
> I'm looking at HaXml.SAX
> I'm surprised to see that it takes about 56-60 MB of ram. This seems
> constant relative to xml file size, which is expected. Only slightly depends
> on it as I recursively traverse the list of sax events. But it seems like
> too much.
For me these sorts of problems always involve investigation into the root
cause. I'm just not good enough at predicting what is causing the memory
consumption. Thankfully, GHC has great tools for this sort of investigative
work. The book Real World Haskell documents how to use those tools.
If you haven't already, I highly recommend looking at the profiling graphs.
See if you can figure out if your program has any space leaks.
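As a standard illustration (not from the original message) of the kind of space leak a heap profile reveals, here is the classic lazy-accumulator leak and its strict fix:

```haskell
import Data.List (foldl')

-- foldl builds a chain of unevaluated thunks ((((0+1)+2)+3)+...),
-- which shows up in a heap profile as steadily growing memory.
sumLazy :: [Int] -> Int
sumLazy = foldl (+) 0

-- foldl' forces the accumulator at every step, so the same fold
-- runs in constant space.
sumStrict :: [Int] -> Int
sumStrict = foldl' (+) 0

main :: IO ()
main = print (sumStrict [1 .. 1000000])  -- prints 500000500000
```

Compiling with `-prof -fprof-auto` and running with `+RTS -hc` makes the difference between the two versions visible in the heap graph.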
> The size of the file is from 1MB to 20MB.
> The code is something like this
> main = do
>   (fn:_) <- getArgs
>   h <- openFile fn ReadMode
>   c <- hGetContents h
>   let out = proc $ fst $ saxParse fn c
>   putStrLn out
For such a simple program you won't run into any problem with lazy IO, but
as your program grows in complexity it will very likely come back to bite
you. If you're not familiar with lazy IO, I'm referring to the
hGetContents. Some example problems:
1) If you opened many files this way, you could run out of file handles
(lazy IO closes handles at unpredictable times, and file handles are a scarce
resource). The safe-io package on Hackage can help you avoid this.
2) Reading of the file will happen during your pure code. This implies that
IO exceptions can happen in your pure code. It also means that in some ways
you'll be able to observe side-effects in your pure code.
3) If you were to reference 'c' from two places in main, the GC would not
collect any of it until both references were collectable. To avoid that
leak, you'd need to load the data twice.
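One way around all three pitfalls is to force the file contents while the handle is still open. A minimal sketch (the helper name is mine, not from the post):

```haskell
import System.Environment (getArgs)
import System.IO

-- Hypothetical helper: read a file strictly inside withFile, so the
-- handle closes deterministically and any IO exception fires here,
-- rather than later inside pure code consuming the string.
readFileStrict :: FilePath -> IO String
readFileStrict path = withFile path ReadMode $ \h -> do
  c <- hGetContents h
  length c `seq` return c  -- force the whole string while h is open

main :: IO ()
main = do
  (fn:_) <- getArgs
  c <- readFileStrict fn
  putStrLn (take 80 c)
```

Note the trade-off: this holds the entire file in memory at once, which fixes predictability but works against the original goal of a small footprint; for that you want streaming, as discussed below.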
I'm sure there are other things that can go wrong that I've missed.
I think iteratees are slowly catching on as an alternative to lazy IO.
Basically, the iteratee approach uses a left-fold style to stream the data
and process it in chunks, including some exception handling. Unfortunately,
I think it may also require a special SAX parser that is specifically geared
towards iteratee use. Having an iteratee-based SAX parser would make
processing large XML streams very convenient in Haskell. Hint, hint, if you
want to write a library :) (Or maybe it already exists; I admit that I haven't
checked.)
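To show the shape of the idea without the iteratee library, here is a hand-rolled left fold over fixed-size chunks of a file (names and chunk size are my own choices, not from the post):

```haskell
{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString as B
import System.IO

-- A left fold over 64 KB chunks of a file: the essence of the
-- iteratee approach, minus the library. The bang pattern forces
-- the accumulator at each step, so memory use stays proportional
-- to the chunk size rather than the file size.
foldChunks :: (a -> B.ByteString -> a) -> a -> FilePath -> IO a
foldChunks step acc0 path = withFile path ReadMode (go acc0)
  where
    go !acc h = do
      chunk <- B.hGet h 65536      -- read at most 64 KB at a time
      if B.null chunk
        then return acc
        else go (step acc chunk) h

-- Example: count the bytes of a file in constant space.
countBytes :: FilePath -> IO Int
countBytes = foldChunks (\n c -> n + B.length c) 0
```

An iteratee-style SAX parser would essentially replace the `step` function with an incremental parser that consumes each chunk and emits events as it goes.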
I hope that helps,