[Haskell-cafe] memory needed for SAX parsing XML

Daniil Elovkov daniil.elovkov at googlemail.com
Tue Apr 20 04:08:36 EDT 2010


Jason Dagit wrote:
> 
> 
> On Mon, Apr 19, 2010 at 3:01 AM, Daniil Elovkov 
> <daniil.elovkov at googlemail.com> wrote:
> 
>     Hello haskellers!
> 
>     I'm trying to process an XML file with as small a memory footprint
>     as possible. SAX is alright for my case, and I think that's the
>     lightest way possible. So, I'm looking at HaXml.SAX
> 
>     I'm surprised to see that it takes about 56-60 MB of RAM. This seems
>     constant relative to the XML file size, which is expected; it only
>     depends slightly on it as I recursively traverse the list of SAX
>     events. But it seems like too much.
> 
> 
> For me these sorts of problems always involve investigation into the 
> root cause.  I'm just not good enough at predicting what is causing the 
> memory consumption.  Thankfully, GHC has great tools for this sort of 
> investigative work.  The book Real World Haskell documents how to use 
> those tools:
> http://book.realworldhaskell.org/read/profiling-and-optimization.html
> 
> If you haven't already, I highly recommend looking at the profiling 
> graphs.  See if you can figure out if your program has any space leaks.

No, I haven't profiled it yet. I just put together the simplest possible 
example (while evaluating a few XML parsers) and saw that result.

Given the triviality of the example, I thought it might be a "feature" of 
the HaXml SAX parser, and that somebody here would confirm that this is 
indeed so.

> 
> 
>     The size of the file is from 1MB to 20MB.
> 
>     The code is something like this:
> 
>     import System.Environment (getArgs)
>     import System.IO (openFile, IOMode(ReadMode), hGetContents)
>     import Text.XML.HaXml.SAX (saxParse)
>
>     -- proc (not shown) processes the [SaxElement] list into a String
>     main = do
>        (fn:_) <- getArgs
>        h <- openFile fn ReadMode
>        c <- hGetContents h
>        let out = proc $ fst $ saxParse fn c
>        putStrLn out
>        getChar    -- keep the process alive to observe memory use
> 
> 
> For such a simple program you won't run into any problem with lazy IO, 
> but as your program grows in complexity it will very likely come back to 
> bite you.  If you're not familiar with lazy IO, I'm referring to the 
> hGetContents.  Some example problems:

> ....

Thanks, I'm aware of that. Speaking of which, I've always felt a bit 
uncomfortable about lazy IO. I've only just now looked at the 
safe-lazy-io package.
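
To make the discomfort concrete, the classic pitfall is a minimal program 
like this (the file name is just illustrative): the handle is closed 
before the lazily read contents are ever demanded, so the string comes out 
silently truncated.

    import System.IO (withFile, IOMode(ReadMode), hGetContents)

    main :: IO ()
    main = do
      -- withFile closes the handle as soon as the inner action returns,
      -- but hGetContents has only promised to read the file lazily...
      s <- withFile "input.xml" ReadMode hGetContents
      -- ...so by the time the contents are demanded here, the handle is
      -- already closed and s is silently truncated (typically to "").
      print (length s)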

> I think iteratees are slowly catching on as an alternative to lazy IO.  
> Basically, the iteratee approach uses a left-fold style to stream the 
> data and process it in chunks, including some exception handling.  
> Unfortunately, I think it may also require a special SAX parser that is 
> specifically geared towards iteratee use.  Having an iteratee-based SAX 
> parser would make processing large XML streams very convenient in 
> Haskell.  Hint, hint, if you want to write a library :)  (Or maybe it 
> exists; I admit that I haven't checked.)


Iteratees seem like the natural thing if we want to avoid 
unsafeInterleaveIO completely. But unsafeInterleaveIO is so good for 
modularity.

It lets us glue IO code and pure code of type String -> a together 
seamlessly. With iteratees, as I understand it, a pure function of type 
String -> a would no longer be usable on an IO String result; its 
signature (and the code itself) would have to be rewritten as a left 
fold, as in the sketch below.
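
Roughly the contrast I have in mind, as a tiny sketch (the names and the 
use of String chunks are just illustrative, not taken from any particular 
iteratee library):

    -- lazy-IO style: a pure function over the whole input composes directly
    countTags :: String -> Int
    countTags = length . filter (== '<')

    -- left-fold / iteratee style: the same logic has to be rephrased as a
    -- step function over chunks, threading an accumulator; the iteratee
    -- machinery then folds this over the chunks it reads, starting from 0,
    -- instead of handing us one big String
    countTagsStep :: Int -> String -> Int
    countTagsStep acc chunk = acc + length (filter (== '<') chunk)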

Another approach would be to encapsulate unsafeInterleaveIO within some 
routine and not let it escape into the wild:

lazilyDoWithIO :: IO a -> (a -> b) -> IO b

It would use unsafeInterleaveIO internally but catch all IO errors within 
itself.
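
A minimal sketch of what I mean (the Either result and the NFData 
constraint are my additions; to be able to catch *all* the IO errors 
before returning, the result has to be forced completely, which of course 
gives up some of the laziness):

    import Control.DeepSeq (NFData, force)
    import Control.Exception (SomeException, evaluate, try)
    import System.IO.Unsafe (unsafeInterleaveIO)

    lazilyDoWithIO :: NFData b => IO a -> (a -> b) -> IO (Either SomeException b)
    lazilyDoWithIO io f = do
      a <- unsafeInterleaveIO io      -- defer the IO until f demands the value
      try (evaluate (force (f a)))    -- force the result so any deferred IO
                                      -- errors surface here and are caught

Without the deep forcing, an IO error could still escape later, when the 
caller evaluates a part of the result that was never touched here.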

I wonder if this is a reasonable idea. Maybe it has already been done?
So the topic is shifting...

-- 
Daniil Elovkov

