Space faults in HaXml
Joe English
jenglish@flightlab.com
Sun, 04 Nov 2001 18:36:06 -0800
Last week, Dmitry Astapov posted a message to haskell-cafe
describing a HaXml program that was running out of memory
on large inputs. I've done some investigation on this;
here's what I've discovered so far:
+ The HaXml parser is overly eager; it doesn't produce
a value until the entire input has been read.
The heap profile for Dmitry's program shows heap
usage climbing steadily up to a large peak as the
document is being parsed, then a sharp drop as
output begins, then climbs again as the output
is produced. I suspect that this behaviour is typical
of most HaXml programs.
I replaced the parser with HXML (recently announced here);
this flattens out the first peak, but the second one persists.
(HXML uses an incremental parser -- instead of parsing a Document,
it parses individual Events (start-tag, character data, end-tag,
et cetera) and assembles them into a tree with a separate
function.)
+ nhc98's heap profiler is fantastic.
+ Under nhc98 (v1.10), 'putStr' doesn't seem to like overly-long
strings. This leads to trouble since the HaXml main driver uses
a single call to 'hPutStrLn' to write the entire output document.
Replacing this with 'mapM_ putChar' (or using GHC 5.02), fixes
this problem, but the program still leaks.
+ The nhc98 heap profiler is amazingly useful.
+ HaXml uses the Hughes & Peyton Jones Pretty printer to format
the output. This appears to have a slow leak. (Later tests
show that the peak consists of "void" heap cells from the
Pretty module, mostly suspended calls to Prelude.Int-
and Prelude.Int+.)
Figuring that it's overkill anyway I replaced it with a simpler
serializer.
+ After this change there's *still* a space fault, which
with the help of nhc's profiler (which, BTW, is great), I
tracked down to the HaXml 'cat' combinator:
cat :: [a -> [b]] -> (a -> [b])
cat fs = \e -> concat [f e | f <- fs]
Earlier I had reported that rewriting Dmitry's test case
to use the HXML internal representation directly instead
of converting to HaXml's representation fixed the space leak.
As it turns out, I didn't implement the 'cat' combinator,
but instead used a binary concatenation operator:
(+++) :: (a -> [b]) -> (a -> [b]) -> (a -> [b])
f +++ g = \x -> f x ++ g x
and my implementation of mkElem had the signature
mkElem :: String -> CFilter -> CFilter
instead of HaXml's
mkElem :: String -> [CFilter] -> CFilter.
which uses 'cat' to process the second argument.
Backporting this to HaXml:
mkElem' name cf = \x -> [CElem (Elem h [] (cf x))]
and modifying Dmitry's test case to use mkElem' and +++
instead of mkElem and cat finally fixes the problem.
I'm still not sure *why* this does the trick --
hopefully somebody smarter can explain that --
but the modified program runs in bounded space,
no matter how large the input file is.
(It even works in Hugs, which I found surprising, since
the HXML tree builder has a known problem when run with
Hugs' garbage collector.)
--Joe English
jenglish@flightlab.com