[Haskell-cafe] Re: XML (HXML) parsing :: GHC 6.8.3 space leak
vlm at lionet.info
Fri Sep 19 21:24:22 EDT 2008
Lev Walkin wrote:
> Simon Marlow wrote:
>> Lev Walkin wrote:
>>> I wondered why would a contemporary GHC 6.8.3 exhibit such a leak?
>>> After all, the technique was known in 2000 (and afir by Wadler in '87)
>>> and one would assume Joe English's reference to "most other Haskell
>>> systems" ought to mean GHC.
>> Thanks for this nice example - Don Stewart pointed me to it, and
>> Simon PJ and I just spent some time this morning diagnosing it.
>> Incedentally, with GHC 6.8 you can just run the program with "+RTS
>> -hT" to get a basic space profile, there's no need to compile it for
>> profiling - this is tremendously useful for quick profiling jobs. And
>> in this case we see the the heap is filling up with (:) and Tree
>> constructors, no thunks.
>> Here's the short story: GHC does have the space leak optimisation you
>> refer to, and it is working correctly, but it doesn't cover all the
>> cases you might want it to cover. In particular, optimisations
>> sometimes interact badly with the space leak avoidance, and that's
>> what is happening here. We've known about the problem for some time,
>> but this is the first time I've seen a nice small example that
>> demonstrates it.
>>> -- Lazily build a tree out of a sequence of tree-building events
>>> build :: [TreeEvent] -> ([UnconsumedEvent], [Tree String])
>>> build (Start str : es) =
>>> let (es', subnodes) = build es
>>> (spill, siblings) = build es'
>>> in (spill, (Tree str subnodes : siblings))
>>> build (Leaf str : es) =
>>> let (spill, siblings) = build es
>>> in (spill, Tree str  : siblings)
>>> build (Stop : es) = (es, )
>>> build  = (, )
>> We don't know of a good way to fix this problem. I'm going to record
>> this example in a ticket for future reference, though.
> is there a way, perhaps, to rewrite this expression to avoid leaks?
> An ad-hoc will do, perhaps split in two modules to avoid intramodular
Tried to avoid this misoptimization by using explicit fst, and
it worked on my synthesized input (probably benefiting of CSE):
build :: [TreeEvent] -> ([UnconsumedEvent], [Tree String])
build (Start str : es) =
let (_, subnodes) = build es
(spill, siblings) = build . fst . build $ es
in (spill, (Tree str subnodes : siblings))
build (Leaf str : es) =
let (spill, siblings) = build es
in (spill, Tree str  : siblings)
build (Stop : es) = (es, )
build  = (, )
However, while this solution works on a synthesized input (cycle [...]),
it still has memory leak when taken into HXML environment which
operates on files (why?).
Only when I also added Ketil Malde's `par` based hack I finally
was able to parse the big XML file without a space leak. Here's
the diff to HXML 0.2:
--- TreeBuild.hs.old 2008-09-19 17:01:30.000000000 -0700
+++ TreeBuild.hs 2008-09-19 17:04:15.000000000 -0700
@@ -20,6 +20,7 @@
-- TODO: add basic error-checks: matching end-tags, ensure input exhausted
@@ -43,8 +44,9 @@
addTree t es = let (s,es') = build es in pair (cons t s) es'
build  = pair nil 
build (e:es) = case e of
- StartEvent gi atts -> let (c,es') = build es
- in addNode (ELNode gi atts) c es'
+ StartEvent gi atts -> let (c, es') = build es
+ sbl = build . snd . build $ es
+ in sbl `par` (cons (tree (ELNode gi atts) c) (fst sbl), snd sbl)
EndEvent _ -> pair nil es
EmptyEvent gi atts -> addLeaf (ELNode gi atts) es
TextEvent s -> addLeaf (TXNode s) es
With that, a 45 mb XML is parsed in constant space in
G4 1.5GHz: 1 minute 48 seconds, taking 16 mb RAM
Pentium D 2x3.0GHz: 12 seconds, taking 9 mb RAM
Compared to 0.2s `wc -l`.
* remove `par` from there or
* replace (build . snd . build $ es) with just (es') or
* forget to specify -threaded (-smp) during ghc compilation
then the space leak will exhibit itself again.
However, removing -threaded will still make this code run without leak
on synthesized input (StartEvent ""  : cycle [TextEvent ""]).
I believe there's a way to get rid of `par`, perhaps by wrapping
this tree building thing into a optimization-unfriendly monad?
But I don't know how to approach this. Any help?
vlm at lionet.info
More information about the Haskell-Cafe