[Haskell-cafe] Re: XML (HXML) parsing :: GHC 6.8.3 space leak
from 2000
Lev Walkin
vlm at lionet.info
Fri Sep 19 21:24:22 EDT 2008
Lev Walkin wrote:
> Simon Marlow wrote:
>> Lev Walkin wrote:
>>
>>> I wondered why would a contemporary GHC 6.8.3 exhibit such a leak?
>>> After all, the technique was known in 2000 (and afir by Wadler in '87)
>>> and one would assume Joe English's reference to "most other Haskell
>>> systems" ought to mean GHC.
>>
>> Thanks for this nice example - Don Stewart pointed me to it, and
>> Simon PJ and I just spent some time this morning diagnosing it.
>>
>> Incedentally, with GHC 6.8 you can just run the program with "+RTS
>> -hT" to get a basic space profile, there's no need to compile it for
>> profiling - this is tremendously useful for quick profiling jobs. And
>> in this case we see the the heap is filling up with (:) and Tree
>> constructors, no thunks.
>>
>> Here's the short story: GHC does have the space leak optimisation you
>> refer to, and it is working correctly, but it doesn't cover all the
>> cases you might want it to cover. In particular, optimisations
>> sometimes interact badly with the space leak avoidance, and that's
>> what is happening here. We've known about the problem for some time,
>> but this is the first time I've seen a nice small example that
>> demonstrates it.
>>
>>> -- Lazily build a tree out of a sequence of tree-building events
>>> build :: [TreeEvent] -> ([UnconsumedEvent], [Tree String])
>>> build (Start str : es) =
>>> let (es', subnodes) = build es
>>> (spill, siblings) = build es'
>>> in (spill, (Tree str subnodes : siblings))
>>> build (Leaf str : es) =
>>> let (spill, siblings) = build es
>>> in (spill, Tree str [] : siblings)
>>> build (Stop : es) = (es, [])
>>> build [] = ([], [])
>
> [skip]
>
>> We don't know of a good way to fix this problem. I'm going to record
>> this example in a ticket for future reference, though.
>
> Simon,
>
> is there a way, perhaps, to rewrite this expression to avoid leaks?
> An ad-hoc will do, perhaps split in two modules to avoid intramodular
> optimizations?
Tried to avoid this misoptimization by using explicit fst, and
it worked on my synthesized input (probably benefiting of CSE):
build :: [TreeEvent] -> ([UnconsumedEvent], [Tree String])
build (Start str : es) =
let (_, subnodes) = build es
(spill, siblings) = build . fst . build $ es
in (spill, (Tree str subnodes : siblings))
build (Leaf str : es) =
let (spill, siblings) = build es
in (spill, Tree str [] : siblings)
build (Stop : es) = (es, [])
build [] = ([], [])
However, while this solution works on a synthesized input (cycle [...]),
it still has memory leak when taken into HXML environment which
operates on files (why?).
Only when I also added Ketil Malde's `par` based hack I finally
was able to parse the big XML file without a space leak. Here's
the diff to HXML 0.2:
======================================================================
--- TreeBuild.hs.old 2008-09-19 17:01:30.000000000 -0700
+++ TreeBuild.hs 2008-09-19 17:04:15.000000000 -0700
@@ -20,6 +20,7 @@
import XMLParse
import XML
import Tree
+import Control.Parallel
--
-- TODO: add basic error-checks: matching end-tags, ensure input exhausted
@@ -43,8 +44,9 @@
addTree t es = let (s,es') = build es in pair (cons t s) es'
build [] = pair nil []
build (e:es) = case e of
- StartEvent gi atts -> let (c,es') = build es
- in addNode (ELNode gi atts) c es'
+ StartEvent gi atts -> let (c, es') = build es
+ sbl = build . snd . build $ es
+ in sbl `par` (cons (tree (ELNode gi atts) c) (fst sbl), snd sbl)
EndEvent _ -> pair nil es
EmptyEvent gi atts -> addLeaf (ELNode gi atts) es
TextEvent s -> addLeaf (TXNode s) es
=======================================================================
With that, a 45 mb XML is parsed in constant space in
G4 1.5GHz: 1 minute 48 seconds, taking 16 mb RAM
Pentium D 2x3.0GHz: 12 seconds, taking 9 mb RAM
Compared to 0.2s `wc -l`.
If you
* remove `par` from there or
* replace (build . snd . build $ es) with just (es') or
* forget to specify -threaded (-smp) during ghc compilation
then the space leak will exhibit itself again.
However, removing -threaded will still make this code run without leak
on synthesized input (StartEvent "" [] : cycle [TextEvent ""]).
I believe there's a way to get rid of `par`, perhaps by wrapping
this tree building thing into a optimization-unfriendly monad?
But I don't know how to approach this. Any help?
--
Lev Walkin
vlm at lionet.info
More information about the Haskell-Cafe
mailing list