[Haskell-cafe] Re: XML (HXML) parsing :: GHC 6.8.3 space leak from 2000

Lev Walkin vlm at lionet.info
Fri Sep 19 21:24:22 EDT 2008


Lev Walkin wrote:
> Simon Marlow wrote:
>> Lev Walkin wrote:
>>
>>> I wondered why would a contemporary GHC 6.8.3 exhibit such a leak?
>>> After all, the technique was known in 2000 (and afir by Wadler in '87)
>>> and one would assume Joe English's reference to "most other Haskell
>>> systems" ought to mean GHC.
>>
>> Thanks for this nice example - Don Stewart pointed me to it, and  
>> Simon PJ and I just spent some time this morning diagnosing it.
>>
>> Incedentally, with GHC 6.8 you can just run the program with "+RTS 
>> -hT" to get a basic space profile, there's no need to compile it for 
>> profiling - this is tremendously useful for quick profiling jobs.  And 
>> in this case we see the the heap is filling up with (:) and Tree 
>> constructors, no thunks.
>>
>> Here's the short story:  GHC does have the space leak optimisation you 
>> refer to, and it is working correctly, but it doesn't cover all the 
>> cases you might want it to cover.  In particular, optimisations 
>> sometimes interact badly with the space leak avoidance, and that's 
>> what is happening here.  We've known about the problem for some time, 
>> but this is the first time I've seen a nice small example that 
>> demonstrates it.
>>
>>>     -- Lazily build a tree out of a sequence of tree-building events
>>>     build :: [TreeEvent] -> ([UnconsumedEvent], [Tree String])
>>>     build (Start str : es) =
>>>             let (es', subnodes) = build es
>>>                 (spill, siblings) = build es'
>>>             in (spill, (Tree str subnodes : siblings))
>>>     build (Leaf str : es) =
>>>             let (spill, siblings) = build es
>>>             in (spill, Tree str [] : siblings)
>>>     build (Stop : es) = (es, [])
>>>     build [] = ([], [])
> 
> [skip]
> 
>> We don't know of a good way to fix this problem.  I'm going to record 
>> this example in a ticket for future reference, though.
> 
> Simon,
> 
> is there a way, perhaps, to rewrite this expression to avoid leaks?
> An ad-hoc will do, perhaps split in two modules to avoid intramodular
> optimizations?

Tried to avoid this misoptimization by using explicit fst, and
it worked on my synthesized input (probably benefiting of CSE):

build :: [TreeEvent] -> ([UnconsumedEvent], [Tree String])
build (Start str : es) =
         let (_, subnodes) = build es
             (spill, siblings) = build . fst . build $ es
         in (spill, (Tree str subnodes : siblings))
build (Leaf str : es) =
         let (spill, siblings) = build es
         in (spill, Tree str [] : siblings)
build (Stop : es) = (es, [])
build [] = ([], [])

However, while this solution works on a synthesized input (cycle [...]),
it still has memory leak when taken into HXML environment which
operates on files (why?).

Only when I also added Ketil Malde's `par` based hack I finally
was able to parse the big XML file without a space leak. Here's
the diff to HXML 0.2:

======================================================================
--- TreeBuild.hs.old	2008-09-19 17:01:30.000000000 -0700
+++ TreeBuild.hs	2008-09-19 17:04:15.000000000 -0700
@@ -20,6 +20,7 @@
  import XMLParse
  import XML
  import Tree
+import Control.Parallel

  --
  -- TODO: add basic error-checks: matching end-tags, ensure input exhausted
@@ -43,8 +44,9 @@
  	addTree t es		= let (s,es') = build es in pair (cons t s) es'
  	build [] 		= pair nil []
  	build (e:es) = case e of
-	    StartEvent gi atts	-> let (c,es') = build es
-	    			   in addNode (ELNode gi atts) c es'
+	    StartEvent gi atts  -> let (c, es') = build es
+				       sbl = build . snd . build $ es
+				   in sbl `par` (cons (tree (ELNode gi atts) c) (fst sbl), snd sbl)
  	    EndEvent _		-> pair nil es
  	    EmptyEvent gi atts	-> addLeaf (ELNode gi atts) es
  	    TextEvent s		-> addLeaf (TXNode s) es
=======================================================================

With that, a 45 mb XML is parsed in constant space in

G4 1.5GHz: 1 minute 48 seconds, taking 16 mb RAM
Pentium D 2x3.0GHz: 12 seconds, taking 9 mb RAM

Compared to 0.2s `wc -l`.

If you
   * remove `par` from there or
   * replace (build . snd . build $ es) with just (es') or
   * forget to specify -threaded (-smp) during ghc compilation
then the space leak will exhibit itself again.

However, removing -threaded will still make this code run without leak
on synthesized input (StartEvent "" [] : cycle [TextEvent ""]).

I believe there's a way to get rid of `par`, perhaps by wrapping
this tree building thing into a optimization-unfriendly monad?
But I don't know how to approach this. Any help?

-- 
Lev Walkin
vlm at lionet.info


More information about the Haskell-Cafe mailing list