[Haskell-cafe] Re: XML (HXML) parsing :: GHC 6.8.3 space leak from 2000

Lev Walkin vlm at lionet.info
Tue Sep 23 09:35:02 EDT 2008


Marc A. Ziegert wrote:

>>> We don't know of a good way to fix this problem.  I'm going to record 
>>> this example in a ticket for future reference, though.
>> Simon,
>>
>> is there a way, perhaps, to rewrite this expression to avoid leaks?
>> An ad-hoc will do, perhaps split in two modules to avoid intramodular
>> optimizations?
>>
>> -- 
>> Lev Walkin
> 
> finally... there is a way! :D
> 
> hmm... this was a nice puzzle ;)
> 
> i've tried several times (and hours!) to implement a Continuation (not monad) based solution, but finally i developed this tricky but elegant foldr solution...
> i built the parser around this type:
>   type FoldR_Builder = (TreeEvent,[UnconsumedEvent]) -> [Either [UnconsumedEvent] (Tree String)] -> [Either [UnconsumedEvent] (Tree String)]
> 
> it is based on the following thought:
> the tuple
>   (rs,ps)::([Rest],[Processed]) -- with the restriction, which forces the list ps to be processed entirely before rs.
> is equipollent to
>   (fmap Right ps++[Left rs])::[Either [Rest] Processed]
> , but the latter is easier to handle ...at least if you can't trust the GC.

Marc, you are my hero of the month!

I can't say I understood this solution before applying it back to
HXML-0.2, but it surely worked and made quite observable 20%
difference in performance:

9.8 seconds on my 45 megabyte XML test, running in half the space
(4m) compared to my parallel version based on Ketil Malde's suggestion
(which was 12 seconds on two cores (though, one core was almost
idling, `par` was used purely for its side-effect)).

To those who wants to parse XML in constant space, attached find
a patch to HXML-0.2 which fixes a space leak.

However, I am still a bit surprized to discover there is not an order
of magnitude difference between `par`-based version and your builder.

While the foldr-based builder is clearly superior, one can't
help but wonder whether is it `par` that is so efficient compared
to crunching through Eithers, or there's some other bottleneck
in the code. Will profile a bit later.

The XML parsing space leak was declared in HXML back in 2000 and
lingered in the code for 8 years. Good riddance!

-- 
Lev Walkin
vlm at lionet.info
-------------- next part --------------
--- TreeBuild.hs.old	2008-09-23 05:48:50.000000000 -0700
+++ TreeBuild.hs	2008-09-23 05:49:37.000000000 -0700
@@ -20,6 +20,7 @@
 import XMLParse
 import XML
 import Tree
+import Data.List (tails)
 
 --
 -- TODO: add basic error-checks: matching end-tags, ensure input exhausted
@@ -31,28 +32,29 @@
 -- %%% Haskell systems) do implement it.
 -- %%% Thanks to Simon Peyton-Jones, Malcolm Wallace, Colin Runcinman
 -- %%% Mark Jones, and others for investigating this.
+-- %%% Update 23 Sep 2008: Leak-free solution is provided by Marc A. Ziegert
 
 buildTree :: [XMLEvent] -> Tree XMLNode
 buildTree = constructTree Tree (:) []
 
 constructTree :: (XMLNode -> f -> t) -> (t -> f -> f) -> f -> [XMLEvent] -> t
 constructTree tree cons nil events = let
-	pair x y 		= (x,y)
-	addNode nd children es	= addTree (tree nd children) es
-	addLeaf nd es		= addTree (tree nd nil) es
-	addTree t es		= let (s,es') = build es in pair (cons t s) es'
-	build [] 		= pair nil []
-	build (e:es) = case e of
-	    StartEvent gi atts	-> let (c,es') = build es 
-	    			   in addNode (ELNode gi atts) c es'
-	    EndEvent _		-> pair nil es
-	    EmptyEvent gi atts	-> addLeaf (ELNode gi atts) es
-	    TextEvent s		-> addLeaf (TXNode s) es
-	    PIEvent tgt val	-> addLeaf (PINode tgt val) es
-	    CommentEvent txt	-> addLeaf (CXNode txt) es
-	    GERefEvent name	-> addLeaf (ENNode name) es
-	    ErrorEvent s	-> error s  -- %%% deal with this
-	in tree RTNode (fst (build events))
+	-- Marc A. Ziegert has provided a leek-free solution
+	build tes = let (ts_,ue_,_) = splitAtLeftDefault [] $ foldr builder [] [(te,ue)|ue@(te:_)<-tails tes] in ts_
+	builder (EndEvent _,ue) euts = (Left ue:euts)
+	builder (EmptyEvent gi atts,_) euts = (Right (tree (ELNode gi atts) nil):euts)
+	builder (TextEvent str,_) euts = (Right (tree (TXNode str) nil):euts)
+	builder (PIEvent tgt val,_) euts = (Right (tree (PINode tgt val) nil):euts)
+	builder (CommentEvent txt,_) euts = (Right (tree (CXNode txt) nil):euts)
+	builder (GERefEvent name,_) euts = (Right (tree (ENNode name) nil):euts)
+	builder (ErrorEvent s,_) euts = error s -- %%% deal with this
+	builder (StartEvent gi atts,_) euts = let (sub,_,euts') = splitAtLeftDefault [] euts
+		in (Right (tree (ELNode gi atts) sub):euts')
+	splitAtLeftDefault a0 [] = (nil,a0,[])
+	splitAtLeftDefault a0 (Right b:xs) =
+		let (bs,a,es) = splitAtLeftDefault a0 xs in (cons b bs,a,es)
+	splitAtLeftDefault _ (Left a:xs) = (nil,a,xs)
+	in tree RTNode (build events)
 
 serializeTree :: Tree XMLNode -> [XMLEvent]
 serializeTree tree = sn tree [] where


More information about the Haskell-Cafe mailing list