[Haskell-cafe] XML (HXML) parsing :: GHC 6.8.3 space leak from 2000

Lev Walkin vlm at lionet.info
Thu Sep 18 02:23:24 EDT 2008


Recently I had to process some multi-megabyte XML files. Tried a few
Haskell XML parsers (HaXML, HXT, HXML) but all of them were exhibiting
very pronounced space leaks, and all but HXML were too strict for my
input. Judging by the code and stated objectives, Joe English's HXML
(0.2, circa 2003) looked more promising for hacking around so I tried
to figure out the space leak problem.

It wasn't too long to find out the source of a problem, the buildTree
function in TreeBuild.hs. In fact, the very annotation to that function
reads as follows:

-- %%% There is apparently a space leak here, but I can't find it.
-- %%% Update 28 Feb 2000: There is a leak, but it's fixed
-- %%% by a well-known GC implementation technique.  Hugs 98 happens
-- %%% not to implement this technique, but STG Hugs (and most other
-- %%% Haskell systems) do implement it.
-- %%% Thanks to Simon Peyton-Jones, Malcolm Wallace, Colin Runcinman
-- %%% Mark Jones, and others for investigating this.

And there's some more in the accompanying documentation:

     + Under Hugs 98 only, suffers a serious space fault.

I wondered why would a contemporary GHC 6.8.3 exhibit such a leak?
After all, the technique was known in 2000 (and afir by Wadler in '87)
and one would assume Joe English's reference to "most other Haskell
systems" ought to mean GHC.

But here we are, in 2008 I still can't get HXML to not to leak like
a hose while lazily parsing my file. In fact, I can't get my 45-megabyte
file parsed on my 1GB RAM system without swapping.

So I went ahead and extracted the code and stripped all XML related
junk to reproduce the problem with a minimal test case. Attached please
find a single tree.hs module which is just sufficient to demonstrate a
memory leak. Here's a culprit function:

	data TreeEvent =
		Start String	-- Branch off a new subtree
		| Stop		-- Stop branching and return 1 level
		| Leaf String	-- A simple leaf without children
		deriving Show

	-- Lazily build a tree out of a sequence of tree-building events
	build :: [TreeEvent] -> ([UnconsumedEvent], [Tree String])
	build (Start str : es) =
	        let (es', subnodes) = build es
	            (spill, siblings) = build es'
	        in (spill, (Tree str subnodes : siblings))
	build (Leaf str : es) =
	        let (spill, siblings) = build es
	        in (spill, Tree str [] : siblings)
	build (Stop : es) = (es, [])
	build [] = ([], [])

In fact, the attached module implements almost verbatim the code from
an old Joe's request (circa 2000):
	http://www.cse.unsw.edu.au/~dons/haskell-1990-2000/msg06086.html
but my version is a bit more self-sufficient for the new folks who'd
like to quickly test it on their system.

Am I really ignorant of some important GHC optimization options
(tried -O2/-O3), or is this indeed a serious problem to tackle?

-- 
Lev Walkin
vlm at lionet.info
-------------- next part --------------
module Main where

-- A simple tree to hold hierarchial XML-like data structure

data Tree a = Tree a [Tree a] deriving Show

-- A member of the event sequence which attempts to build a tree.
-- For example, the following sequence
--     [Start "top", Leaf "leaf", Start "sub", Stop, Stop]
-- should correspond to the following tree:
--     Tree "top" [Tree "leaf" [], Tree "sub" []]

data TreeEvent = Start String	-- Branch off a new subtree
		| Stop		-- Stop branching and return 1 level
		| Leaf String	-- A simple leaf without children
		deriving Show

-- Lazy printing of an infinite tree building process

main = print . snd . build $ Start "top" : cycle [Leaf "sub"]

-- Convert a stream of tree building events
-- into a list of unconsumed events and a constructed tree body.

type UnconsumedEvent = TreeEvent	-- Alias for program documentation

build :: [TreeEvent] -> ([UnconsumedEvent], [Tree String])
build (Start str : es) =
	let (es', subnodes) = build es
	    (spill, siblings) = build es'
        in (spill, (Tree str subnodes : siblings))
build (Leaf str : es) =
        let (spill, siblings) = build es
        in (spill, Tree str [] : siblings)
build (Stop : es) = (es, [])
build [] = ([], [])

-- Stricter version of build, never terminates on infinite input,
-- but exhibits no space leaks whatsoever.
build' f (Start str : es) =
	let (es', subnodes) = build' id es
	in build' ((Tree str subnodes) :) es'
build' f (Leaf str : es) = build' ((Tree str []) :) es
build' f (Stop : es) = (es, f [])
build' f [] = ([], f [])


More information about the Haskell-Cafe mailing list