[Haskell-cafe] Re: XML (HXML) parsing :: fixed GHC 6.8.3 space leak from 2000

Lev Walkin vlm at lionet.info
Tue Sep 23 10:47:33 EDT 2008


This solution seem to provide a practical alternative to pusing 
datatypes for streaming XML.

http://gemo.futurs.inria.fr/events/PLANX2008/papers/p10.pdf

Lev Walkin wrote:
> Marc A. Ziegert wrote:
> 
>>>> We don't know of a good way to fix this problem.  I'm going to 
>>>> record this example in a ticket for future reference, though.
>>> Simon,
>>>
>>> is there a way, perhaps, to rewrite this expression to avoid leaks?
>>> An ad-hoc will do, perhaps split in two modules to avoid intramodular
>>> optimizations?
>>>
>>> -- 
>>> Lev Walkin
>>
>> finally... there is a way! :D
>>
>> hmm... this was a nice puzzle ;)
>>
>> i've tried several times (and hours!) to implement a Continuation (not 
>> monad) based solution, but finally i developed this tricky but elegant 
>> foldr solution...
>> i built the parser around this type:
>>   type FoldR_Builder = (TreeEvent,[UnconsumedEvent]) -> [Either 
>> [UnconsumedEvent] (Tree String)] -> [Either [UnconsumedEvent] (Tree 
>> String)]
>>
>> it is based on the following thought:
>> the tuple
>>   (rs,ps)::([Rest],[Processed]) -- with the restriction, which forces 
>> the list ps to be processed entirely before rs.
>> is equipollent to
>>   (fmap Right ps++[Left rs])::[Either [Rest] Processed]
>> , but the latter is easier to handle ...at least if you can't trust 
>> the GC.
> 
> Marc, you are my hero of the month!
> 
> I can't say I understood this solution before applying it back to
> HXML-0.2, but it surely worked and made quite observable 20%
> difference in performance:
> 
> 9.8 seconds on my 45 megabyte XML test, running in half the space
> (4m) compared to my parallel version based on Ketil Malde's suggestion
> (which was 12 seconds on two cores (though, one core was almost
> idling, `par` was used purely for its side-effect)).
> 
> To those who wants to parse XML in constant space, attached find
> a patch to HXML-0.2 which fixes a space leak.
> 
> However, I am still a bit surprized to discover there is not an order
> of magnitude difference between `par`-based version and your builder.
> 
> While the foldr-based builder is clearly superior, one can't
> help but wonder whether is it `par` that is so efficient compared
> to crunching through Eithers, or there's some other bottleneck
> in the code. Will profile a bit later.
> 
> The XML parsing space leak was declared in HXML back in 2000 and
> lingered in the code for 8 years. Good riddance!
> 
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe at haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe



More information about the Haskell-Cafe mailing list