[Haskell-cafe] Re: XML (HXML) parsing :: fixed GHC 6.8.3 space
leak from 2000
Lev Walkin
vlm at lionet.info
Tue Sep 23 10:47:33 EDT 2008
This solution seem to provide a practical alternative to pusing
datatypes for streaming XML.
http://gemo.futurs.inria.fr/events/PLANX2008/papers/p10.pdf
Lev Walkin wrote:
> Marc A. Ziegert wrote:
>
>>>> We don't know of a good way to fix this problem. I'm going to
>>>> record this example in a ticket for future reference, though.
>>> Simon,
>>>
>>> is there a way, perhaps, to rewrite this expression to avoid leaks?
>>> An ad-hoc will do, perhaps split in two modules to avoid intramodular
>>> optimizations?
>>>
>>> --
>>> Lev Walkin
>>
>> finally... there is a way! :D
>>
>> hmm... this was a nice puzzle ;)
>>
>> i've tried several times (and hours!) to implement a Continuation (not
>> monad) based solution, but finally i developed this tricky but elegant
>> foldr solution...
>> i built the parser around this type:
>> type FoldR_Builder = (TreeEvent,[UnconsumedEvent]) -> [Either
>> [UnconsumedEvent] (Tree String)] -> [Either [UnconsumedEvent] (Tree
>> String)]
>>
>> it is based on the following thought:
>> the tuple
>> (rs,ps)::([Rest],[Processed]) -- with the restriction, which forces
>> the list ps to be processed entirely before rs.
>> is equipollent to
>> (fmap Right ps++[Left rs])::[Either [Rest] Processed]
>> , but the latter is easier to handle ...at least if you can't trust
>> the GC.
>
> Marc, you are my hero of the month!
>
> I can't say I understood this solution before applying it back to
> HXML-0.2, but it surely worked and made quite observable 20%
> difference in performance:
>
> 9.8 seconds on my 45 megabyte XML test, running in half the space
> (4m) compared to my parallel version based on Ketil Malde's suggestion
> (which was 12 seconds on two cores (though, one core was almost
> idling, `par` was used purely for its side-effect)).
>
> To those who wants to parse XML in constant space, attached find
> a patch to HXML-0.2 which fixes a space leak.
>
> However, I am still a bit surprized to discover there is not an order
> of magnitude difference between `par`-based version and your builder.
>
> While the foldr-based builder is clearly superior, one can't
> help but wonder whether is it `par` that is so efficient compared
> to crunching through Eithers, or there's some other bottleneck
> in the code. Will profile a bit later.
>
> The XML parsing space leak was declared in HXML back in 2000 and
> lingered in the code for 8 years. Good riddance!
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe at haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe
More information about the Haskell-Cafe
mailing list