Minor inefficiency in Data.Map and Data.Set?

Simon Marlow marlowsd at gmail.com
Fri Jan 22 07:00:51 EST 2010


On 27/12/2009 18:24, Neil Mitchell wrote:
> Hi,
>
> Sounds like we should apply then - unless anyone has a really good
> reason why not to. The code should be simple and correct, and we
> shouldn't ever rely on things like CSE.

On the other hand, it sounds like it adds a new invariant, which at the 
least should be documented in comments.

If someone sends a darcs patch, I'll apply.

Cheers,
	Simon


> Thanks, Neil
>
> 2009/12/24 Edward Kmett<ekmett at gmail.com>:
>> On Wed, Dec 23, 2009 at 4:31 PM, Neil Mitchell<ndmitchell at gmail.com>  wrote:
>>>
>>> Hi Tyson,
>>>
>>> Whenever I see the word inefficiency, I look for the benchmark that
>>> proves it :-)
>>
>> Hey Neil,
>> I've spent quite a bit of time poring over the code in question trying to eke out performance when I was looking at ways to get it to perform nicely for unboxed data structures, so I'll try to serve as an independent set of eyes.
>>>
>>> Have you benchmarked these two alternatives? If you can show that
>>>
>>> there is a measurable difference I'm sure someone will sit up and take
>>>
>>> note. Personally, I know nothing about the code involved, so you'd
>>>
>>> need to find someone who did to check your optimisation is valid
>>
>> I'll vouch for the validity of the code change. Basically his code avoids trying to do the top level rotation when gluing 2 trees that are already within 1 level of each other together -- the invariant to be able to call 'glue'. Its a good point, and it checks out as correct that the call to balance is superfluous. I even quickchecked about a hundred thousand trees to make sure that the behavior doesn't change, but you can get there by programmatic transformation to satisfy for more rigorous among us:
>> A 'sufficiently smart compiler' could spot that if it inlines balance in both locations, then CSE's the comparison between sizes, that the case for the comparison between sizes has already been done, then it could simplify the balance cases to just construct the Bin without rebalancing, and another CSE would then do the rest of the transformation, but it seems like a needlessly complex path, and while I haven't gone through the resulting STG output to see if it is actually occurring with optimizations enabled on the containers library, I'm somewhat dubious, given the number of phases that would have to happen in just the right order for this to be the case.
>> More interestingly, if you start playing around with variants on these modules, like the one I was working on that unboxed the key/values using some fancy view machinery, the compiler would have to work ridiculously hard to spot the correct invariants across the inlining of both the call and the particular view, so I've already applied this patch to unboxed-containers in my working repository.
>> I think that relying on the compiler to get that right seems a lot less straightforward than just fixing the algorithm to do the right amount of work. Considering that it is true to the style of the rest of the code in those modules to be explicit about what work is actually necessary and where by calling special case rebalancing combinators only when appropriate.
>>>
>>> (if it makes things go faster.
>>
>> As for benchmarks, frankly it'll vanish in the noise, the worst case is a few of extra stack frames getting constructed, a redundant addition, and some unnecessary comparisons in balance once per operation, and not at each level of the tree. I do support patching it just for correctness though.
>> Glue gets called internally across a very wide array of operations, and in some sense the Data.Map/Data.Set code has become the de facto reference implementation for how to implement trees of bounded balance, so it'd be nice to have them be correct.
>> -Edward Kmett
>>>
>>>
>>> Thanks, Neil
>>>
>>> On Mon, Dec 21, 2009 at 5:38 AM, Tyson Whitehead<twhitehead at gmail.com>  wrote:
>>>> I was looking at the Data.Map and Data.Set implementations and I think there
>>>> is a slight inefficiency.  Consider the Data.Map code
>>>>
>>>> {--------------------------------------------------------------------
>>>>   [glue l r]: glues two trees together.
>>>>   Assumes that [l] and [r] are already balanced with respect to each other.
>>>> --------------------------------------------------------------------}
>>>> glue :: Map k a ->  Map k a ->  Map k a
>>>> glue Tip r = r
>>>> glue l Tip = l
>>>> glue l r
>>>>   | size l>  size r = let ((km,m),l') = deleteFindMax l in balance km m l' r
>>>>   | otherwise       = let ((km,m),r') = deleteFindMin r in balance km m l r'
>>>>
>>>> The call to balance should not be needed as subtracting one item from the non-
>>>> smaller of them cannot thrown them out of balance given they are already in
>>>> balance with one another.  The only way they could be made out of balance
>>>> would be to subtract an item for smaller one.  The replacement code is
>>>>
>>>> glue l r
>>>>   | sizeL>  sizeR = let ((km,m),l') = deleteFindMax l in Bin sizeX km m l' r
>>>>   | otherwise     = let ((km,m),r') = deleteFindMin r in Bin sizeX km m l r'
>>>>   where
>>>>     sizeL = size l
>>>>     sizeR = size r
>>>>     sizeX = sizeL + sizeR
>>>>
>>>> The Data.Set code is identical except for the key.
>>>>
>>>> Cheers!  -Tyson
>>>> _______________________________________________
>>>> Libraries mailing list
>>>> Libraries at haskell.org
>>>> http://www.haskell.org/mailman/listinfo/libraries
>>>>
>>> _______________________________________________
>>> Libraries mailing list
>>> Libraries at haskell.org
>>> http://www.haskell.org/mailman/listinfo/libraries
>>
>>
> _______________________________________________
> Libraries mailing list
> Libraries at haskell.org
> http://www.haskell.org/mailman/listinfo/libraries



More information about the Libraries mailing list