Minor inefficiency in Data.Map and Data.Set?

Edward Kmett ekmett at gmail.com
Wed Dec 23 21:23:26 EST 2009


On Wed, Dec 23, 2009 at 4:31 PM, Neil Mitchell <ndmitchell at gmail.com> wrote:

> Hi Tyson,
>
> Whenever I see the word inefficiency, I look for the benchmark that
> proves it :-)
>

Hey Neil,

I've spent quite a bit of time poring over the code in question trying to
eke out performance when I was looking at ways to get it to perform nicely
for unboxed data structures, so I'll try to serve as an independent set of
eyes.

Have you benchmarked these two alternatives? If you can show that

there is a measurable difference I'm sure someone will sit up and take

note. Personally, I know nothing about the code involved, so you'd

need to find someone who did to check your optimisation is valid


I'll vouch for the validity of the code change. Basically his code avoids
trying to do the top level rotation when gluing 2 trees that are already
within 1 level of each other together -- the invariant to be able to call
'glue'. Its a good point, and it checks out as correct that the call to
balance is superfluous. I even quickchecked about a hundred thousand trees
to make sure that the behavior doesn't change, but you can get there by
programmatic transformation to satisfy for more rigorous among us:

A 'sufficiently smart compiler' could spot that if it inlines balance in
both locations, then CSE's the comparison between sizes, that the case for
the comparison between sizes has already been done, then it could simplify
the balance cases to just construct the Bin without rebalancing, and another
CSE would then do the rest of the transformation, but it seems like a
needlessly complex path, and while I haven't gone through the resulting STG
output to see if it is actually occurring with optimizations enabled on the
containers library, I'm somewhat dubious, given the number of phases that
would have to happen in just the right order for this to be the case.

More interestingly, if you start playing around with variants on these
modules, like the one I was working on that unboxed the key/values using
some fancy view machinery, the compiler would have to work ridiculously hard
to spot the correct invariants across the inlining of both the call and the
particular view, so I've already applied this patch to unboxed-containers in
my working repository.

I think that relying on the compiler to get that right seems a lot less
straightforward than just fixing the algorithm to do the right amount of
work. Considering that it is true to the style of the rest of the code in
those modules to be explicit about what work is actually necessary and where
by calling special case rebalancing combinators only when appropriate.

(if it makes things go faster.


As for benchmarks, frankly it'll vanish in the noise, the worst case is a
few of extra stack frames getting constructed, a redundant addition, and
some unnecessary comparisons in balance once per operation, and not at each
level of the tree. I do support patching it just for correctness though.

Glue gets called internally across a very wide array of operations, and in
some sense the Data.Map/Data.Set code has become the de facto reference
implementation for how to implement trees of bounded balance, so it'd be
nice to have them be correct.

-Edward Kmett


>
> Thanks, Neil
>
> On Mon, Dec 21, 2009 at 5:38 AM, Tyson Whitehead <twhitehead at gmail.com>
> wrote:
> > I was looking at the Data.Map and Data.Set implementations and I think
> there
> > is a slight inefficiency.  Consider the Data.Map code
> >
> > {--------------------------------------------------------------------
> >  [glue l r]: glues two trees together.
> >  Assumes that [l] and [r] are already balanced with respect to each
> other.
> > --------------------------------------------------------------------}
> > glue :: Map k a -> Map k a -> Map k a
> > glue Tip r = r
> > glue l Tip = l
> > glue l r
> >  | size l > size r = let ((km,m),l') = deleteFindMax l in balance km m l'
> r
> >  | otherwise       = let ((km,m),r') = deleteFindMin r in balance km m l
> r'
> >
> > The call to balance should not be needed as subtracting one item from the
> non-
> > smaller of them cannot thrown them out of balance given they are already
> in
> > balance with one another.  The only way they could be made out of balance
> > would be to subtract an item for smaller one.  The replacement code is
> >
> > glue l r
> >  | sizeL > sizeR = let ((km,m),l') = deleteFindMax l in Bin sizeX km m l'
> r
> >  | otherwise     = let ((km,m),r') = deleteFindMin r in Bin sizeX km m l
> r'
> >  where
> >    sizeL = size l
> >    sizeR = size r
> >    sizeX = sizeL + sizeR
> >
> > The Data.Set code is identical except for the key.
> >
> > Cheers!  -Tyson
> > _______________________________________________
> > Libraries mailing list
> > Libraries at haskell.org
> > http://www.haskell.org/mailman/listinfo/libraries
> >
> _______________________________________________
> Libraries mailing list
> Libraries at haskell.org
> http://www.haskell.org/mailman/listinfo/libraries
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.haskell.org/pipermail/libraries/attachments/20091223/e94ad923/attachment-0001.html


More information about the Libraries mailing list