Proposal: containers: add indexing operations for Set

Tue Jul 10 17:31:00 CEST 2012

Hi,

> On Tue, Jul 10, 2012 at 1:59 AM, Johan Tibell <johan.tibell at gmail.com> wrote:
> > On Thu, Jul 5, 2012 at 8:01 AM, Milan Straka <fox at ucw.cz> wrote:
> >> Any balanced tree can support indexing API, as long as every subtree
> >> stores its size. That is obviously trivially true for size-balanced
> >> trees, as the subtree sizes are used for balancing. If for example AVL
> >> trees were used, sizes would have to be stored explicitly in every node
> >> for indexing API to work fast.
> >
> > There might be some other benefits to adding the size as well. IIRC
> > union is faster if the first set is smaller than the second set. If we
> > can look up the sizes in O(1) time we can rearrange the arguments to
> > make sure this is the case.
> 
> I've always wondered why Data.Map doesn't do this by default.  E.g. I have:
> 
> -- | Map.union says it's more efficient with @big `union` small@, so this one
> -- flips the args to be more efficient.  It's still left-biased.
> union2 :: (Ord k) => Map.Map k v -> Map.Map k v -> Map.Map k v
> union2 m1 m2
>     | Map.size m1 < Map.size m2 = m2 `right` m1
>     | otherwise = m1 `Map.union` m2
>     where right = Map.unionWith (\_ a -> a)

there are several problems with this.

Most notably, it is not true that "set1 < set2" is always better than
"set1 > set2" -- if you run the
containers/benchmarks/SetOperations/bench-SetOperations-Map benchmark,
you can see that for different input distributions, "set1 < set2" is
sometimes faster and sometimes slower than "set1 > set2". For various
input distributions, difference between "set1 < set2" and "set1 > set2"
are -7.5%, -25%, -35%, -17%, +20%, +8%, -13%.

Also union is more efficient than (unionWith const), because is
preserves sharing of the nodes present in both sets. That could be
solved by creating another implementation of right-biased unionR which
also preserves sharing.

Probably the best idea is just to remove the documentation comment about
"bigset `union` smallset" being faster, as it is not true :)

Cheers,
Milan