[Haskell-cafe] Data.IntMap union complexity

Sun Feb 26 02:28:42 CET 2012

On 2/24/12 3:40 AM, Christoph Breitkopf wrote:
> On Fri, Feb 24, 2012 at 4:48 AM, wren ng thornton<wren at freegeek.org>  wrote:
>> When the two maps are of vastly different sizes, O(min(m,n)) is a more
>> intuitive way to think about it. Since you only have to look at the places
>> where the spines clash, that will be on the order of the smaller map.
>
>  Folding insert might still be a win if one of the maps is very much 
smaller
> than the other, but since size is O(n) for Data.IntMap, there's no way to
> find out if that's the case.

It's possible (due to the constant factors I keep mentioning), though I 
wouldn't expect it to. The reasoning why is as follows:

In the smaller map every key will have some shared prefix with other 
keys. (Excepting degenerate cases like the empty map and singleton 
maps.) As such, calling insert with two keys that share a prefix means 
that you'll be traversing the larger map's spine for that prefix twice. 
Whereas, with the merge-based implementation we only traverse once.

Moreover, when performing repeated insertions, we have to reconstruct 
the structure of the smaller map, which can lead to producing excessive 
garbage over the merge-based implementation. For example, consider the 
case where we have keys K and L in the smaller map which diverge at some 
prefix P. If P does not lay in the spine of the larger map, then we will 
have to merge P with some other prefix Q in order to produce a Bin. 
Let's say that P and Q diverge at R (thus there is a Bin at R, with 
children P and Q). After inserting K we now have a map where there is a 
spine from R to K. Now, when we insert L, since we know that P lays on 
the spine between R and K, that means we'll have to merge P with K to 
produce another Bin. The spine from R to K is now garbage--- but it 
wouldn't have been allocated in the merge-based implementation, since K 
and L would have been inserted simultaneously.

If it's really a concern, then I still say the best approach is to just 
benchmark the two options; the API gives you the tools you need. As for 
the O(n) size, you can always define your own data structure which 
memoizes the size; e.g.,

     -- N.B., the size field is lazy and therefore only computed on
     -- demand, yet it is shared and so only costs O(n) the first
     -- time it's accessed since the map was updated.
     data SizedIntMap a = SIM Int !(IntMap a)

     size (SIM s _) = s

     insert k v (SIM _ m) = SIM (IM.size m') m'
         where m' = IM.insert k v m

     ...

It's a lot of boilerplate, and it's be nice if that boilerplate were 
provided once and for all by containers (as Data.IntMap.Sized, or the 
like), but it's simple enough to do.

-- 
Live well,
~wren