[Haskell-cafe] Data.IntMap union complexity
wren ng thornton
wren at freegeek.org
Sun Feb 26 02:28:42 CET 2012
On 2/24/12 3:40 AM, Christoph Breitkopf wrote:
> On Fri, Feb 24, 2012 at 4:48 AM, wren ng thornton <wren at freegeek.org> wrote:
>> When the two maps are of vastly different sizes, O(min(m,n)) is a more
>> intuitive way to think about it. Since you only have to look at the places
>> where the spines clash, that will be on the order of the smaller map.
> Folding insert might still be a win if one of the maps is very much
> smaller than the other, but since size is O(n) for Data.IntMap, there's
> no way to find out if that's the case.
It's possible (due to the constant factors I keep mentioning), though I
wouldn't expect it to be. The reasoning is as follows:
In the smaller map every key will have some shared prefix with other
keys. (Excepting degenerate cases like the empty map and singleton
maps.) As such, calling insert with two keys that share a prefix means
that you'll be traversing the larger map's spine for that prefix twice.
Whereas, with the merge-based implementation, that prefix is traversed
only once.
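For reference, the two candidates can be sketched as follows (`unionViaInserts` is a made-up name for the fold-based approach, not a containers function):

```haskell
import qualified Data.IntMap as IM
import Data.IntMap (IntMap)

-- Merge-based union: walks both spines together, so shared
-- prefixes in the larger map are traversed at most once.
unionMerge :: IntMap a -> IntMap a -> IntMap a
unionMerge = IM.union

-- Fold-based union: inserts each binding of the first (ideally
-- smaller) map into the second, re-traversing the second map's
-- spine once per key.
unionViaInserts :: IntMap a -> IntMap a -> IntMap a
unionViaInserts small big = IM.foldrWithKey IM.insert big small
```

Both are biased toward the first map's values on duplicate keys, so they compute the same result; only the traversal and allocation behavior differs.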
Moreover, when performing repeated insertions, we have to reconstruct
the structure of the smaller map, which can lead to producing excessive
garbage over the merge-based implementation. For example, consider the
case where we have keys K and L in the smaller map which diverge at some
prefix P. If P does not lie in the spine of the larger map, then we will
have to merge P with some other prefix Q in order to produce a Bin.
Let's say that P and Q diverge at R (thus there is a Bin at R, with
children P and Q). After inserting K we now have a map where there is a
spine from R to K. Now, when we insert L, since we know that P lies on
the spine between R and K, we'll have to merge P with K to
produce another Bin. The spine from R to K is now garbage --- but it
wouldn't have been allocated in the merge-based implementation, since K
and L would have been inserted simultaneously.
If it's really a concern, then I still say the best approach is to just
benchmark the two options; the API gives you the tools you need. As for
the O(n) size, you can always define your own data structure which
memoizes the size; e.g.,
-- N.B., the size field is lazy and therefore only computed on
-- demand, yet it is shared and so only costs O(n) the first
-- time it's accessed since the map was updated.
data SizedIntMap a = SIM Int !(IntMap a)
size (SIM s _) = s
insert k v (SIM _ m) = SIM (IM.size m') m'
where m' = IM.insert k v m
It's a lot of boilerplate, and it would be nice if that boilerplate were
provided once and for all by containers (as Data.IntMap.Sized, or the
like), but it's simple enough to do.
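Filled out a little (the `empty`, `delete`, and `fromList` wrappers are illustrative additions of mine, not anything containers provides):

```haskell
import qualified Data.IntMap as IM
import Data.IntMap (IntMap)

-- The size field is left lazy so it is only computed on demand
-- (and then shared), while the map itself is kept strict.
data SizedIntMap a = SIM Int !(IntMap a)

empty :: SizedIntMap a
empty = SIM 0 IM.empty

size :: SizedIntMap a -> Int
size (SIM s _) = s

insert :: Int -> a -> SizedIntMap a -> SizedIntMap a
insert k v (SIM _ m) = SIM (IM.size m') m'
  where m' = IM.insert k v m

delete :: Int -> SizedIntMap a -> SizedIntMap a
delete k (SIM _ m) = SIM (IM.size m') m'
  where m' = IM.delete k m

fromList :: [(Int, a)] -> SizedIntMap a
fromList = foldr (uncurry insert) empty
```

Every other IntMap operation would need a similar one-line wrapper, which is the boilerplate being complained about.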