enummapset merge into containers
wren ng thornton
wren at freegeek.org
Wed Feb 22 05:52:13 CET 2012
On 2/21/12 3:58 PM, Ben Gamari wrote:
> In light of all of this, it was suggested by Edward Kmett that perhaps a
> HashMap would make sense. In my understanding, he proposed that the cost
> of hashing a short string could be dwarfed by the potential cache misses
> of a deep tree lookup. I haven't tested this hypothesis, but it seems
> plausible. Do you think this trade-off could make sense?
I'm in a similar situation for a natural language processing project. In
my case I keep around a (HashMap,IntMap) bijection between strings and
their integerization, and then have something like EnumMap which is used
elsewhere since the Ints are newtyped. This approach has been more than
fast enough for my needs; i.e., the performance bottlenecks are
elsewhere so far as I can tell. I too often have to deal with pairs,
triples, etc., of IDs--- but I need to do so with a trie's API, so,
alas, I can't use a HashMap for that part.
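A minimal sketch of the (HashMap,IntMap) bijection described above, with names invented for illustration. The original uses HashMap from unordered-containers for the forward direction; Data.Map stands in here only so the sketch depends on nothing beyond containers, and the newtyped-Int keys are represented by a hypothetical SymId:

```haskell
-- Illustrative interning table: a bijection between strings and
-- sequentially allocated Int keys. Data.Map stands in for the
-- HashMap used in the original; SymId is a hypothetical newtype.
import qualified Data.Map.Strict as M
import qualified Data.IntMap.Strict as IM

newtype SymId = SymId Int deriving (Eq, Show)

data SymTable = SymTable
  { toId   :: M.Map String Int   -- string -> key
  , fromId :: IM.IntMap String   -- key -> string
  , nextId :: !Int               -- next fresh key, handed out sequentially
  }

emptyTable :: SymTable
emptyTable = SymTable M.empty IM.empty 0

-- Look up a string's key, interning the string if it is new.
intern :: String -> SymTable -> (SymId, SymTable)
intern s t@(SymTable fwd bwd n) =
  case M.lookup s fwd of
    Just i  -> (SymId i, t)
    Nothing -> ( SymId n
               , SymTable (M.insert s n fwd) (IM.insert n s bwd) (n + 1) )

-- Recover the string behind a key.
resolve :: SymId -> SymTable -> Maybe String
resolve (SymId i) (SymTable _ bwd _) = IM.lookup i bwd
```

Since keys come out of nextId in order [0..], all lookups through fromId hit the compact-subspace case discussed in the footnote below.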
I don't think that giving out keys sequentially is the worst case for
IntMap[1].
In any case, I've been meaning to post my project to Hackage for about a
year now. I was derailed a bit after I gave a talk about my EDSL for
smoothing probability distributions and Ed Kmett asked me to factor it
out into a separate package so he could use it. I've since broken the
project up into a number of smaller more-reusable packages--- which is
good in the long run, but I haven't had the time to put the pieces back
together in order to post it to Hackage. Other than the smoothing EDSL,
one of the packages is just for interning and integerizing strings.
Since you're doing the same thing, we should talk off-list about API
issues so that we can combine our work into a single package.
[1] At any given point all the allocated keys will form a compact
subspace of Int which all share the high-order bits. Since IntMap is a
big-endian trie with node-fusion, that means you'll have a complete tree
for the low-order bits and only one comparison for the long spine of all
the high-order bits. The number of comparisons you need to make is
always O(B) where B is the number of bits that differ among the
subspace, but since the subspace is compact that means you're making the
most-efficient use of the differing bits and so B = O(log N) where N is
the number of allocated keys. In fact, this is one of the best cases for
IntMap. The worst case is if you hand out keys in an order which
maximizes B (e.g., handing out keys in an order like
00000,10000,01000,00100,00010,...), since that means B will be min(N,W)
where W is the word-size.
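The contrast between the two orderings can be made concrete by counting B directly: a bit position differs across a key set iff it is set in some key (the OR of the keys) but not in all of them (the AND). A small sketch, with illustrative names:

```haskell
-- Two key streams for an IntMap (illustrative only). Sequential keys
-- vary only in their low-order bits; keys that each set a single new
-- bit (the pathological ordering above) add a differing bit per key.
import Data.Bits (bit, popCount, xor, (.|.), (.&.))

sequentialKeys, singleBitKeys :: Int -> [Int]
sequentialKeys n = [0 .. n - 1]              -- B = O(log n)
singleBitKeys  n = 0 : map bit [0 .. n - 2]  -- B = min(n, W) (roughly)

-- Count the bit positions that differ across a non-empty key set:
-- the bits set in the OR of the keys but not in their AND.
differingBits :: [Int] -> Int
differingBits ks = popCount (foldl1 (.|.) ks `xor` foldl1 (.&.) ks)
```

For 1024 sequential keys differingBits gives 10 (= log2 1024), while 64 single-bit keys already spread across 63 bit positions.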
Any time you ensure that B = O(log N) it will be a best case. For the
case of sequential keys this amounts to having a complete tree, but for
other cases it amounts to having an essentially complete tree ---where
by "essentially" we mean that the incompleteness always comes in long
shared spines, such as the spine above the complete tree for the
sequentially ordered case. If you gave out keys in the order (map
bitwiseReverse [0..]) then you'd get an essentially-complete tree at the
top, with long spines coming off for each of the leaves. You could also
come up with other crazy orderings where you use, say, the top three and
bottom three bits but hold the middle bits fixed.
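One way bitwiseReverse might look (this is an illustrative definition, not from any library): mirror all W bits of a word, so that (map bitwiseReverse [0..]) varies the high-order bits first.

```haskell
-- Illustrative bit-mirroring for the (map bitwiseReverse [0..])
-- ordering above: bit 0 of the input lands in the highest position,
-- bit 1 in the next-highest, and so on. Word avoids sign-bit issues.
import Data.Bits (finiteBitSize, shiftL, shiftR, (.|.), (.&.))

bitwiseReverse :: Word -> Word
bitwiseReverse x = foldl step 0 [0 .. finiteBitSize x - 1]
  where
    -- shift the accumulator up and append bit i of x
    step acc i = (acc `shiftL` 1) .|. ((x `shiftR` i) .&. 1)
```

Since mirroring is an involution, applying it twice gives the key back, which also makes the correspondence between the two "one non-varying spine" orderings easy to see.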
The greater the number of discontiguous regions of variance the worse
the performance will be, because even though we can still make it B =
O(log N) we're increasing the constant factor since we have to do a
comparison for each of the non-varying spines. Thus, [0..] or (map
bitwiseReverse [0..]) are the best since they have only one non-varying
spine. As described in the paper, Okasaki chose to use a big-endian trie
instead of a little-endian trie because the constant factors are better
for the [0..] case often used in projects that need to hand out unique
IDs like machine learning, natural language processing, and compilers.
--
Live well,
~wren