RFC: Adding a Hashable type class and HashMap/HashSet data types to HP

Yitzchak Gale gale at sefer.org
Mon Nov 29 05:25:40 EST 2010


I wrote:
>> Were you thinking that you could serialize those things and
>> then just use murmurhash on them?
>> Try it, but my guess is
>> it won't work well. Murmurhash is designed for strings, not for
>> those things. It's not a cryptographic hash; it's designed to
>> have good statistical properties for a certain very specific types
>> of input. If you abuse it, you'll likely lose those good properties.
>> The parameters for murmerhash were tweaked using keys that
>> were English words, and very short strings of random bytes.
>> There is no evidence it will work well with other kinds of keys.

Thomas Schilling wrote:
> Where did you get this from?

>From experience developing hashes for container
types.

>  The criteria listed at
> <http://code.google.com/p/smhasher/wiki/SMHasher> all talk about
> arbitrary bytes.  MurmurHash is designed for arbitrary byte strings,
> not English words.  It is not a cryptographic hash, sure, but I can't
> see any evidence that it only works with text.

Never mind their claims, look at their test cases.
What they tested ought to work well. What they didn't
test may or may not work well, and almost
certainly will behave badly at least some of the time.

Any fast non-cryptographic hash will by necessarily have
many kinds of input that will behave badly. The idea
is to test across a large sampling of the kinds of keys
that typify the input you expect and tweak the constants
in the hash to avoid too many collisions in those cases.

As far as I could see, they tested against English words,
and against fairly short strings of random bytes. And
for "random bytes", the technique they used to generate them
is important; changing to a different pseudorandom generator
sometimes changes the results significantly.

Furthermore, a fast serialization of containers is non-random,
so you will be exercising the hash in a specific way
that has not been tested. It could be it will be good enough, or
it could be you can tweak your serialization or add another
thin layer of hashing to make it all work.

Thanks,
Yitz


More information about the Libraries mailing list