RFC: Adding a Hashable type class and HashMap/HashSet data types
nominolo at googlemail.com
Thu Nov 25 11:15:49 EST 2010
On 25 November 2010 13:10, Yitzchak Gale <gale at sefer.org> wrote:
> I wrote:
>>> More interesting are Maybe, Either, lists, tuples, Data.Map, Data.Set,
>>> etc. These are far from trivial; I did some work on it for Python. If you
>>> think you can just do something fairly simple and reasonable and hope
>>> that it will have the right statistical properties, I promise you that it won't
> Thomas Schilling wrote:
>> Are you referring to the hash function or to the fact that you want
>> structurally different types to map to different hashes?
> The former. In Python, where objects of varying types will be inserted
> into the same hash table, the latter is also important. But why should
> we care about that in Haskell? If people circumvent the type system
> by serializing before hashing, I don't think we can help them very much.
> Or perhaps I am misunderstanding you.
Right, murmurhash2 is a very good and well-tested hash function.
There is currently a beta version version 3 because a weakness has
been found for some cases ("repetitions of 4-byte patterns have a much
higher chance of collision than expected"). I think if we base our
hash function on Murmur3, we don't need to worry much about the rest.
Regarding hashing of ADTs, I agree with you that hash () == hash False
should be no problem. The general principles would be:
1. prop_hashEq x y = x == y ==> hash x == hash y
and for a data type with constructors C_1 .. C_n
2. hash (C_i x_i1 ... x_iN) = hash i <> hash x_i1 <> ... <> hash x_iN
where <> is the hash combine function (probably (.))
Principle (1) may require to use some form of normalisation before
applying principle (2), e.g., for Set, differently balanced trees
should still produce the same hash.
Does this look reasonable?
Also, I'm fine with the return of hash being an architecture-specific
Word type, but we also need variants with a known bit-width of the
resulting hash. For example, a Word32 hash has way more collisions
than a Word64 hash, and this is important for some applications.
Push the envelope. Watch it bend.
More information about the Libraries