Data.HashTable.hashInt seems somewhat sub-optimal
Jan-Willem Maessen
jmaessen at alum.mit.edu
Wed Aug 29 20:08:51 EDT 2007
On Aug 29, 2007, at 6:38 PM, Ian Lynagh wrote:
>
> Hi,
>
> On Tue, Aug 28, 2007 at 11:41:22AM -0400, Jan-Willem Maessen wrote:
>>
>> golden :: Int32
>> golden = 1013904242 -- = round ((sqrt 5 - 1) * 2^32) :: Int32
>> -- was -1640531527 = round ((sqrt 5 - 1) * 2^31) :: Int32
>> -- but that has bad mulHi properties (even adding 2^32 to get its
>> inverse)
>> -- Whereas the above works well and contains no hash duplications for
>> -- [-32767..65536]
>>
>> hashInt32 :: Int32 -> Int32
>> hashInt32 x = mulHi x golden + x
>
> This gives
>
>> map hashInt [0..16]
> [0,1,2,3,4,6,7,8,9,11,12,13,14,16,17,18,19]
>
>> -- (length $ group $ sort $ map hashInt [-32767..65536]) == 65536
>> +
>> 32768
>
> This test also passes for the
>
> golden :: Int32
> golden = -1640531527
>
> hashInt :: Int -> Int32
> hashInt x = fromIntegral x * golden
>
> implementation, which has a very pretty distribution; graph at the
> bottom of
> http://www.brpreiss.com/books/opus4/html/page214.html
Recall that we're using the low-order bits of the hash code to index
into the table. If the keys are always, say, multiples of 8 then the
hash codes will always be multiples of 8 as well. We usually
compensate in this case by using the *high-order* bits of the hash
code for hash table indexing.
I seem to recall considering this, but discovering that there were
naive hash functions floating about somewhere that meant it would be
a bad idea in practice (only the low bits would contain any data, and
we'd hash everything to 0).
I admit that I didn't think about just post-multiplying the result of
the passed-in hash function by golden before looking at high bits.
That of course results in gratuitous work if we've actually taken
pains to design a decent hash function.
>> hashString :: String -> Int32
>> hashString = foldl' f golden
>> where f m c = fromIntegral (fromEnum c) * magic + hashInt32 m
>> magic = 0xdeadbeef
>
> Why use magic rather than golden?
It didn't work as well, that's all:
*Data.HashTable.Multiplicative> testp golden
969
*Data.HashTable.Multiplicative> testp 0xdeadbeef
0
Having had the "hash table superstitions" conversation with a
colleague several times, my hypothesis is that we want to choose
unrelated multipliers in this case. Trying testp restricted to a
modulus of 2^18 shows that it's probably a bit of a wash in practice.
> This makes sense to me:
>
> hashString :: String -> Int32
> hashString = foldl' f golden
> where f m c = (fromIntegral (ord c) `xor` m) * golden
>
> Is anything obviously wrong with it?
Make that xor a + and it seems to work fine (carry chains are your
friends), yielding fewer collisions in the 2^18 case. I'm much less
worried that all your characters are multiples of 2^k. I'm assuming
you mean (-1640531527) for golden.
I haven't actually tried performance bakeoffs for any of these.
Perhaps the original poster has an actual application in mind where
the actual difference could be observed? That was what prompted my
original revisions to the library in the first place. My instinct is
that mulHi should actually be cheaper than it appears on most
architectures, and so the computational cost should be pretty
comparable, but your hashString is obviously cheaper to evaluate.
-Jan
More information about the Libraries
mailing list