Data.HashTable.hashInt seems somewhat sub-optimal

Jan-Willem Maessen jmaessen at alum.mit.edu
Wed Aug 29 20:08:51 EDT 2007


On Aug 29, 2007, at 6:38 PM, Ian Lynagh wrote:

>
> Hi,
>
> On Tue, Aug 28, 2007 at 11:41:22AM -0400, Jan-Willem Maessen wrote:
>>
>> golden :: Int32
>> golden = 1013904242 -- = round ((sqrt 5 - 1) * 2^32) :: Int32
>> -- was -1640531527 = round ((sqrt 5 - 1) * 2^31) :: Int32
>> -- but that has bad mulHi properties (even adding 2^32 to get its   
>> inverse)
>> -- Whereas the above works well and contains no hash duplications for
>> -- [-32767..65536]
>>
>> hashInt32 :: Int32 -> Int32
>> hashInt32 x = mulHi x golden + x
>
> This gives
>
>> map hashInt [0..16]
> [0,1,2,3,4,6,7,8,9,11,12,13,14,16,17,18,19]
>
>> --  (length $ group $ sort $ map hashInt [-32767..65536]) == 65536  
>> +  
>> 32768
>
> This test also passes for the
>
>     golden :: Int32
>     golden = -1640531527
>
>     hashInt :: Int -> Int32
>     hashInt x = fromIntegral x * golden
>
> implementation, which has a very pretty distribution; graph at the
> bottom of
>     http://www.brpreiss.com/books/opus4/html/page214.html

Recall that we're using the low-order bits of the hash code to index  
into the table.  If the keys are always, say, multiples of 8 then the  
hash codes will always be multiples of 8 as well.  We usually  
compensate in this case by using the *high-order* bits of the hash  
code for hash table indexing.

I seem to recall considering this, but discovering that there were  
naive hash functions floating about somewhere that meant it would be  
a bad idea in practice (only the low bits would contain any data, and  
we'd hash everything to 0).

I admit that I didn't think about just post-multiplying the result of  
the passed-in hash function by golden before looking at high bits.   
That of course results in gratuitous work if we've actually taken  
pains to design a decent hash function.

>> hashString :: String -> Int32
>> hashString = foldl' f golden
>>   where f m c = fromIntegral (fromEnum c) * magic + hashInt32 m
>>         magic = 0xdeadbeef
>
> Why use magic rather than golden?

It didn't work as well, that's all:

*Data.HashTable.Multiplicative> testp golden
969
*Data.HashTable.Multiplicative> testp 0xdeadbeef
0

Having had the "hash table superstitions" conversation with a  
colleague several times, my hypothesis is that we want to choose  
unrelated multipliers in this case.  Trying testp restricted to a  
modulus of 2^18 shows that it's probably a bit of a wash in practice.


> This makes sense to me:
>
>     hashString :: String -> Int32
>     hashString = foldl' f golden
>       where f m c = (fromIntegral (ord c) `xor` m) * golden
>
> Is anything obviously wrong with it?

Make that xor a + and it seems to work fine (carry chains are your  
friends), yielding fewer collisions in the 2^18 case.  I'm much less  
worried that all your characters are multiples of 2^k.  I'm assuming  
you mean (-1640531527) for golden.

I haven't actually tried performance bakeoffs for any of these.   
Perhaps the original poster has an actual application in mind where  
the actual difference could be observed?  That was what prompted my  
original revisions to the library in the first place.  My instinct is  
that mulHi should actually be cheaper than it appears on most  
architectures, and so the computational cost should be pretty  
comparable, but your hashString is obviously cheaper to evaluate.

-Jan





More information about the Libraries mailing list