Data.HashTable.hashInt seems somewhat sub-optimal

Tue Aug 28 11:41:22 EDT 2007

Sigh.

I've included a fix below, but I don't have a standard GHC checkout  
available to me (only the test sandbox I used to get the code in the  
first place) so I haven't generated a patch.  Note the mentioned  
"reasonability tests", which hopefully should forestall such obvious  
infelicities in future.

On Aug 26, 2007, at 1:42 PM, Thorkil Naur wrote:

> Hello,
>
> On Monday 20 August 2007 13:15, Ian Lynagh wrote:
>> ...
>> I'm also suspicious of this, though:
>>
>>     -- | A sample hash function for Strings.  We keep multiplying  
>> by the
>>     -- golden ratio and adding.
>>     --
>>     -- Note that this has not been extensively tested for  
>> reasonability,
>>     -- but Knuth argues that repeated multiplication by the golden  
>> ratio
>>     -- will minimize gaps in the hash space.
>>     hashString :: String -> Int32
>>     hashString = foldl' f 0
>>       where f m c = fromIntegral (ord c + 1) * golden + mulHi m  
>> golden
>>
>> should this be
>>
>>     where f m c = (fromIntegral (ord c + 1) + m) * golden
>>
>> ? Does Knuth (TAOCP?) say?
>
> In the 2nd edition of Knuth's The Art of Computer Programming, Vol  
> 3, Sorting
> and Searching there is a discussion of hash functions on pp.  
> 514-520. One of
> the techniques suggested for hashing a one-word (i.e. essentially  
> fixed-size)
> key is the following multiplicative scheme:
>
>   h(K) = floor ( M*(((A/w)*K)) mod 1) )
>
> where w is the word-size (say, 2^32), M is the desired limit of the  
> hash
> function (for efficiency, probably a suitable power of 2) and,  
> finally, A is
> some integer constant. What happens here is that we consider the  
> (word) K as
> a fraction with the binary point at the left end of the word rather  
> than at
> the right, thus getting a fraction with a value between 0 and 1.  
> This value
> we then multiply by A and cut off the integer part, once again  
> getting a
> fractional value between 0 and 1. And finally, we multiply by M and  
> cut away
> the fractional part to get an integer value between 0 and M-1. And,  
> sure,
> Knuth suggests various variants of selecting the multiplier A  
> related to the
> golden ratio (sqrt(5)-1)/2 = 0.6180... to gain suitable spreading  
> of hashes
> for keys in arithmetic progressions. (K, K+d, K+2d, ...).

In the fix below I ended up using twice the golden ratio (a value of  
A greater than one).  The inverse of the golden ratio (which is 1 +  
golden) didn't work well at all.

> But what we are dealing with in the hashString function is what  
> Knuth would
> call a multiword or variable-length key. Such cases, Knuth  
> suggests, "can be
> handled by multiple-precision extensions of [e.g. the  
> multiplicative scheme]
> above, but it is generally adequate to speed things up by combining  
> the
> individual words together into a single word, then doing a single
> multiplication ... as above."

But combining things into a single word requires having a good  
combining mechanism, which is hard in general---particularly with a  
type like Char that appears to have a large range but in practice  
only has a very small one.

> Neither of the above definitions of f implement a multiple- 
> precision extension
> of the multiplicative hashing scheme that involves the golden ration.

I was disinclined to compute the golden ratio out to a number of  
digits suitable to combine together all the list elements, and I'm  
skeptical it'd actually work well if I did (I think we'd lose most of  
the information from the low-order bits, actually).  But I'd invite  
someone to give it a try.

> And
> none of the methods suggested by Knuth for combining multiple words  
> into
> single words or otherwise compute hashes for multiword keys involve  
> the
> golden ration.

He suggests adding or xoring the elements of a string together before  
hashing.  In practice this has a bunch of known terrible failure  
modes.  We rather particularly don't want to use a commutative  
combining operator, or (as he notes) "XY" and "YX" will have the same  
hash.  I tried various variations on string hashing before arriving  
at the one below (which is not too different from the old one, but  
contains an unrelated pre-multiplier).

I actually did do a bakeoff between multiple hashing schemes, and  
have a version of Data.HashTable that requires separate import of a  
hashing technique.  The multiplicative hash worked better in the  
bakeoff than either the modulus hash used in older versions of the  
library, or a couple of versions of the Bob Jenkins hash.

Could I convince someone with a checked out repository to cut and  
paste this into Data.HashTable and generate a patch / check it in?

Thanks,

-Jan

Here's the fix:

--  
------------------------------------------------------------------------ 
-----
-- Sample hash functions

-- $hash_functions
--
-- This implementation of hash tables uses the low-order /n/ bits of  
the hash
-- value for a key, where /n/ varies as the hash table grows.  A good  
hash
-- function therefore will give an even distribution regardless of /n/.
--
-- If your keyspace is integrals such that the low-order bits between
-- keys are highly variable, then you could get away with using  
'fromIntegral'
-- as the hash function.
--
-- We provide some sample hash functions for 'Int' and 'String' below.

golden :: Int32
golden = 1013904242 -- = round ((sqrt 5 - 1) * 2^32) :: Int32
-- was -1640531527 = round ((sqrt 5 - 1) * 2^31) :: Int32
-- but that has bad mulHi properties (even adding 2^32 to get its  
inverse)
-- Whereas the above works well and contains no hash duplications for
-- [-32767..65536]

hashInt32 :: Int32 -> Int32
hashInt32 x = mulHi x golden + x

-- | A sample (and useful) hash function for Int and Int32,
-- implemented by extracting the uppermost 32 bits of the 64-bit
-- result of multiplying by a 33-bit constant.  The constant is from
-- Knuth, derived from the golden ratio:
-- > golden = round ((sqrt 5 - 1) * 2^32)
-- We get good key uniqueness on small inputs
-- (a problem with previous versions):
--  (length $ group $ sort $ map hashInt [-32767..65536]) == 65536 +  
32768
--
hashInt :: Int -> Int32
hashInt x = hashInt32 (fromIntegral x)

-- hi 32 bits of a x-bit * 32 bit -> 64-bit multiply
mulHi :: Int32 -> Int32 -> Int32
mulHi a b = fromIntegral (r `shiftR` 32)
   where r :: Int64
         r = fromIntegral a * fromIntegral b

-- | A sample hash function for Strings.  We keep multiplying by the
-- golden ratio and adding.  The implementation is:
--
-- > hashString = foldl' f golden
-- >   where f m c = fromIntegral (fromEnum c) * magic + hashInt32 m
-- >         magic = 0xdeadbeef
--
-- Where hashInt32 works just as hashInt shown above.
--
-- Knuth argues that repeated multiplication by the golden ratio
-- will minimize gaps in the hash space, and thus it's a good choice
-- for combining together multiple keys to form one.
--
-- Here we know that individual characters c are often small, and this
-- produces frequent collisions if we use fromEnum c alone.  A
-- particular problem are the shorter low ASCII and ISO-8859-1
-- character strings.  We pre-multiply by a magic twiddle factor to
-- obtain a good distribution.  In fact, given the following test:
--
-- > testp :: Int32 -> Int
-- > testp k = (n - ) . length . group . sort . map hs . take n $ ls
-- >   where ls = [] : [c : l | l <- ls, c <- ['\0'..'\xff']]
-- >         hs = foldl' f golden
-- >         f m c = fromIntegral (fromEnum c) * k + hashInt32 m
-- >         n = 100000
--
-- We discover that testp magic = 0.

hashString :: String -> Int32
hashString = foldl' f golden
   where f m c = fromIntegral (fromEnum c) * magic + hashInt32 m
         magic = 0xdeadbeef