Data.HashTable.hashInt seems somewhat sub-optimal
Jan-Willem Maessen
jmaessen at alum.mit.edu
Tue Aug 28 11:41:22 EDT 2007
Sigh.
I've included a fix below, but I don't have a standard GHC checkout
available to me (only the test sandbox I used to get the code in the
first place) so I haven't generated a patch. Note the mentioned
"reasonability tests", which hopefully should forestall such obvious
infelicities in future.
On Aug 26, 2007, at 1:42 PM, Thorkil Naur wrote:
> Hello,
>
> On Monday 20 August 2007 13:15, Ian Lynagh wrote:
>> ...
>> I'm also suspicious of this, though:
>>
>> -- | A sample hash function for Strings. We keep multiplying
>> by the
>> -- golden ratio and adding.
>> --
>> -- Note that this has not been extensively tested for
>> reasonability,
>> -- but Knuth argues that repeated multiplication by the golden
>> ratio
>> -- will minimize gaps in the hash space.
>> hashString :: String -> Int32
>> hashString = foldl' f 0
>> where f m c = fromIntegral (ord c + 1) * golden + mulHi m
>> golden
>>
>> should this be
>>
>> where f m c = (fromIntegral (ord c + 1) + m) * golden
>>
>> ? Does Knuth (TAOCP?) say?
>
> In the 2nd edition of Knuth's The Art of Computer Programming, Vol
> 3, Sorting
> and Searching there is a discussion of hash functions on pp.
> 514-520. One of
> the techniques suggested for hashing a one-word (i.e. essentially
> fixed-size)
> key is the following multiplicative scheme:
>
> h(K) = floor ( M*(((A/w)*K)) mod 1) )
>
> where w is the word-size (say, 2^32), M is the desired limit of the
> hash
> function (for efficiency, probably a suitable power of 2) and,
> finally, A is
> some integer constant. What happens here is that we consider the
> (word) K as
> a fraction with the binary point at the left end of the word rather
> than at
> the right, thus getting a fraction with a value between 0 and 1.
> This value
> we then multiply by A and cut off the integer part, once again
> getting a
> fractional value between 0 and 1. And finally, we multiply by M and
> cut away
> the fractional part to get an integer value between 0 and M-1. And,
> sure,
> Knuth suggests various variants of selecting the multiplier A
> related to the
> golden ratio (sqrt(5)-1)/2 = 0.6180... to gain suitable spreading
> of hashes
> for keys in arithmetic progressions. (K, K+d, K+2d, ...).
In the fix below I ended up using twice the golden ratio (a value of
A greater than one). The inverse of the golden ratio (which is 1 +
golden) didn't work well at all.
> But what we are dealing with in the hashString function is what
> Knuth would
> call a multiword or variable-length key. Such cases, Knuth
> suggests, "can be
> handled by multiple-precision extensions of [e.g. the
> multiplicative scheme]
> above, but it is generally adequate to speed things up by combining
> the
> individual words together into a single word, then doing a single
> multiplication ... as above."
But combining things into a single word requires having a good
combining mechanism, which is hard in general---particularly with a
type like Char that appears to have a large range but in practice
only has a very small one.
> Neither of the above definitions of f implement a multiple-
> precision extension
> of the multiplicative hashing scheme that involves the golden ration.
I was disinclined to compute the golden ratio out to a number of
digits suitable to combine together all the list elements, and I'm
skeptical it'd actually work well if I did (I think we'd lose most of
the information from the low-order bits, actually). But I'd invite
someone to give it a try.
> And
> none of the methods suggested by Knuth for combining multiple words
> into
> single words or otherwise compute hashes for multiword keys involve
> the
> golden ration.
He suggests adding or xoring the elements of a string together before
hashing. In practice this has a bunch of known terrible failure
modes. We rather particularly don't want to use a commutative
combining operator, or (as he notes) "XY" and "YX" will have the same
hash. I tried various variations on string hashing before arriving
at the one below (which is not too different from the old one, but
contains an unrelated pre-multiplier).
I actually did do a bakeoff between multiple hashing schemes, and
have a version of Data.HashTable that requires separate import of a
hashing technique. The multiplicative hash worked better in the
bakeoff than either the modulus hash used in older versions of the
library, or a couple of versions of the Bob Jenkins hash.
Could I convince someone with a checked out repository to cut and
paste this into Data.HashTable and generate a patch / check it in?
Thanks,
-Jan
Here's the fix:
--
------------------------------------------------------------------------
-----
-- Sample hash functions
-- $hash_functions
--
-- This implementation of hash tables uses the low-order /n/ bits of
the hash
-- value for a key, where /n/ varies as the hash table grows. A good
hash
-- function therefore will give an even distribution regardless of /n/.
--
-- If your keyspace is integrals such that the low-order bits between
-- keys are highly variable, then you could get away with using
'fromIntegral'
-- as the hash function.
--
-- We provide some sample hash functions for 'Int' and 'String' below.
golden :: Int32
golden = 1013904242 -- = round ((sqrt 5 - 1) * 2^32) :: Int32
-- was -1640531527 = round ((sqrt 5 - 1) * 2^31) :: Int32
-- but that has bad mulHi properties (even adding 2^32 to get its
inverse)
-- Whereas the above works well and contains no hash duplications for
-- [-32767..65536]
hashInt32 :: Int32 -> Int32
hashInt32 x = mulHi x golden + x
-- | A sample (and useful) hash function for Int and Int32,
-- implemented by extracting the uppermost 32 bits of the 64-bit
-- result of multiplying by a 33-bit constant. The constant is from
-- Knuth, derived from the golden ratio:
-- > golden = round ((sqrt 5 - 1) * 2^32)
-- We get good key uniqueness on small inputs
-- (a problem with previous versions):
-- (length $ group $ sort $ map hashInt [-32767..65536]) == 65536 +
32768
--
hashInt :: Int -> Int32
hashInt x = hashInt32 (fromIntegral x)
-- hi 32 bits of a x-bit * 32 bit -> 64-bit multiply
mulHi :: Int32 -> Int32 -> Int32
mulHi a b = fromIntegral (r `shiftR` 32)
where r :: Int64
r = fromIntegral a * fromIntegral b
-- | A sample hash function for Strings. We keep multiplying by the
-- golden ratio and adding. The implementation is:
--
-- > hashString = foldl' f golden
-- > where f m c = fromIntegral (fromEnum c) * magic + hashInt32 m
-- > magic = 0xdeadbeef
--
-- Where hashInt32 works just as hashInt shown above.
--
-- Knuth argues that repeated multiplication by the golden ratio
-- will minimize gaps in the hash space, and thus it's a good choice
-- for combining together multiple keys to form one.
--
-- Here we know that individual characters c are often small, and this
-- produces frequent collisions if we use fromEnum c alone. A
-- particular problem are the shorter low ASCII and ISO-8859-1
-- character strings. We pre-multiply by a magic twiddle factor to
-- obtain a good distribution. In fact, given the following test:
--
-- > testp :: Int32 -> Int
-- > testp k = (n - ) . length . group . sort . map hs . take n $ ls
-- > where ls = [] : [c : l | l <- ls, c <- ['\0'..'\xff']]
-- > hs = foldl' f golden
-- > f m c = fromIntegral (fromEnum c) * k + hashInt32 m
-- > n = 100000
--
-- We discover that testp magic = 0.
hashString :: String -> Int32
hashString = foldl' f golden
where f m c = fromIntegral (fromEnum c) * magic + hashInt32 m
magic = 0xdeadbeef
More information about the Libraries
mailing list