Data.HashTable.hashInt seems somewhat sub-optimal
Jan-Willem Maessen
jmaessen at alum.mit.edu
Thu Aug 30 13:14:53 EDT 2007
On Aug 30, 2007, at 9:56 AM, <Simon.Frankau at barclayscapital.com> wrote:
> Jan-Willem Maessen wrote:
>> Sigh.
>>
>> I've included a fix below, but I don't have a standard GHC checkout
>> available to me (only the test sandbox I used to get the code in the
>> first place) so I haven't generated a patch. Note the mentioned
>> "reasonability tests", which hopefully should forestall such obvious
>> infelicities in future.
>
> Hi.
>
> Sorry to be a bit negative, but I don't think the hash function you
> suggest is good for a number of reasons:
>
> 1) I don't think you're implementing the algorithm Thorkil Naur
> suggests. I believe (although I haven't had the time to check) that
> the scheme is equivalent to using the output of mulLo, not mulHi.
> So, we lose whatever theoretical underpinnings the original has.
Thorkil's scheme (from Knuth) used the higher-order bits. As I noted
in my previous email, we're using the low-order bits to index the
hash table. This was forced on me by legacy hash functions.
A compromise position here (mentioned in my last mail) might be to
post-multiply any incoming hash, and then pick off the high-order
bits. This penalizes programmers who take the time to write a decent
hash function for their data type.
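
To make that compromise concrete, here is a minimal sketch (postMix
is a hypothetical name, not anything in the library): treat the
incoming hash as unsigned, post-multiply by the golden constant, and
take the table index from the top n bits of the product rather than
the bottom n bits of the raw hash.

import Data.Bits (shiftR)
import Data.Int (Int32)
import Data.Word (Word32)

golden32 :: Word32
golden32 = 1013904242  -- the 'golden' constant from the fix, viewed unsigned

-- Hypothetical post-mix: multiply mod 2^32 and index with the high
-- bits, so even a weak user-supplied hash gets smeared.
postMix :: Int -> Int32 -> Int
postMix n h = fromIntegral ((golden32 * fromIntegral h) `shiftR` (32 - n))

For a table of size 2^10, postMix 10 h yields an index in [0..1023].
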
> 2) Signed multiplication is being used. I think unsigned
> multiplication is probably necessary to get the intended behaviour.
> For example, with unsigned multiplication, multiplying by the
> original 'golden' and taking the top 32 bits is equivalent to
> multiplying by the golden ratio, while with signed multiplication
> it is equivalent to multiplying by the golden ratio minus one.
Actually, I've simply used the bits directly. Signed two's-complement
multiplication produces exactly the same low-order bits as unsigned
multiplication (with all the unintuitive overflow behavior that
entails), viz.:
> Numeric.showHex ((0x80000001 :: Int32) * 2) ""
"2"
> Numeric.showHex ((0xa0000000 :: Int32) * 2) ""
"40000000"
> 3) The original code was equivalent to a multiplication by a
> fixed-point number less than one.
Yes, and the low-order bits of the hash are computed based only on
the low-order bits of the input data.
> The new code is equivalent to multiplying by a fixed-point number
> slightly greater than one. I think this misses an important
> behaviour which hash functions should have: they should smear out
> ranges of integers.
Let's keep the intended application in mind: we're going to be taking
low-order bits from the hashed data. We only need enough smearing to
make those low-order bits well distributed. That said, I'll quickly
agree that we'd like more smearing than that if we can get it.
> Why? People tend to hash lists with a fold which, for example,
> hashes the previous result, and then xors in or adds the new value.
This is in general a rotten way to hash a list of things, which is
why presentations of hashing don't use it for hashing strings.
Interestingly, Knuth does not cover this topic in detail and mentions
the use of addition or xor only offhandedly.
> If the hashes of small integers are small integers, we have the
> situation where transposing two elements of a list of small
> integers, and making a minor adjustment to one of those values, will
> result in a clash. In other words, we're not making effective use of
> the available range of numbers. Hence, for example, the use of
> 'magic' in hashString.
We effectively have a choice here:
   1) Use a decent hash for individual elements, and combine them in
      a simple way (e.g. sum)
   2) Use a decent method for combining elements
I've tried to use (2). Indeed, if I were designing the library from
scratch I'd include a function for combining hashes together; a
minimal sketch of what that might look like follows.
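
For what it's worth, here is such a sketch (combine and hashList are
hypothetical names; the mixing step reuses the hashInt32 from the
fix quoted at the bottom of this message):

import Data.Bits (shiftR)
import Data.Int (Int32, Int64)
import Data.List (foldl')

golden :: Int32
golden = 1013904242

-- hi 32 bits of the 64-bit product, as in the fix below
mulHi :: Int32 -> Int32 -> Int32
mulHi a b = fromIntegral (r `shiftR` 32)
  where r = fromIntegral a * fromIntegral b :: Int64

hashInt32 :: Int32 -> Int32
hashInt32 x = mulHi x golden + x

-- Remix the accumulated hash before adding the next element's hash,
-- so that element order affects the result.
combine :: Int32 -> Int32 -> Int32
combine acc h = hashInt32 acc + h

-- Hash a whole list, given a per-element hash function.
hashList :: (a -> Int32) -> [a] -> Int32
hashList h = foldl' (\acc x -> combine acc (h x)) golden
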
> I understand your concerns that
>
> A) The hash function should be cheap to evaluate, and need not be
> particularly strong.
>
> However, it should not be unnecessarily weak. A decent hash should
> allow small sets of values (e.g. short lists of ints) to hash to
> unique values. If they do not, scaling falls apart, since rehashing
> with larger table sizes will not necessarily shorten the longer hash
> chains. I think we can find a hash function with better behaviour
> which is still cheap to evaluate.
I think I've demonstrated that they do. (Actually, I tried an
additional test last night in which I had testp draw only from
['a'..'z'], and that worked well too, especially with Ian's
simplified function using + instead of xor.) Perhaps you disagree?
But I'm happy to entertain suggestions of cheap hash functions.
Recall that you can create a hash table with any hash function you
like; testing better hash functions with Data.HashTable is really
pretty easy and doesn't require the source code to the library at all!
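
For example, here is roughly what such an experiment looks like (a
sketch against the Data.HashTable interface, where 'new' takes an
equality test and a hash function; myHash is a deliberately
simplistic stand-in, the point being the plumbing, not the mixing):

import qualified Data.HashTable as HT
import Data.Int (Int32)

-- Any function of type key -> Int32 can be supplied; no library
-- changes are required to test it.
myHash :: Int -> Int32
myHash x = fromIntegral x * 1013904242  -- toy multiplicative mix

main :: IO ()
main = do
  t <- HT.new (==) myHash               -- equality test and hash function
  mapM_ (\k -> HT.insert t k (k * k)) [1 .. 1000 :: Int]
  v <- HT.lookup t 42
  print v                               -- Just 1764
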
> B) If the low bits are zero, it would be nice if the low bits of the
> hash weren't always zero.
>
> A scheme using mulLo doesn't have good behaviour here. Possible
> variations which are quick to evaluate and have this property are:
> * xor/add together the output of mulHi and mulLo
This is worth a try, since we're (sort of) doing the work anyhow. [I
say sort of because on many architectures mulHi and mulLo are
independent instructions, though I believe on x86 one instruction
does both.]
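
A quick sketch of that variation (hashInt32X is an illustrative
name, not part of the proposed patch):

import Data.Bits (shiftR, xor)
import Data.Int (Int32, Int64)

golden :: Int32
golden = 1013904242

-- xor the high and low 32-bit halves of the 64-bit product; the
-- high half perturbs the low-order bits of the result even when
-- the input's low-order bits are all zero.
hashInt32X :: Int32 -> Int32
hashInt32X x = hi `xor` lo
  where r  = fromIntegral x * fromIntegral golden :: Int64
        hi = fromIntegral (r `shiftR` 32)
        lo = fromIntegral r
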
> * Put a rotation step in.
Rotation, of course, just moves the problem elsewhere.
> The downside to these is that these 'adjustments' could well break
> neat behaviour seen in the theory. I'm afraid I don't have the time
> to test these alternatives.
>
> So, I'm sorry I don't really have anything particularly constructive
> to say, but I thought it still worth mentioning what I think are
> weaknesses in the scheme.
I agree that it's not without flaws (no cheap hash is, and the high-
order zeros are definitely a concern). But I'll reiterate that I
believe it's pretty good *for the intended application* while
agreeing that *something better probably exists.*
-Jan
>
> Thanks,
> Simon Frankau.
>
>> On Aug 26, 2007, at 1:42 PM, Thorkil Naur wrote:
>>
>>> Hello,
>> [snip]
>>> In the 2nd edition of Knuth's The Art of Computer Programming,
>>> Vol 3, Sorting and Searching, there is a discussion of hash
>>> functions on pp. 514-520. One of the techniques suggested for
>>> hashing a one-word (i.e. essentially fixed-size) key is the
>>> following multiplicative scheme:
>>>
>>> h(K) = floor ( M * (((A/w) * K) mod 1) )
>>>
>>> where w is the word-size (say, 2^32), M is the desired limit of
>>> the hash function (for efficiency, probably a suitable power of 2)
>>> and, finally, A is some integer constant. What happens here is
>>> that we consider the (word) K as a fraction with the binary point
>>> at the left end of the word rather than at the right, thus getting
>>> a fraction with a value between 0 and 1. This value we then
>>> multiply by A and cut off the integer part, once again getting a
>>> fractional value between 0 and 1. And finally, we multiply by M
>>> and cut away the fractional part to get an integer value between
>>> 0 and M-1. And, sure, Knuth suggests various variants of selecting
>>> the multiplier A related to the golden ratio (sqrt(5)-1)/2 =
>>> 0.6180... to gain suitable spreading of hashes for keys in
>>> arithmetic progressions (K, K+d, K+2d, ...).
>>
>> In the fix below I ended up using twice the golden ratio (a value of
>> A greater than one). The inverse of the golden ratio (which is 1 +
>> golden) didn't work well at all.
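
To make the quoted scheme concrete, here is a sketch of the one-word
multiplicative hash with w = 2^32 and M = 2^n (multHash is an
illustrative name; since the product is taken mod 2^32, only A mod w
matters, so the wrapped constant below stands in for "twice the
golden ratio"):

import Data.Bits (shiftR)
import Data.Word (Word32)

-- h(K) = floor (M * (((A/w) * K) mod 1)): in fixed point,
-- ((A/w) * K) mod 1 is the low 32 bits of A*K, and multiplying by
-- M = 2^n then flooring is a right shift by (32 - n).
multHash :: Int -> Word32 -> Word32
multHash n k = (a * k) `shiftR` (32 - n)
  where a = 1013904242 :: Word32  -- round ((sqrt 5 - 1) * 2^32) mod 2^32

multHash 10 k, for example, gives a table index in [0..1023].
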
>>
>>> But what we are dealing with in the hashString function is what
>>> Knuth would call a multiword or variable-length key. Such cases,
>>> Knuth suggests, "can be handled by multiple-precision extensions
>>> of [e.g. the multiplicative scheme] above, but it is generally
>>> adequate to speed things up by combining the individual words
>>> together into a single word, then doing a single multiplication
>>> ... as above."
>>
>> But combining things into a single word requires having a good
>> combining mechanism, which is hard in general---particularly with a
>> type like Char that appears to have a large range but in practice
>> only has a very small one.
>>
>>> Neither of the above definitions of f implements a multiple-
>>> precision extension of the multiplicative hashing scheme that
>>> involves the golden ratio.
>>
>> [snip]
>> I actually did do a bakeoff between multiple hashing schemes, and
>> have a version of Data.HashTable that requires separate import of a
>> hashing technique. The multiplicative hash worked better in the
>> bakeoff than either the modulus hash used in older versions of the
>> library, or a couple of versions of the Bob Jenkins hash.
>>
>> Could I convince someone with a checked out repository to cut and
>> paste this into Data.HashTable and generate a patch / check it in?
>>
>> Thanks,
>>
>> -Jan
>>
>>
>> Here's the fix:
>>
>>
>>
>> -- -----------------------------------------------------------------------------
>> -- Sample hash functions
>>
>> -- $hash_functions
>> --
>> -- This implementation of hash tables uses the low-order /n/ bits
>> -- of the hash value for a key, where /n/ varies as the hash table
>> -- grows. A good hash function therefore will give an even
>> -- distribution regardless of /n/.
>> --
>> -- If your keyspace consists of integrals such that the low-order
>> -- bits between keys are highly variable, then you could get away
>> -- with using 'fromIntegral' as the hash function.
>> --
>> -- We provide some sample hash functions for 'Int' and 'String'
>> -- below.
>>
>> golden :: Int32
>> golden = 1013904242 -- = round ((sqrt 5 - 1) * 2^32) :: Int32
>> -- was -1640531527 = round ((sqrt 5 - 1) * 2^31) :: Int32
>> -- but that has bad mulHi properties (even adding 2^32 to get its
>> -- inverse)
>> -- Whereas the above works well and contains no hash duplications
>> -- for [-32767..65536]
>>
>> hashInt32 :: Int32 -> Int32
>> hashInt32 x = mulHi x golden + x
>>
>> -- | A sample (and useful) hash function for Int and Int32,
>> -- implemented by extracting the uppermost 32 bits of the 64-bit
>> -- result of multiplying by a 33-bit constant. The constant is from
>> -- Knuth, derived from the golden ratio:
>> -- > golden = round ((sqrt 5 - 1) * 2^32)
>> -- We get good key uniqueness on small inputs
>> -- (a problem with previous versions):
>> -- (length $ group $ sort $ map hashInt [-32767..65536])
>> --   == 65536 + 32768
>> --
>> hashInt :: Int -> Int32
>> hashInt x = hashInt32 (fromIntegral x)
>>
>> -- hi 32 bits of a 32-bit * 32-bit -> 64-bit multiply
>> mulHi :: Int32 -> Int32 -> Int32
>> mulHi a b = fromIntegral (r `shiftR` 32)
>> where r :: Int64
>> r = fromIntegral a * fromIntegral b
>>
>> -- | A sample hash function for Strings. We keep multiplying by the
>> -- golden ratio and adding. The implementation is:
>> --
>> -- > hashString = foldl' f golden
>> -- >   where f m c = fromIntegral (fromEnum c) * magic + hashInt32 m
>> -- >         magic = 0xdeadbeef
>> --
>> -- where hashInt32 works just as the hashInt shown above.
>> --
>> -- Knuth argues that repeated multiplication by the golden ratio
>> -- will minimize gaps in the hash space, and thus it's a good choice
>> -- for combining together multiple keys to form one.
>> --
>> -- Here we know that individual characters c are often small, and
>> -- this produces frequent collisions if we use fromEnum c alone. A
>> -- particular problem is posed by the shorter low-ASCII and
>> -- ISO-8859-1 character strings. We pre-multiply by a magic twiddle
>> -- factor to obtain a good distribution. In fact, given the
>> -- following test:
>> --
>> -- > testp :: Int32 -> Int
>> -- > testp k = (n -) . length . group . sort . map hs . take n $ ls
>> -- >   where ls = [] : [c : l | l <- ls, c <- ['\0'..'\xff']]
>> -- >         hs = foldl' f golden
>> -- >         f m c = fromIntegral (fromEnum c) * k + hashInt32 m
>> -- >         n = 100000
>> --
>> -- We discover that testp magic = 0.
>>
>> hashString :: String -> Int32
>> hashString = foldl' f golden
>>   where f m c = fromIntegral (fromEnum c) * magic + hashInt32 m
>>         magic = 0xdeadbeef
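
For anyone who wants to reproduce the key-uniqueness claim in the
comments above, here is a standalone check (definitions duplicated
from the fix; it should print 98304, i.e. 65536 + 32768, if the
claim holds):

import Data.Bits (shiftR)
import Data.Int (Int32, Int64)
import Data.List (group, sort)

golden :: Int32
golden = 1013904242

mulHi :: Int32 -> Int32 -> Int32
mulHi a b = fromIntegral (r `shiftR` 32)
  where r = fromIntegral a * fromIntegral b :: Int64

hashInt :: Int -> Int32
hashInt x = let x32 = fromIntegral x in mulHi x32 golden + x32

main :: IO ()
main = print (length (group (sort (map hashInt [-32767 .. 65536]))))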