Cryptographic hash uniquness (was [Haskell-cafe] Simple network client)

Fri Feb 1 08:15:54 EST 2008

Bulat Ziganshin wrote:
> Hello Peter,
> 
> Thursday, January 31, 2008, 8:01:36 PM, you wrote:
> 
>> files with different content generating the same hash)... My
>> intuition told me that the odds of two cryptographic hashes (on
>> meaningful content) colliding was much less than the earth being
>> destroyed by an asteroid... But this is just intuition... What does
>> computer science tell us about this?
> 
> you may be interested to know that widely used rsync algorithms relies
> on 128-bit hashes and its author speculated about its reliability:
> 
> http://samba.org/~tridge/phd_thesis.pdf
> 

Interesting paper. Thank you. I had a quick read of the bit relating to
hashes and my understanding is this. He uses a weak, quick and simple
hash to deal with 99.99% (I invented that percentage) of cases - if the
hash is different we know the files are definitely different - if the
hash collides he then does a strong, slow, secure hash and relies on
this as the answer. But he's relying on the strong hash rather than
doing a byte by byte comparison because there is a major cost (a network
transmission of the file) to doing the proper byte by byte comparison.
If you had both files accessible at a low cost it might be better to
byte by byte compare them when you get a collision rather than use
the strong hash.

The right approach may be to assume that collisions will occur and
cater for this properly in the program somehow. A good tip for testing
this sort of thing is to have the length of the hash (maximum size of
the array or whatever you want to test) as a parameter that you can
turn down to a very low value to induce collisions (overflows etc) to
see whether the program still works. And then turn it back up for live
use.

A cryptographic hash appears as a completely random function of the
input so the likelihood of a collision is simply 2^(bits in hash).

Richard.