[Haskell] Fingerprints and hashing

Mon Oct 15 13:32:09 EDT 2007

Simon Peyton Jones writes:

 > We are all familiar with the idea of an MD5 checksum, which provides a
 > reliable "fingerprint" for a file, usually 128 bits or so...

 > For various applications (including identifying common sub-expressions, and
 > version tracking in GHC), I'd like a Haskell library that supports simple
 > fingerprint operations...
 > 
 > Below I give an informal specification of such a library.  I can think of
 > any number of implementations, but I'd like one that is (a) robust and (b)
 > efficient.  By "robust" I meant that with, say, a 128-bit fingerprint the
 > chance of accidental collision is truly low; a non-robust implementation
 > uses the 128-bit space in a non-uniform way [trivial example: gets stuck on
 > zero]...

 > The module signature would be something like this:
 > 
 >   data Fingerprint
 >   initialPrint :: Fingerprint
 > 
 >   class Fingerprintable t where
 >     fingerPrint :: t -> Fingerprint -> FingerPrint
 > 
 >   instance Fingerprintable Char
 >   instance Fingerprintable Int
 >   instance Fingerprintable Word32
 >   -- etc
 > 
 > The fingerPrint operation here is expressed in the traditional way: an
 > accumulator (say 128 bits) into which one stuffs successive values (e.g. a
 > byte stream).
 > 
 > However, very important, I also want a more compositional form:
 > 
 >   instance Fingerprintable Fingerprint
 > 
 > so that I can do something like
 > 
 >   fp :: Expr -> Fingerprint
 >   fp (App e1 e2) = fp e1 `fingerPrint` fp e2

[For Salil anad Michael: apply the fingerPrint function to a fingerprint]

 > An obvious implementation of "instance Fingerprintable Fingerprint" is to
 > divide one fingerprint into words, and feed it into the other via the
 > Word32 instance.  But it's not clear to me whether or not that is robust...
 > 
 > I'd like one other thing: to be able to trade cost for accuracy.  I'd like
 > a 128-bit fingerprint with essentially zero chance of having two distinct
 > values that map to the same fingerprint; and a one-word fingerprint that
 > was a lot more efficient, but where collisions would be entirely possible
 > (again hash-consing is an example).

Simon,

If I wanted this functionality, I'd call in the professionals (like
Salil Vadhan, Michael Rabin, and Andrei Broder, whom I have cc'ed on
this message).  You sent this mail to the Haskell list, but I think
the Haskell content of this problem is trivial compared to the
algorithmic problems you pose.  I remember, for example, a problem in
the distant past of SRC Modula-3, where fingerprinting did not behave
as expected because the source of the fingerprint module was itself
fingerprinted.

Here are some algorithmic references:

  http://tinyurl.com/yoeusg, http://tinyurl.com/2a2yfj
    - Modula-3 code for compositional 64-bit fingerprints,
      polynomial chosen by (literally) flipping a coin

  http://citeseer.ist.psu.edu/broder93some.html
    - Andrei's paper on applications of fingerprinting

How does laziness play into this problem?  If you don't mind building
up a lot of thunks, you can probably just use several fingerprinting
algorithms in parallel and then pull just the one you need...

Norman