[Haskell] Fingerprints and hashing

Thu Oct 11 05:42:31 EDT 2007

Dear Haskellers

Here's a challenge.

We are all familiar with the idea of an MD5 checksum, which provides a reliable "fingerprint" for a file, usually 128 bits or so.  If the file changes, the fingerprint is (almost) certain to do so too.  There are lots of techniques: CRC, shar?, MD5, etc.

For various applications (including identifying common sub-expressions, and version tracking in GHC), I'd like a Haskell library that supports simple fingerprint operations.

Below I give an informal specification of such a library.  I can think of any number of implementations, but I'd like one that is (a) robust and (b) efficient.  By "robust" I meant that with, say, a 128-bit fingerprint the chance of accidental collision is truly low; a non-robust implementation uses the 128-bit space in a non-uniform way [trivial example: gets stuck on zero].

Any takers?  I bet some of you have this already, in some form.

Simon

The module signature would be something like this:

  data Fingerprint
  instance Eq Fingerprint
  instance Ord Fingerprint

  initialPrint :: Fingerprint

  class Fingerprintable t where
    fingerPrint :: t -> Fingerprint -> FingerPrint

  instance Fingerprintable Char
  instance Fingerprintable Int
  instance Fingerprintable Word32
  -- etc

The fingerPrint operation here is expressed in the traditional way: an accumulator (say 128 bits) into which one stuffs successive values (e.g. a byte stream).

However, very important, I also want a more compositional form:

  instance Fingerprintable Fingerprint

so that I can do something like

  fp :: Expr -> Fingerprint
  fp (App e1 e2) = fp e1 `fingerPrint` fp e2

Yes, I could thread the accumulator, thus:

  fp' :: Expr -> Fingerprint -> Fingerprint
  fp' (App e1 e2) p = fp' e1 (fp' e2 p)

but I can't *always* do that; or even if I can, I may already *have* a fingerprint for e1 in my hand, and it's painful to re-traverse e1.  Think of hash-consing.

An obvious implementation of "instance Fingerprintable Fingerprint" is to divide one fingerprint into words, and feed it into the other via the Word32 instance.  But it's not clear to me whether or not that is robust.

(Incidentally, one could define the class thus:
  class Fingerprintable t where
    fingerPrint :: t -> FingerPrint
plus
  plusFP :: Fingerprint -> Fingerprint -> Fingerprint
But I think it'd be less efficient if that was the only way, because of all those now-compulsory plusFP operations.)

I'd like one other thing: to be able to trade cost for accuracy.  I'd like a 128-bit fingerprint with essentially zero chance of having two distinct values that map to the same fingerprint; and a one-word fingerprint that was a lot more efficient, but where collisions would be entirely possible (again hash-consing is an example).