[Haskell] Fingerprints and hashing
Simon Peyton-Jones
simonpj at microsoft.com
Thu Oct 11 05:42:31 EDT 2007
Dear Haskellers
Here's a challenge.
We are all familiar with the idea of an MD5 checksum, which provides a reliable "fingerprint" for a file, usually 128 bits or so. If the file changes, the fingerprint is (almost) certain to do so too. There are lots of techniques: CRC, shar?, MD5, etc.
For various applications (including identifying common sub-expressions, and version tracking in GHC), I'd like a Haskell library that supports simple fingerprint operations.
Below I give an informal specification of such a library. I can think of any number of implementations, but I'd like one that is (a) robust and (b) efficient. By "robust" I meant that with, say, a 128-bit fingerprint the chance of accidental collision is truly low; a non-robust implementation uses the 128-bit space in a non-uniform way [trivial example: gets stuck on zero].
Any takers? I bet some of you have this already, in some form.
Simon
The module signature would be something like this:
data Fingerprint
instance Eq Fingerprint
instance Ord Fingerprint
initialPrint :: Fingerprint
class Fingerprintable t where
fingerPrint :: t -> Fingerprint -> FingerPrint
instance Fingerprintable Char
instance Fingerprintable Int
instance Fingerprintable Word32
-- etc
The fingerPrint operation here is expressed in the traditional way: an accumulator (say 128 bits) into which one stuffs successive values (e.g. a byte stream).
However, very important, I also want a more compositional form:
instance Fingerprintable Fingerprint
so that I can do something like
fp :: Expr -> Fingerprint
fp (App e1 e2) = fp e1 `fingerPrint` fp e2
Yes, I could thread the accumulator, thus:
fp' :: Expr -> Fingerprint -> Fingerprint
fp' (App e1 e2) p = fp' e1 (fp' e2 p)
but I can't *always* do that; or even if I can, I may already *have* a fingerprint for e1 in my hand, and it's painful to re-traverse e1. Think of hash-consing.
An obvious implementation of "instance Fingerprintable Fingerprint" is to divide one fingerprint into words, and feed it into the other via the Word32 instance. But it's not clear to me whether or not that is robust.
(Incidentally, one could define the class thus:
class Fingerprintable t where
fingerPrint :: t -> FingerPrint
plus
plusFP :: Fingerprint -> Fingerprint -> Fingerprint
But I think it'd be less efficient if that was the only way, because of all those now-compulsory plusFP operations.)
I'd like one other thing: to be able to trade cost for accuracy. I'd like a 128-bit fingerprint with essentially zero chance of having two distinct values that map to the same fingerprint; and a one-word fingerprint that was a lot more efficient, but where collisions would be entirely possible (again hash-consing is an example).
More information about the Haskell
mailing list