[Haskell] Fingerprints and hashing
Norman Ramsey
nr at eecs.harvard.edu
Mon Oct 15 13:32:09 EDT 2007
Simon Peyton Jones writes:
> We are all familiar with the idea of an MD5 checksum, which provides a
> reliable "fingerprint" for a file, usually 128 bits or so...
> For various applications (including identifying common sub-expressions, and
> version tracking in GHC), I'd like a Haskell library that supports simple
> fingerprint operations...
>
> Below I give an informal specification of such a library. I can think of
> any number of implementations, but I'd like one that is (a) robust and (b)
> efficient. By "robust" I meant that with, say, a 128-bit fingerprint the
> chance of accidental collision is truly low; a non-robust implementation
> uses the 128-bit space in a non-uniform way [trivial example: gets stuck on
> zero]...
> The module signature would be something like this:
>
> data Fingerprint
> initialPrint :: Fingerprint
>
> class Fingerprintable t where
> fingerPrint :: t -> Fingerprint -> FingerPrint
>
> instance Fingerprintable Char
> instance Fingerprintable Int
> instance Fingerprintable Word32
> -- etc
>
> The fingerPrint operation here is expressed in the traditional way: an
> accumulator (say 128 bits) into which one stuffs successive values (e.g. a
> byte stream).
>
> However, very important, I also want a more compositional form:
>
> instance Fingerprintable Fingerprint
>
> so that I can do something like
>
> fp :: Expr -> Fingerprint
> fp (App e1 e2) = fp e1 `fingerPrint` fp e2
[For Salil anad Michael: apply the fingerPrint function to a fingerprint]
> An obvious implementation of "instance Fingerprintable Fingerprint" is to
> divide one fingerprint into words, and feed it into the other via the
> Word32 instance. But it's not clear to me whether or not that is robust...
>
> I'd like one other thing: to be able to trade cost for accuracy. I'd like
> a 128-bit fingerprint with essentially zero chance of having two distinct
> values that map to the same fingerprint; and a one-word fingerprint that
> was a lot more efficient, but where collisions would be entirely possible
> (again hash-consing is an example).
Simon,
If I wanted this functionality, I'd call in the professionals (like
Salil Vadhan, Michael Rabin, and Andrei Broder, whom I have cc'ed on
this message). You sent this mail to the Haskell list, but I think
the Haskell content of this problem is trivial compared to the
algorithmic problems you pose. I remember, for example, a problem in
the distant past of SRC Modula-3, where fingerprinting did not behave
as expected because the source of the fingerprint module was itself
fingerprinted.
Here are some algorithmic references:
http://tinyurl.com/yoeusg, http://tinyurl.com/2a2yfj
- Modula-3 code for compositional 64-bit fingerprints,
polynomial chosen by (literally) flipping a coin
http://citeseer.ist.psu.edu/broder93some.html
- Andrei's paper on applications of fingerprinting
How does laziness play into this problem? If you don't mind building
up a lot of thunks, you can probably just use several fingerprinting
algorithms in parallel and then pull just the one you need...
Norman
More information about the Haskell
mailing list