[Haskell] Re: Fingerprints and hashing

apfelmus apfelmus at quantentunnel.de
Thu Oct 11 16:33:03 EDT 2007


Simon Peyton-Jones wrote:

> For various applications (including identifying common sub-expressions,
 > and version tracking in GHC), I'd like a Haskell library that supports
 > simple fingerprint operations.

Note that for CSE and hash-consing, there is a no-collision yet very 
efficient way of fingerprinting (with a drawback, of course :).

The starting point is the question "why does cryptographic hashing 
work"? I mean, uniquely serializing (gödel numbering) a tree like

   newtype Exp = Exp (ExpF Exp)
   data ExpF a = Zero | One | Add a a | Mult a a
               deriving (Functor, Foldable, Traversable)

takes an amount of bits proportional to its size (number of 
constructors). So, how can we expect a simple 128-bit number to have 
only few collisions for a collection of medium-sized trees? The answer 
is that the collection is much too small, no computer will ever be able 
to count (or construct said trees) from 1 to 2^128 in eons.

So, the idea is to use a "local gödel numbering" and uniquely number the 
and only the trees that are actually constructed (no collisions, but few 
in numbers). In other words, every new tree created gets the gödel 
number  size collection + 1  and we can simply use

   type GödelNumber = Word32

assuming that no more than 4G of trees will ever be constructed. For 
fast hash-consing, the collection itself is a (generalized) trie

   type Collection  = ExpF GödelNumber ~~> Maybe GödelNumber

mapping each new top of a tree (constructor + gödel numbers for already 
known subtrees) to either Just its gödel number or Nothing when it's not 
in the collection yet.

With this, CSE becomes a catamorphism that allocates new gödel numbers 
if necessary

   type StateC a = Collection -> (Collection, a) deriving (Monad)

   cse :: Exp -> StateC GödelNumber
   cse = cata hashCons

   hashCons :: ExpF (StateC GödelNumber) -> StateC GödelNumber
   hashCons x0 = \c0 -> let (c,x) = sequence x0 c0 in
        case lookup x c of
          Maybe g -> (c,g)
          Nothing -> let g = size c + 1; c' = insert x g c in (c',g)
      where
      sequence :: ExpF (StateC GödelNumber) -> StateC (ExpF GödelNumber)
      sequence = Data.Traversable.sequence

CSE in the sense that all common subexpressions now have the same gödel 
number.

Of course, the drawback compared to "free-form" fingerprinting is that 
the fingerprinting and the collection now depend on each other. But for 
CSE, we have to carry the collection around anyway.


I don't know any references for that method since I came up with it 
myself and haven't searched around yet. Any pointers?

Regards,
apfelmus



More information about the Haskell mailing list