[Haskell] Re: Fingerprints and hashing
apfelmus
apfelmus at quantentunnel.de
Thu Oct 11 16:33:03 EDT 2007
Simon Peyton-Jones wrote:
> For various applications (including identifying common sub-expressions,
> and version tracking in GHC), I'd like a Haskell library that supports
> simple fingerprint operations.
Note that for CSE and hash-consing, there is a no-collision yet very
efficient way of fingerprinting (with a drawback, of course :).
The starting point is the question "why does cryptographic hashing
work"? I mean, uniquely serializing (gödel numbering) a tree like
newtype Exp = Exp (ExpF Exp)
data ExpF a = Zero | One | Add a a | Mult a a
deriving (Functor, Foldable, Traversable)
takes an amount of bits proportional to its size (number of
constructors). So, how can we expect a simple 128-bit number to have
only few collisions for a collection of medium-sized trees? The answer
is that the collection is much too small, no computer will ever be able
to count (or construct said trees) from 1 to 2^128 in eons.
So, the idea is to use a "local gödel numbering" and uniquely number the
and only the trees that are actually constructed (no collisions, but few
in numbers). In other words, every new tree created gets the gödel
number size collection + 1 and we can simply use
type GödelNumber = Word32
assuming that no more than 4G of trees will ever be constructed. For
fast hash-consing, the collection itself is a (generalized) trie
type Collection = ExpF GödelNumber ~~> Maybe GödelNumber
mapping each new top of a tree (constructor + gödel numbers for already
known subtrees) to either Just its gödel number or Nothing when it's not
in the collection yet.
With this, CSE becomes a catamorphism that allocates new gödel numbers
if necessary
type StateC a = Collection -> (Collection, a) deriving (Monad)
cse :: Exp -> StateC GödelNumber
cse = cata hashCons
hashCons :: ExpF (StateC GödelNumber) -> StateC GödelNumber
hashCons x0 = \c0 -> let (c,x) = sequence x0 c0 in
case lookup x c of
Maybe g -> (c,g)
Nothing -> let g = size c + 1; c' = insert x g c in (c',g)
where
sequence :: ExpF (StateC GödelNumber) -> StateC (ExpF GödelNumber)
sequence = Data.Traversable.sequence
CSE in the sense that all common subexpressions now have the same gödel
number.
Of course, the drawback compared to "free-form" fingerprinting is that
the fingerprinting and the collection now depend on each other. But for
CSE, we have to carry the collection around anyway.
I don't know any references for that method since I came up with it
myself and haven't searched around yet. Any pointers?
Regards,
apfelmus
More information about the Haskell
mailing list