optimization question

Mon Feb 23 18:53:42 EST 2004

G'day all.

Quoting Peter Simons <simons at cryp.to>:

> If you don't mind using FFI, the tool of choice would
> probably be <http://www.gnu.org/software/gperf/>.

Perfect hash functions are actually not that much better than
"imperfect" hash functions for the case where you have keys to
search for which are not in the set.

"Imperfect" hashing requires that a key be scanned at least twice:
Once to compute the hash and once for the final equality test.  The
equality test may need to be performed more than once if two keys
hash to the same value.  A good rule of thumb is that hash collisions
are unlikely until you reach around sqrt(N) keys, where N is the size
of the hash space.  So, for example, for 32-bit hash values, you
almost certainly won't get a collision unless you insert 65,000 or so
keys, which is much more than Hal's 1,300 or so.

In the absence of hash collisions "perfect" hashing is pretty much the
same.  The only differences are a) the hash function is more expensive to
compute, and b) the equality test is guaranteed to happen at most once.

Perfect hashing is best when you know that you're not going to search for
keys that are not in the set.  For example, an algorithm which requires
two passes over the words in some set of documents could easily benefit
from perfect hashing, since the words that you'll find in the second pass
will only be those found in the first pass.

Radix-based searching, on the other hand, requires only one pass through
the key and no arithmetic.

Cheers,
Andrew Bromage