[Haskell-cafe] Calculating the accuracy of bloom filters

Wed May 11 11:07:59 CEST 2011

Hi,

I'm using Bryan's bloomfilter package, and I would like to calculate the
false positive rate after constructing the bloom filter.  Two reasons:

1. Bloom filters have a power-of-two size, which means that the FP rate
you ask for becomes an approximation (upper bound), and the real FP rate
might be much smaller (down to the square of the requested value?)

2. When constructing the filters (inserting elements), I expect many
duplicate elements.  Since the FP rate only increases with insertion of
unique elements, the a priori estimate is likely an overestimation.

If I understand correctly, this is a simple matter of counting 1-bits in
the array, and then taking the proportion of set bits ('b', say) to the
power of (number of hash functions - 'h', say).  Calculating b is easily
done, but I don't know how to get at the number of hash functions.  The
Bloom filter data structure contains

  data Bloom a = B {
      hashB     ::  !(a -> [Hash])
    , shiftB    ::  !Int
    , maskB     ::  !Int
    , bitArrayB ::  !(UArray Int Hash)
    }

I'm not sure how shiftB/maskB work, but I do have a function hashB that
for any a will generate the list of hashes, i.e., I want the length of
this list (which is constant for any a).  But I need an 'a' to feed to
it, and I don't see exactly how to do that.  

I notice that Neil's cmdargs have a value 'def', which seems to produce
a "default" value, probably using some Typeable or Dynamic magic.  Is
that something I could leverage for my purposes?

-k
-- 
If I haven't seen further, it is by standing in the footprints of giants