[Haskell-cafe] Bloom Filter

Wed May 2 08:21:50 EDT 2007

Hi Andrew,

Thanks for the comments, it really helps to have someone else's
opinion on my code.  I'll be applying what you've said as soon as I
get a chance and I'm sure I'll have some more questions then. I'll
certainly look more closely at the Set interface and try and duplicate
all the parts which make sense.

I've been using Darcs for a while with non-haskell projects as well as
this project, however it seems that cabal strips out the darcs
meta-data when making up a distribution tar file. Is there an option
to have it include the darcs stuff? it seems like it could be quite
useful and I can't really see a downside. If you're interested the
Darcs repository is at:

http://www.almostobsolete.net/bloom/

Tom

On 5/1/07, ajb at spamcop.net <ajb at spamcop.net> wrote:
> G'day.
>
> Quoting tom <tom at almostobsolete.net>:
>
> > I'm pretty new to Haskell, I've been working on a Bloom filter[1]
> > implementation as a learning exercise.
>
> Excellent!  Sounds like a fun test.
>
> > I'd really appreciate it if someone more experienced would comment on
> > the code. I'm sure there's plenty of places where I'm doing things in
> > silly or overly complex ways.
>
> Sure.
>
> All in all, very well done.  It works, and it looks pretty efficient.
> My quibbles are mostly stylistic or syntactic in nature.  Please
> understand that the relative triviality of my quibbles is a sign that
> there are really no major problems.
>
> This is not a criticism, but more an advertisement: What are you using
> for source control here?  Darcs is nice, and as a bonus, it's trivially
> browsable from a web browser, which saves downloading and unpacking.
>
> General comments:
>
> You overuse parentheses.  A lot.  Definitions like this:
>
>     ary = (listArray (0, wordc-1) (repeat 0))
>
> don't need parentheses around them, and just add to the general noise
> level.
>
> And (.&. ((size b)-1)) is much more cleanly expressed as (.&. (size b - 1)).
>
> Rather than carrying around a hash function, it might be better to use
> a type class:
>
>     class BloomHash k where
>         bloomHash :: k -> [Word8]
>
> In wordsize:
>
> You don't need to hard-code this.  You can use:
>
>     wordsize = bitSize (undefined::Word32)  -- Or Int, of course!
>
> bitSize is defined in Data.Bits.
>
> In splitup:
>
> I got a bit confused by the local binding names.  It's usual, especially
> in generic code, to use "xs", "ys" etc for a list of "x" and "y".
> Something like this might be more idiomatic:
>
> splitup n xs = let (xs1, xs2) = splitAt n xs
>                in xs1 : splitup n xs2
>
> In indexes:
>
>     (fromIntegral $ x `div` wordsize, fromIntegral $ x .&. (wordsize-1))
>
> Seems intuitively wasteful.  Either use divMod or bit operations.
>
> Similarly, (hashfunc b) key is the same as hashfunc b key.  But even
> better is:
>
>     split bytecount . hashfunc b $ key
>
> That makes it obvious that it's a pipeline of functions applied to the key.
>
> This looks cool:
>
>     bytes2int = foldr ((. (256 *)) . (+)) 0 . (map toInteger)
>
> but I'm not smart enough to parse it.  This is both more readable and
> shorter:
>
>     bytes2int = foldr (\x r -> r*256 + fromInteger x) 0
>
> Integer log2's are probably better done using integers only, or at least
> abstracted out into a separate function.
>
> In bloom:
>
> Function guards are your friends!  This:
>
>     bloom hf sz hc = if condition
>                      then b
>                      else error "Badness"
>
> is almost always better expressed as:
>
>     bloom hf sz hc
>       | condition = b
>       | otherwise = error "Badness"
>
> You can now inline b.  (I can see why you put it in a where clause; now
> you don't have to.)
>
> wordc, again, only needs integral arithmetic:
>
>     wordc = ceiling ((fromIntegral a) / (fromIntegral b :: Double))
>
> is more or less:
>
>     wordc = (a+b-1) `div` b
>
> And drop the parentheses around the definition of ary.
>
> In add:
>
> Try to use function names that are close to names in existing libraries,
> like Data.Set.  "insert" sounds better here.
>
> Also, rather than this:
>
>     add :: Bloom a -> a -> Bloom a
>
> a better argument order is this:
>
>     insert :: a -> Bloom a -> Bloom a
>
> That way, you can use it with foldr.
>
> In test:
>
> Again, probably misnamed.  Data.Set calls this "member".  And again,
> arguably the wrong argument ordering.
>
> Once again, well done.
>
> Cheers,
> Andrew Bromage
> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe at haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe
>