[Haskell-cafe] Re: hamming distance allocation

Mon Apr 19 08:36:48 EDT 2010

Am Montag 19 April 2010 14:13:53 schrieb Heinrich Apfelmus:
> Arnoldo Muller wrote:
> > I want to generate some hamming distance statistics about a set of
> > strings.
> >
> >   filter (\x -> x /= 0) $
> >     map (uncurry hammingX) [(xs, ys) | xs <- exampl, ys <- exampl]
> >
> > [...]
> >
> > -- function posted in this mailing list
> > hamming2 :: String -> String -> Int
> > hamming2 xs ys = length (filter not (zipWith (==) xs ys))
> >
> > I am executing these functions millions of times and the bottleneck of
> > my program is in them as explained by running in profiling mode with 
> > +RTS -K400M -p -RTS
> >
> > The costlier function is the hamming distance
> > COST CENTRE                    MODULE               %time %alloc
> >
> > hamming                        Distances             66.6   41.9
>
> Another way to look at it is that you shouldn't optimize  hamming
> itself, but rather make sure that it's called less often!
>
> For instance, your expression can be replaced by
>
>    filter (/=0) [hammingX x y | (x:xs) <- tails example, y <- xs]
>
> which cuts the total running time in half. It's still quadratic in the
> length of  example . I'm sure there are faster algorithms out there that

If it's likely that there are many repetitions, collect the Strings in a 
Set/Map (depending on whether you're interested in the count) and call 
hamming only on the distinct pairs.

> can bring it down to O(n log n) if you want.

I don't think so. You can't calculate the Hamming distance of x and z from 
the distances between x and y and y and z, so you'd have to consider all 
pairs of distinct strings, wouldn't you?

>
>
> Regards,
> Heinrich Apfelmus