[Haskell-cafe] Re: hamming distance allocation
Heinrich Apfelmus
apfelmus at quantentunnel.de
Tue Apr 20 04:52:10 EDT 2010
Daniel Fischer wrote:
> Heinrich Apfelmus:
>>
>> For instance, your expression can be replaced by
>>
>> filter (/=0) [hammingX x y | (x:xs) <- tails example, y <- xs]
>>
>> which cuts the total running time in half. It's still quadratic in the
>> length of example. I'm sure there are faster algorithms out there that
>> can bring it down to O(n log n) if you want.
>
> I don't think so. You can't calculate the Hamming distance of x and z from
> the distances between x and y and y and z, so you'd have to consider all
> pairs of distinct strings, wouldn't you?
And there I was sure about something once, only to see that it's
actually really doubtful... ;)
The thing about the Hamming distance is that it's not a black box, so
you can't get a lower bound by counting the minimum number of calls to
hamming that have to be made. You are essentially arguing that the
different Hamming distances are independent, which they are not. Not to
mention that there are also "black-box" restrictions like the triangle
inequality
d(x,z) <= d(x,y) + d(y,z)
but that one is probably harmless. In short, the situation is similar to
how the sorting bound O(n*log n) does not apply to radix sort.
Still, you are right to question my O(n*log n) claim; so far, my
attempts at finding such an algorithm myself have failed.
More precisely, the goal is to make a histogram of the possible Hamming
distances. We need at least O(n*w) time to do that, where n is the
number of strings and w their maximum length; after all, we have to
"touch" every character. For simplicity, assume that the characters are
just one bit each. Furthermore, we can assume that w <= log n, otherwise
there are lots of duplicate strings which can be grouped together. In
this notation, the simple algorithm takes O(n^2*w) time.
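(For concreteness, here is a sketch of that simple algorithm; the
histogram representation via Data.Map and the zipWith definition of
hamming are my choices for illustration, not fixed by the thread.)

    import Data.List (tails)
    import qualified Data.Map as Map

    -- Hamming distance of two equal-length bit strings.
    hamming :: [Bool] -> [Bool] -> Int
    hamming x y = length (filter id (zipWith (/=) x y))

    -- Naive O(n^2*w) histogram: tally the distance of every
    -- unordered pair of strings.
    histogram :: [[Bool]] -> Map.Map Int Int
    histogram strings = Map.fromListWith (+)
        [ (hamming x y, 1) | (x:xs) <- tails strings, y <- xs ]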
I did find a straightforward divide-and-conquer algorithm to tally small
Hamming distances, but it's not good enough for the whole histogram.
Here's the specification:
  countHamming :: Int -> [[Bool]] -> [[Bool]] -> Int
  countHamming d xs ys = length [() | x <- xs, y <- ys, hamming x y == d]
In other words, countHamming d xs ys counts the number of pairings
(x,y) whose Hamming distance is exactly d.
Now, the idea is that this can be made faster for small d. For
instance, for d == 0, we are essentially just counting the pairs with
x == y, i.e. the elements that xs and ys have in common. By
requiring that xs and ys be sorted, this can be done in linear time
  countHamming 0 xs ys = ... a variation on merge xs ys
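(Spelled out, one way to write that merge variation; the standalone
name countHamming0 and the run-counting details are my own sketch:)

    -- Count pairs (x,y) with x == y, for xs and ys sorted ascending.
    -- Runs of equal elements contribute the product of their lengths.
    countHamming0 :: Ord a => [a] -> [a] -> Int
    countHamming0 [] _  = 0
    countHamming0 _  [] = 0
    countHamming0 xs@(x:_) ys@(y:_)
        | x < y     = countHamming0 (dropWhile (== x) xs) ys
        | x > y     = countHamming0 xs (dropWhile (== y) ys)
        | otherwise = length (takeWhile (== x) xs)
                    * length (takeWhile (== x) ys)
                    + countHamming0 (dropWhile (== x) xs)
                                    (dropWhile (== x) ys)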
And for slightly larger d, we can partition the lists by their first
bits and continue recursively
  -- d >= 1; the d == 0 clause above comes first.
  countHamming _ [] _  = 0
  countHamming _ _  [] = 0
  countHamming d xs ys =
        countHamming (d-1) x0 y1 + countHamming (d-1) x1 y0
      + countHamming d x0 y0 + countHamming d x1 y1
      where
      (x0,x1) = partitionByHead xs
      (y0,y1) = partitionByHead ys

  -- Split a list of bit strings by their first bit, dropping that bit.
  partitionByHead xs = (behead True xs, behead False xs)
  behead b xs = [bs | (b':bs) <- xs, b == b']
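(A quick sanity check, taking the d == 0 clause to be the merge
version above; the 2-bit example set is made up for illustration:)

    -- All four 2-bit strings; every string has exactly two
    -- neighbours at distance 1, so we expect 4*2 = 8 pairs.
    example :: [[Bool]]
    example = [[a,b] | a <- [False,True], b <- [False,True]]

    -- ghci> countHamming 1 example example
    -- 8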
To estimate the running time, we set n = length (xs ++ ys) and let
T(d,n) = running time of countHamming d xs ys
We started with
T(0,n) = O(n)
and want to discuss the recursive case. The idea is that partitionByHead
divides each list roughly in half, so we get
T(d,n) = 2*T(d-1,n/2) + 2*T(d,n/2)
From this, we can calculate
T(1,n) = 2*T(0,n/2) + 2*T(1,n/2)
= O(n) + 2*T(1,n/2) -- remember quicksort!
= O(n*log n)
T(2,n) = O(n*log n) + 2*T(2,n/2)
= O(n*(log n)^2)
and so on, yielding
T(d,n) = O(n*(log n)^d)
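(To see the growth without redoing the induction, here is a tiny cost
model of the recurrence; illustrative only, with unit constants and n
restricted to powers of two:)

    -- t d n models T(d,n):
    --   T(d,1) = 1,  T(0,n) = n,  T(d,n) = 2*T(d-1,n/2) + 2*T(d,n/2)
    t :: Int -> Int -> Int
    t _ 1 = 1
    t 0 n = n
    t d n = 2 * t (d-1) (n `div` 2) + 2 * t d (n `div` 2)

    -- ghci> [ t 2 (2^k) | k <- [1..8] ]   -- grows like n*(log n)^2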
Alas, this can be used to search a dictionary while accounting for
spelling errors, but it's no good for calculating a full histogram
because it becomes prohibitively expensive when d ~ w/2 ~ (log n)/2.
Regards,
Heinrich Apfelmus
--
http://apfelmus.nfshost.com