[Haskell-cafe] hamming distance allocation

John Lato jwlato at gmail.com
Mon Apr 19 08:37:33 EDT 2010


> Subject: Re: [Haskell-cafe] hamming distance allocation
>
> Am Montag 19 April 2010 01:03:14 schrieb Arnoldo Muller:
>> Hello all:
>>
>> I want to generate some hamming distance statistics about a set of
>> strings. As explained in another e-mail in this list, I used the
>> following code to call the
>> functions:
>> (exampl holds the list of strings of size w)
>> filter (\x -> x /= 0) $ map (uncurry hammingX) [(xs, ys) | xs <- exampl,
>> ys <- exampl]
>>
>> I have two hamming functions:
>> -- hamming distance for variable length strings
>> hamming :: String -> String -> Int
>> hamming x y = hamming' x y 0
>>     where
>>       hamming' [] _ !c = c
>>       hamming' _ [] !c = c
>>       hamming' (x:xs) (y:ys) !c
>>           | x == y = hamming' xs ys c
>>           | otherwise = hamming' xs ys (c + 1)
>>
>> -- function posted in this mailing list
>> hamming2 :: String -> String -> Int
>> hamming2 xs ys = length (filter not (zipWith (==) xs ys))
>>
>> I am executing these functions millions of times and the bottleneck of
>> my program is in them as explained by running in profiling mode with
>> +RTS -K400M -p -RTS
>>
>> The costlier function is the hamming distance
>> COST CENTRE                    MODULE               %time %alloc
>>
>> hamming                        Distances             66.6   41.9
>>
>> It says that it is performing 41% of the allocations. In the case of
>> hamming2 the allocations go as far as 52%.
>
> Allocations are cheap, so that's not necessarily a problem. More important
> is, what's the maximum residency and how much is copied during GC?
> Are you compiling with -O2 ?
>
>> I could understand that
>> there are allocations in "hamming2" because we are creating pairs, but
>> in the case of "hamming" there should be no allocation.
>
> Why not? I don't know how GHC counts allocations, but everytime you go from
> (x:xs) to xs, you need a new pointer to the tail. If that counts as
> allocation, hamming must allocate a lot, too.

Is it really necessary to use Strings?  I think a packed type, e.g.
Vector or ByteString, would be much more efficient here.  Of course
this is only likely to be a benefit if you can move away from String
entirely.

I suspect that "hamming2" would perform better then.

John


More information about the Haskell-Cafe mailing list