[Haskell-cafe] hamming distance allocation

Mon Apr 19 13:49:32 EDT 2010

Am Montag 19 April 2010 17:53:27 schrieb Arnoldo Muller:
> Hello Daniel:
>
> My % GC time is : 75.0%  (81.4% elapsed) and I am compiling with -O2.

Very bad. Can I see the code?

> Thank you for clarifying about the pointers.

Not to forget the Ints for counting.

>
> Slowly my memory grows up and eventually it explodes. I would expect
> that the list comprehension is lazily evaluated and therefore at any
> given time I am only executing one hamming distance. The result of the
> hamming distance is stored into a small statistics datatype I built
> (only stores sums and sum of squares and the counts). This datatype is
> updated using a foldr.

That might very well be the problem, if you update it with a foldr, you 
must construct the entire list of 2000^2 hamming-thunks before the work can 
begin.
It's probably better to use foldl' (and make the type strict) so you can 
start the work immediately.

>
> I have no idea where the leak is. What do you see in a .prof file to
> find a leak (hamming distance has the largest amount of time and %alloc)

For finding leaks, heap profiling (-h*) gives more info than -p. The .prof 
says more about where you spend your time than what hangs on to memory.

>  ? From my .prof file where would you start looking at?

- use hamming instead of hamming2
- convertIntToDouble looks suspicious
- calculating a few million Hamming distances takes some time, but what 
about getMyStats, should that really take 25%?

> > > filter (\x -> x /= 0) $ map (uncurry hammingX) [(xs, ys) | xs <-
> > > exampl, ys <- exampl]
> > >

filter (/= 0) [hamming xs ys | xs <- example, ys <- example]

And of course, you can trivially avoid half of the work.

>
>
> Best Regards,
>
> Arnoldo Muller
>
> On Mon, Apr 19, 2010 at 3:18 AM, Daniel Fischer 
<daniel.is.fischer at web.de>wrote:
> > Am Montag 19 April 2010 01:03:14 schrieb Arnoldo Muller:
> > > Hello all:
> > >
> > > I want to generate some hamming distance statistics about a set of
> > > strings. As explained in another e-mail in this list, I used the
> > > following code to call the
> > > functions:
> > > (exampl holds the list of strings of size w)
> > > filter (\x -> x /= 0) $ map (uncurry hammingX) [(xs, ys) | xs <-
> > > exampl, ys <- exampl]
> > >
> > > I have two hamming functions:
> > > -- hamming distance for variable length strings
> > > hamming :: String -> String -> Int
> > > hamming x y = hamming' x y 0
> > >     where
> > >       hamming' [] _ !c = c
> > >       hamming' _ [] !c = c
> > >       hamming' (x:xs) (y:ys) !c
> > >
> > >           | x == y = hamming' xs ys c
> > >           | otherwise = hamming' xs ys (c + 1)
> > >
> > > -- function posted in this mailing list
> > > hamming2 :: String -> String -> Int
> > > hamming2 xs ys = length (filter not (zipWith (==) xs ys))
> > >
> > > I am executing these functions millions of times and the bottleneck
> > > of my program is in them as explained by running in profiling mode
> > > with +RTS -K400M -p -RTS
> > >
> > > The costlier function is the hamming distance
> > > COST CENTRE                    MODULE               %time %alloc
> > >
> > > hamming                        Distances             66.6   41.9
> > >
> > > It says that it is performing 41% of the allocations. In the case of
> > > hamming2 the allocations go as far as 52%.
> >
> > Allocations are cheap, so that's not necessarily a problem. More
> > important is, what's the maximum residency and how much is copied
> > during GC? Are you compiling with -O2 ?
> >
> > > I could understand that
> > > there are allocations in "hamming2" because we are creating pairs,
> > > but in the case of "hamming" there should be no allocation.
> >
> > Why not? I don't know how GHC counts allocations, but everytime you go
> > from (x:xs) to xs, you need a new pointer to the tail. If that counts
> > as allocation, hamming must allocate a lot, too.
> >
> > > How can I execute my hamming functions without allocating memory?
> > >
> > > Best regards,
> > >
> > > Arnoldo Muller