[Haskell-cafe] hamming distance allocation

Mon Apr 19 16:48:28 EDT 2010

Daniel thank you for all your advice.

An additional ! bang pattern in convertIntToDouble fixed the issue! Also
using a foldl'
did the trick.

Now the program runs as it should with a constant amount of memory and in a
very small amount of time.

I believe these problems are one of the major sources of frustration for
Haskell newbies. Things that could work in <X> language easily suddenly
become problems in Haskell. When you overcome these issues then you feel
happy again that you chose Haskell as the main programming language of your
research project.

Is there any guide that explains more about the "bad consumption pattern".
Are there any general rules defined to avoid these issues? It helped me to
re-read the chapter on profiling in the Real World Haskell book to sorta
understand the problem. Is there a more detailed definition of the problem
than in RWH?

Regards,

Arnoldo

On Tue, Apr 20, 2010 at 2:49 AM, Daniel Fischer <daniel.is.fischer at web.de>wrote:

> Am Montag 19 April 2010 17:53:27 schrieb Arnoldo Muller:
> > Hello Daniel:
> >
> > My % GC time is : 75.0%  (81.4% elapsed) and I am compiling with -O2.
>
> Very bad. Can I see the code?
>
> > Thank you for clarifying about the pointers.
>
> Not to forget the Ints for counting.
>
> >
> > Slowly my memory grows up and eventually it explodes. I would expect
> > that the list comprehension is lazily evaluated and therefore at any
> > given time I am only executing one hamming distance. The result of the
> > hamming distance is stored into a small statistics datatype I built
> > (only stores sums and sum of squares and the counts). This datatype is
> > updated using a foldr.
>
> That might very well be the problem, if you update it with a foldr, you
> must construct the entire list of 2000^2 hamming-thunks before the work can
> begin.
> It's probably better to use foldl' (and make the type strict) so you can
> start the work immediately.
>
> >
> > I have no idea where the leak is. What do you see in a .prof file to
> > find a leak (hamming distance has the largest amount of time and %alloc)
>
> For finding leaks, heap profiling (-h*) gives more info than -p. The .prof
> says more about where you spend your time than what hangs on to memory.
>
> >  ? From my .prof file where would you start looking at?
>
> - use hamming instead of hamming2
> - convertIntToDouble looks suspicious
> - calculating a few million Hamming distances takes some time, but what
> about getMyStats, should that really take 25%?
>
>
> > > > filter (\x -> x /= 0) $ map (uncurry hammingX) [(xs, ys) | xs <-
> > > > exampl, ys <- exampl]
> > > >
>
> filter (/= 0) [hamming xs ys | xs <- example, ys <- example]
>
> And of course, you can trivially avoid half of the work.
>
> >
> >
> > Best Regards,
> >
> > Arnoldo Muller
> >
> > On Mon, Apr 19, 2010 at 3:18 AM, Daniel Fischer
> <daniel.is.fischer at web.de>wrote:
> > > Am Montag 19 April 2010 01:03:14 schrieb Arnoldo Muller:
> > > > Hello all:
> > > >
> > > > I want to generate some hamming distance statistics about a set of
> > > > strings. As explained in another e-mail in this list, I used the
> > > > following code to call the
> > > > functions:
> > > > (exampl holds the list of strings of size w)
> > > > filter (\x -> x /= 0) $ map (uncurry hammingX) [(xs, ys) | xs <-
> > > > exampl, ys <- exampl]
> > > >
> > > > I have two hamming functions:
> > > > -- hamming distance for variable length strings
> > > > hamming :: String -> String -> Int
> > > > hamming x y = hamming' x y 0
> > > >     where
> > > >       hamming' [] _ !c = c
> > > >       hamming' _ [] !c = c
> > > >       hamming' (x:xs) (y:ys) !c
> > > >
> > > >           | x == y = hamming' xs ys c
> > > >           | otherwise = hamming' xs ys (c + 1)
> > > >
> > > > -- function posted in this mailing list
> > > > hamming2 :: String -> String -> Int
> > > > hamming2 xs ys = length (filter not (zipWith (==) xs ys))
> > > >
> > > > I am executing these functions millions of times and the bottleneck
> > > > of my program is in them as explained by running in profiling mode
> > > > with +RTS -K400M -p -RTS
> > > >
> > > > The costlier function is the hamming distance
> > > > COST CENTRE                    MODULE               %time %alloc
> > > >
> > > > hamming                        Distances             66.6   41.9
> > > >
> > > > It says that it is performing 41% of the allocations. In the case of
> > > > hamming2 the allocations go as far as 52%.
> > >
> > > Allocations are cheap, so that's not necessarily a problem. More
> > > important is, what's the maximum residency and how much is copied
> > > during GC? Are you compiling with -O2 ?
> > >
> > > > I could understand that
> > > > there are allocations in "hamming2" because we are creating pairs,
> > > > but in the case of "hamming" there should be no allocation.
> > >
> > > Why not? I don't know how GHC counts allocations, but everytime you go
> > > from (x:xs) to xs, you need a new pointer to the tail. If that counts
> > > as allocation, hamming must allocate a lot, too.
> > >
> > > > How can I execute my hamming functions without allocating memory?
> > > >
> > > > Best regards,
> > > >
> > > > Arnoldo Muller
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.haskell.org/pipermail/haskell-cafe/attachments/20100419/8c602cee/attachment.html