Profiling and Data.HashTable

Simon Marlow simonmar at microsoft.com
Fri Oct 14 11:29:37 EDT 2005


On 14 October 2005 15:17, Ketil Malde wrote:


> I have a program that uses hash tables to store word counts.  It can
> use few, large hash tables, or many small ones.  The problem is that
> it uses an inordinate amount of time in the latter case, and
> profiling and the +RTS -sstderr output show that it is GC causing it
> (accounting for up to 99% of the run time!).
> 
> Is there any reason to expect this behavior?
> 
> Heap profiling shows that each hash table seems to incur a memory
> overhead of approximately 5KB, but apart from that, I'm not able to
> find any leaks or unexpected space consumption.
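
[For reference, a minimal version of the kind of counting loop being
described might look like this. This is a sketch only: the actual
program isn't shown in the thread, and it assumes the standard
Data.HashTable interface.

    import qualified Data.HashTable as H

    -- Build a single mutable hash table of word counts.
    wordCounts :: [String] -> IO (H.HashTable String Int)
    wordCounts ws = do
        t <- H.new (==) H.hashString
        let bump w = do
                mn <- H.lookup t w
                case mn of
                    Nothing -> H.insert t w 1
                    Just n  -> H.update t w (n + 1) >> return ()
        mapM_ bump ws
        return t
]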

I'm not certain that this is your problem, but hash tables are by
definition mutable objects, and mutable objects don't interact well
with generational garbage collection, GHC's in particular: every GC,
even a minor one, has to scan all of the mutable objects.
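
To see the effect, a small stress test along these lines (hypothetical
code, not from the thread) should show GC time growing with the number
of live tables:

    import qualified Data.HashTable as H

    main :: IO ()
    main = do
        -- Many small, long-lived mutable tables.
        tables <- mapM (\i -> do t <- H.new (==) H.hashInt
                                 H.insert t i i
                                 return t)
                       [1 .. 10000 :: Int]
        -- Keep doing work so that minor GCs happen; each minor GC
        -- has to rescan every live mutable table.
        mapM_ (\_ -> mapM_ (\t -> H.lookup t 0) tables)
              [1 .. 100 :: Int]

Running this with +RTS -sstderr, and then again with 10 tables instead
of 10000, should show the GC time scaling with the number of live
tables rather than with the amount of useful work done.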

GHC doesn't make any attempt to record which mutable objects have
actually been mutated since the last GC.  This is a compromise: it makes
writing to a mutable object faster, at the expense of GC time later if
you have lots of mutable objects.  We haven't tried the alternative
approach (tracking writes with a write barrier), but it's almost
certainly a case of swings and roundabouts: programs with lots of
infrequently-mutated data would benefit, while programs with a few
often-mutated objects would lose.  I guess we currently optimise for
the "few often-mutated objects" case.
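
One way to sidestep the per-GC scanning cost entirely (a sketch of an
alternative, not something discussed in this thread) is to keep the
counts in an immutable structure such as Data.Map, so there are no
mutable objects for the collector to revisit:

    import Data.List (foldl')
    import qualified Data.Map as M

    -- Pure word counting: superseded versions of the map are
    -- ordinary immutable heap data that die young, which is the
    -- case a generational collector handles well.
    wordCounts :: [String] -> M.Map String Int
    wordCounts = foldl' (\m w -> M.insertWith (+) w 1 m) M.empty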

Cheers,
	Simon

