Profiling and Data.HashTable

Jan-Willem Maessen jmaessen at
Fri Oct 14 15:30:44 EDT 2005

On Oct 14, 2005, at 10:17 AM, Ketil Malde wrote:

> Hi all,
> I have a program that uses hash tables to store word counts.  It can
> use few, large hash tables, or many small ones.  The problem is that
> it uses an inordinate amount of time in the latter case, and
> profiling with +RTS -sstderr shows it is GC that is causing it
> (accounting for up to 99% of the time!)
> Is there any reason to expect this behavior?
> Heap profiling shows that each hash table seems to incur a memory
> overhead of approx 5K, but apart from that, I'm not able to find any
> leaks or unexpected space consumption.

That "5K" number made me immediately suspicious, so I took a look at  
the source code to Data.HashTable.  Sure enough, it's allocating a  
number of large IOArrays, which are filled with pointers.  The  
practical upshot is that, for a hash table with (say) 24 entries, the  
GC must scan an additional 1000 pointers and discover that each one  
is [].

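In case anyone wants to reproduce this, something along these lines
(a quick, untested sketch; the table count and sizes are pulled out
of the air) should show +RTS -sstderr attributing most of the run to
GC once the tables are kept live:

import qualified Data.HashTable as H

main :: IO ()
main = do
  -- Lots of small tables, all kept live, so every collection has to
  -- re-scan each table's large, mostly-empty pointer array.
  tables <- mapM mkTable [1 .. 10000 :: Int]
  counts <- mapM (fmap length . H.toList) tables
  print (sum counts)
  where
    mkTable i = do
      t <- H.new (==) H.hashInt
      mapM_ (\j -> H.insert t (i * 100 + j) j) [1 .. 24 :: Int]
      return t
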
I've seen other implementations of this two-level technique which use  
a smallish sEGMENT_SIZE in order to avoid excessive GC overhead for  
less-than-gigantic hash tables.  This might be worth doing in the  
Data.HashTable implementation.

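To make that concrete, the shape I have in mind is roughly this (just
a sketch of the layout, not the actual Data.HashTable internals; the
names and the constant are invented):

import Data.Array.IO (IOArray, newArray)
import Data.IORef (IORef, newIORef)

sEGMENT_SIZE :: Int
sEGMENT_SIZE = 32   -- invented here; the real constant is much larger

-- Each segment is a small array of buckets (association lists).
type Segment k v = IOArray Int [(k, v)]

data Table k v = Table
  { segments :: IORef [Segment k v]  -- spine, grown one segment at a time
  , count    :: IORef Int            -- total number of entries
  }

-- A fresh table allocates just one short segment, so a tiny table
-- gives the GC only sEGMENT_SIZE slots to scan, not a thousand.
newTable :: IO (Table k v)
newTable = do
  seg  <- newArray (0, sEGMENT_SIZE - 1) []
  segs <- newIORef [seg]
  n    <- newIORef 0
  return (Table segs n)
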
[Curious: what (if anything) is being used to test Data.HashTable?   
I'd be willing to undertake very small amounts of fiddling if I could  
be sure I wasn't slowing down something which mattered.]

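Failing that, even a crude harness like the one below (untested;
plain base only, and note getCPUTime reports picoseconds) would catch
gross regressions from fiddling with segment sizes:

import qualified Data.HashTable as H
import System.CPUTime (getCPUTime)

-- Time an action in CPU milliseconds.
timed :: String -> IO () -> IO ()
timed label act = do
  t0 <- getCPUTime
  act
  t1 <- getCPUTime
  putStrLn (label ++ ": " ++ show ((t1 - t0) `div` 1000000000) ++ " ms CPU")

main :: IO ()
main = do
  t <- H.new (==) H.hashInt
  timed "insert 100000" (mapM_ (\i -> H.insert t i i) [1 .. 100000 :: Int])
  timed "lookup 100000" (mapM_ (H.lookup t) [1 .. 100000 :: Int])
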
-Jan-Willem Maessen

> Suggestions?
> -k
> -- 
> If I haven't seen further, it is by standing in the footprints of  
> giants