[Haskell-cafe] Optimizing hash array mapped tries
Edward Z. Yang
ezyang at MIT.EDU
Mon Feb 22 02:15:28 EST 2010
Hello all,
I spent this weekend attempting to implement Phil Bagwell's Hash
Array Mapped Tries in Haskell. You can view the fruits of this work at
this repository:
http://github.com/ezyang/hamt
Unfortunately, I'm getting pretty abysmal performance out of the
implementation I wrote, so I'm hoping to get some performance tuning
insights here.
My cost centers look about like this:
COST CENTRE    MODULE      time%  alloc%
i-Full-update  HAMT         29.1    31.4
lookup'        HAMT         13.7     4.3
main           Main         12.8    19.7
subkey         HAMT         11.5    10.0
i-Bitmap       HAMT          9.4    12.0
i-Full         HAMT          8.1    12.9
insert'        HAMT          5.1     0.6
popCount       PopCount      4.7     1.0
And my GC stats look like:
./HAMTTest 256000 +RTS -t
275576630962922
<<ghc: 410718260 bytes, 788 GCs, 19646424/46880276 avg/max bytes residency (9 samples),
138M in use, 0.00 INIT (0.00 elapsed), 1.14 MUT (1.13 elapsed), 1.88 GC (2.00 elapsed) :ghc>>
+ 0:03.16
./IntMapTest 256000 +RTS -t
-710684043813
<<ghc: 172126372 bytes, 329 GCs, 5552955/10483896 avg/max bytes residency (9 samples),
31M in use, 0.00 INIT (0.00 elapsed), 0.51 MUT (0.51 elapsed), 0.62 GC (0.67 elapsed) :ghc>>
+ 0:01.18
IntMap is included for comparison. The HAMT runs about 3-5x slower than
IntMap and uses about 4x more memory (and gets correspondingly
penalized when garbage collection occurs).
My observations/suspicions are as such:
* i-Full-update essentially copies a 32-element vector, with a change to
  one element. I think I am getting brutally punished for this, in
  terms of both memory usage and runtime. What I'm curious about is
  whether this is intrinsic to the algorithm, or whether it's
  something particular to how GHC compiles it.
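For concreteness, here is a hedged sketch (not the repository's exact code) of what an update to a full 32-way node amounts to with Data.Vector; `updateFull` is an illustrative name:

```haskell
import qualified Data.Vector as V

-- Updating one child of a full 32-way node with (//) allocates a fresh
-- 32-element vector and copies the other 31 pointers across; there is no
-- in-place mutation in a persistent structure.
updateFull :: V.Vector a -> Int -> a -> V.Vector a
updateFull children ix child = children V.// [(ix, child)]
```

If this sketch matches what i-Full-update does, each insert that hits a Full node costs on the order of 32 words of fresh allocation, which would line up with its high alloc% above.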
* Bagwell suggests that each node in the Array Mapped Trie should take
  only two words. I think I'm losing another two words to storing
  bounds information in the vector, as well as the costs of boxing (though
  I can't tell if GHC is boxing or not). I've done some playing around
  with the data declaration, but its current form seems to result in
  the fastest implementation. I wonder if this is what is causing the
  4x ballooning of memory, or if it's something else. I haven't tried
  rewriting the code to use GHC's arrays.
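For readers following along, a minimal sketch of the node layout and bit-twiddling a HAMT relies on (constructor and function names here are illustrative, not necessarily the repository's, and plain lists stand in for the packed child arrays):

```haskell
import Data.Bits (popCount, shiftR, (.&.), bit)
import Data.Word (Word32)

-- A Bitmap node keeps a 32-bit occupancy bitmap plus a packed list of
-- only the present children; a Full node holds all 32 children.
data HAMT v
    = Empty
    | Leaf !Word32 v
    | Bitmap {-# UNPACK #-} !Word32 [HAMT v]
    | Full [HAMT v]

-- Extract the 5-bit subkey for a given trie level (0, 1, 2, ...).
subkey :: Word32 -> Int -> Int
subkey h level = fromIntegral ((h `shiftR` (5 * level)) .&. 31)

-- Index of a subkey's child within a Bitmap node's packed array:
-- count the occupied slots below it. (Modern GHC exposes popCount in
-- Data.Bits, which can compile to a hardware POPCNT instruction.)
sparseIndex :: Word32 -> Int -> Int
sparseIndex bitmap sk = popCount (bitmap .&. (bit sk - 1))
```

The bitmap-plus-popcount trick is what keeps sparse nodes from paying for 32 slots each, so any per-node overhead the array representation adds (bounds words, boxing) eats directly into that saving.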
* Increasing the heap size seems to generally improve the performance of
  the HAMT; a 1G heap on a 512K sample set reduces the difference to
  about 1.5-2x slower. It's hard to tell if the memory allocation
  system is working for or against me.
What bothers me is that even if I could magically wave away all the GC
time in the HAMT, it would only barely beat IntMap. Maybe the algorithm
is simply not as good as a good ol' Patricia trie; but then again, maybe
not.
Comments and suggestions greatly appreciated.
Thanks,
Edward