Using mutable array after an unsafeFreezeArray, and GC details

Thu May 8 23:18:48 UTC 2014

I have an unusual application with some unusual performance problems
and I'm trying to understand how I might use unsafeFreezeArray to help
me, as well as understand in detail what's going on with boxed mutable
arrays and GC. I'm using the interface from 'primitive' below.

First some basic questions, then a bit more background

1) What happens when I do `newArray s x >>= \a-> unsafeFreezeArray a
>> return a` and then use `a`? What problems could that cause?

2) And what if a do a `cloneMutableArray` on `a` and likewise use the
resulting array?

Background: I've been looking into an issue [1] in a library in which
as more mutable arrays are allocated, GC dominates (I think I verified
this?) and all code gets slower in proportion to the number of mutable
arrays that are hanging around.

I've been trying to understand how this is working internally. I don't
quite understand how the "remembered set" works with respect to
MutableArray. As best I understand: the remembered set in generation G
points to certain objects in older generations, which objects hold
references to objects in G. Then for MutableArrays specifically,
card-marking is used to mark regions of the array with garbage in some
way.

So my hypothesis is the slowdown is associated with the size of the
remembered set, and whatever the GC has to do on it. And in my tests,
freezing the array seems to make that overhead (at least the overhead
proportional to number of arrays) disappear.

Now I'm really lost in the woods though. My hope is that I might be
able to safely use unsafeFreezeArray to help me here [3]. Here are the
particulars of how I use MutableArray in my algorithm, which are
somewhat unusual:
  - keep around a small template `MutableArray Nothing`
  - use cloneMutableArray for fast allocation of new arrays
  - for each array only a *single* write (CAS actually) happens at each position

In fact as far as I can reason, there ought to be no garbage to
collect at all until the entire array becomes garbage (the initial
value is surely shared, especially since I'm keeping this template
array around to clone from, right?). In fact I was even playing with
the idea of rolling a new CAS that skips the card-marking stuff.

Any guidance is appreciated.

Thanks,
Brandon

[1]: http://stackoverflow.com/questions/23462004/code-becomes-slower-as-more-boxed-arrays-are-allocated
[2]: http://www.haskell.org/pipermail/glasgow-haskell-users/2012-March/022142.html
[3]: https://ghc.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage/GC/EagerPromotion