[Haskell-cafe] mapM is supralinear?

Tue Sep 27 14:38:14 CEST 2011

Ketil, I suppose your argument is correct for some implementations of
GC (hopefully not the ones I use). However, a trivial optimisation of
running GC with a frequency logarithmic in the (allocation rate / heap
size) seems to make almost any kind of GC amortized O(1) while keeping
the total heap bounded within constant factor of the reachable heap
size. So, we optimize the (1.) in your algorithm and in case of mapM
we should get a logarithmic (instead of linear) number of GCs with
exponentially (instead of linearly) increasing costs reaching O(N) in
the end and totalling to O(N) too!

Does anyone know if such worst-case complexity precautions are taken
in GHC? If not, why?

Malcolm, I fail to see how "They are the same thing" is compatible
with "Indeed I would". They together imply that O(1) (GC amortized
over the amount allocated) and O(+inf) (GC amortized over the amount
reclaimed) are the same thing. Also, I don't see how OOM condition is
relevant here. I may have enough memory for a lot of useful things
even without GC.

On 27 September 2011 12:42, Ketil Malde <ketil at malde.org> wrote:
>
> Here's my feeble understanding of GC:
>
> 1. GC runs when after some specified amount of allocations
> 2. GC runs in time proportional to the live heap, which it needs to
>   traverse.
>
> This means that for a long running mapM, like any other computation
> generating a long list, (1) GC will run a number of times proportional to
> the length of the list, and (2) each run will have a cost
> proportional to the length of the list.  I.e. a linear algorithm is now
> quadratic.
>
> A lazy mapM (or mapM_), consuming the list as fast as it is generated,
> will of course keep the list short/heap small, and thus the cost of each
> GC is constant (for some value of "constant").
>
> I suppose generational GC will help in practice, since the young
> generation gets to start near the end of the list, but it will still be
> linear in generated length, and you still need major GCs too,
> occasionally.
>
> Also, I guess mapM is more vulnerable to this, since the operations (IO,
> say) involved in building the list likely do short-lived allocations,
> triggering more GCs than simpler, pure computations would.
>
> Do let me know if this is probably a terribly naive view.
>
> -k
> --
> If I haven't seen further, it is by standing in the footprints of giants
>

On 27 September 2011 12:35, Malcolm Wallace <malcolm.wallace at me.com> wrote:
>
> On 27 Sep 2011, at 11:23, Arseniy Alekseyev wrote:
>
>> Malcolm, one should amortize the cost of the collection over the
>> amount of free space allocated rather than recovered
>
> They are the same thing.  You can only allocate from the space that has been recovered.  It is true that generational GC has a nursery area of largely constant size, which is always used for fresh allocation, but that is usually considered an optimisation (albeit a considerable one), which does not fundamentally change the underlying asymptotic costs of the major collections.  When you have large heap residency, the proportion of time spent in GC increases.
>
>> (there are cases
>> when no space is recovered, would you call the GC cost infinite
>> then?).
>
> Indeed I would.  When that happens, usually the program aborts without completing its computation, so the computation is infinitely delayed.
>
> Regards,
>    Malcolm
>