[Haskell-cafe] memory usage: general questions

Wed Aug 3 08:06:18 UTC 2016

On 08/03/2016 06:54 AM, Christopher Howard wrote:
> Hi. I've spent the last few days reading about strictness,
> non-strictness, laziness, thunks, and persistence, and to be honest I'm
> still in a bit of a fog. A few questions, which I hope might help me out:
> 
> Supposing I have an expression
> 
> f (f (f b)))
> 
> where  f :: B -> B
> and B is some ginormous persistent data structure.
> and f is an "updating" function outputting some modified copy version of
> it's argument
> 
> So, a few questions:
> 
> 1) Supposing that f does not short-circuit: Is it correct to say that it
> is definitely advantageous (from a memory usage perspective) for f to be
> a strict function? Or is this heavily dependent on whether or not b
> contains unevaluated data structures (e.g., a list)?

I would say, this depends a little bit on the structure of B and how
much f forces of it. Suppose B looks like this:

data StrictList a = Cons a !(StrictList a) | Nil

data B = B Int !Int [Float] (StrictList Double)

Now lets consider some variations of f:

f (B a b c d) = B (a + 1) (b - 1) (2.5 : c) (Cons (fromIntegral a) d)

This version would only be strict in the outer constructor and would
create a B with two fields set to values and 2 to thunks. The first
field is a lazy Int, thus we can not blindly evaluate it, as it could be
_|_. Thus, we create a thunk for (a + 1), which will cost about 3 words
(function pointer and two for each argument) plus 2 words for the second
argument of (+), although the compiler may be able to optimize that away
to a statically allocated version (at least the garbage collector
recognizes small integers and replaces them with references to
statically allocated ones). The second field is strict and depends on a
strict value, thus it will just be computed and stored. If -O or
-funbox-small-strict-fields is set, it should get unpacked to an unboxed
Int#, removing any indirection via pointers. The next field is lazy
again, but we only construct values. Thus, we can construct them
directly and store the result in the field. The tail of the list may
still be lazy. The forth field contains a spine-strict list, although
the field is not marked strict, so forcing it would force the whole
list. However, elements of the list are not strict. So we get a thunk
for (fromIntegral a) (as a was lazy, we can not just compute this) and
another thunk for the constructor, which forces d (if the first element
of the list is evaluated, the whole list is evaluated, so Cons has to
evaluate d).

So is this version of f strict? It is strict in the outer layer, but we
can of course make a stricter version:

f (B !a !b !c !d) = ... (or f (B a b c d) = a `seq` b `seq` ...)

Now f would first evaluate the contents of each field of B. Thus, we
would avoid to create a space leak in the first argument of B, if we
repeatedly apply f. The second field was already strict, so nothing
changes in this case. The third field is a lazy list, thus we only
evalute the head of the list (which may be a thunk forcing additional
values) before constructing our result for this field. The last field
would now force the entire StrictList, which could increase memory usage
a lot (especially if a part of it was a thunk of the form replicate 500
1.0. The next time the StrictList is forced, it will stop after the
first Cons cell, as this being evaluated means the whole list is
evaluated. As a was already evaluated, the compiler might be able to see
that it can compute (fromIntegral a) earlier. This should take about the
same amount of work as constructing a closure, but I don't know whether
GHC does something like this today.

So the second version of f is a lot stricter, could create less thunks,
but force additional elements which would increase memory usage.

Which one is better? It depends on the use case. As you mentioned, if
you have a list or something similar to a list and consume it
incrementally, you only want to force the elements when you get to them.
However, any accumulator you carry during this should probably be
strict. Similar, if you construct a list, it could be beneficial to
force an element before constructing the cons-cell, at least for small
types like Int, and when you know that forcing them can not yield _|_.
In the end it depends heavily on your concrete problem.

> 
> 2) As far as the overall peak size of our memory graph: How will it be
> affected by the size of b, if f is non-strict? and if f is strict?

If f is not strict, it can produce a B without forcing its argument. If
it just uses the argument without evaluating it, it can only store it
inside the B it constructs, so B seems to be something like a list or
tree. Or it can call a function on it, but store the thunk of the
function call in the result, so the argument is only forced if this
thunk is forced. So if f is not strict, it has to increase the memory
usage (or throw the argument away, but this does not seem sensible in
this case). If it is strict, it can extract the relevant fields and
produce a result, maybe dropping the last reference to the argument and
making it eligible for GC.

> 
> 3) Are the "overwritten" portions of each B (i.e., those parts of the
> persistent data we no are no longer needed) guaranteed to be pushed out
> of memory at any particular point? Is this something we know will
> (might?) happen as soon as the GC is run? Or does it happen right when
> each f is applied to it's argument?
> 

Nothing is overwritten. Instead, a new version of B is allocated and the
old one is collected after the last reference is dropped. Maybe this is
best visible if you consider (f b, g b). Overwriting parts of b in f
would change the argument to g. Thus, a copy has to be produced. It
would be valid if the compiler could prove that only one reference
exists and this is passed to f, but first this proof could be difficult
and second a separate version of f would have to be compiled. This may
even reduce performance, as the GC is exploiting the fact that most data
structures are immutable (although thunks are mutable as they are
overwritten by their result). Thus, such an optimization could hurt the
GC, which would result in a global slowdown, further reducing any gains
from overwriting the contents of the argument. Additionally, GCs are
quite cheep if not much data is alive.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 473 bytes
Desc: OpenPGP digital signature
URL: <http://mail.haskell.org/pipermail/haskell-cafe/attachments/20160803/36d4ce3d/attachment.sig>