suboptimal ghc code generation in IO vs equivalent pure code case

Thu May 12 00:11:50 UTC 2016

On Tue, May 10, 2016 at 4:45 AM, Harendra Kumar
<harendra.kumar at gmail.com> wrote:
> Thanks Dan, that helped. I did notice and suspect the update frame and the
> unboxed tuple but given my limited knowledge about ghc/core/stg/cmm I was
> not sure what is going on. In fact I thought that the intermediate tuple
> should get optimized out since it is required only because of the realworld
> token which is not real. But it might be difficult to see that at this
> level?

The token exists as far as the STG level is concerned, I think,
because that is the only thing ensuring that things happen in the
right order. And the closure must be built to have properly formed
STG. So optimizing away the closure building would have to happen at a
level lower than STG, and I guess there is no such optimization. I'm
not sure how easy it would be to do.

> What you are saying may be true for the current implementation but in theory
> can we eliminate the intermediate closure?

I don't know. I'm a bit skeptical that building this one closure is
the only thing causing your code to be a factor of 5 slower. For
instance, another difference in the core is that the recursive call
corresponding to the result s2 happens before allocating the
additional closure. That is the case statement that unpacks the
unboxed tuple. And the whole loop happens this way, so it is actually
building a structure corresponding to the entire output list in memory
rather eagerly.

By contrast, your pure function is able to act in a streaming fashion,
if consumed properly, where only enough of the result is built to keep
driving the rest of the program. It probably runs in constant space,
while your IO-based loop has a footprint linear in the size of the
input list (in addition to having slightly more overhead per character
because of the one extra thunk), because it is a more eager program.

And having an asymptotically larger memory footprint is more likely to
explain a 5x slowdown than allocating one extra thunk for each list
concatenation, I think. (Of course, it could also be some other
factor, as well.)

You should probably run with +RTS -s (compiling with -rtsopts), and
see if the IO version has a much larger maximum residency.

Anyhow, this is fundamental to writing the algorithm using IO. It's an
algorithm that's a sequence of steps that happen in order, the number
of steps is proportional to the input list, and part of those steps is
putting stuff in memory for the results.

> So why is that? Why can't we generate (++) inline similar to (:)? How do we
> make this decision? Is there a theoretical reason for this or this is just
> an implementation artifact?

The difference between these two is that (++) is a function call, and
(:) is a constructor. Constructors are a special case, and don't need
to have all the provisions that functions in general do. The best way
to learn what the differences are is probably to read the paper about
the STG machine.

Hope this is a more fruitful lead, but it may be that there's not a
lot that can be done, without doing some baroque things to your
algorithm, because of the necessity of its using IO.

-- Dan