suboptimal ghc code generation in IO vs equivalent pure code case

Sat May 14 18:31:02 UTC 2016

The difference seems to be entirely due to memory pressure. At list size
1000 both pure version and IO version perform equally. But as the size of
the list increases the pure version scales linearly while the IO version
degrades exponentially. Here are the execution times per list element in ns
as the list size increases:

Size of list  Pure       IO
1000           8.7          8.3
10000         8.7          18
100000       8.8          63
1000000     9.3          786

This seems to be due to increased GC activity in the IO case. The GC stats
for list size 1 million are:

IO case:       %GC     time      66.1%  (61.1% elapsed)
Pure case:   %GC     time       2.6%  (3.3% elapsed)

Not sure if there is a way to write this code in IO monad which can reduce
this overhead.

-harendra

On 14 May 2016 at 17:10, Harendra Kumar <harendra.kumar at gmail.com> wrote:
>
> You are right about the way IO code is generated because of the ordering
semantics. The IO version builds the list entirely in a recursive fashion
before returning it while the pure code builds it lazily. I wrote very
simple versions to keep things simpler:
>
> Pure version:
>
> f [] = []
> f (x : xs) = x : f xs
>
>
> time                11.08 ms   (10.86 ms .. 11.34 ms)
> Measured for a million elements in the list
>
>      104,041,264 bytes allocated in the heap
>           16,728 bytes copied during GC
>           35,992 bytes maximum residency (1 sample(s))
>           21,352 bytes maximum slop
>                1 MB total memory in use (0 MB lost due to fragmentation)
>
>
> IO version:
> f [] = return []
> f (x : xs) = do
>     rest <- f xs
>     return $ x : rest
>
> time                 79.66 ms   (75.49 ms .. 82.55 ms)
>
>      208,654,560 bytes allocated in the heap
>      121,045,336 bytes copied during GC
>       27,679,344 bytes maximum residency (8 sample(s))
>          383,376 bytes maximum slop
>               66 MB total memory in use (0 MB lost due to fragmentation)
>
> Even though this is a small program not doing much and therefore
enhancing even small differences to a great degree, I am not sure if I can
completely explain the difference in slowness of the order of 7.5x by just
the recursive vs lazy building of the list. I am wondering if there is
anything that is worth further investigating and improving here.
>
> -harendra
>
> On 12 May 2016 at 05:41, Dan Doel <dan.doel at gmail.com> wrote:
> >
> > On Tue, May 10, 2016 at 4:45 AM, Harendra Kumar
> > <harendra.kumar at gmail.com> wrote:
> > > Thanks Dan, that helped. I did notice and suspect the update frame
and the
> > > unboxed tuple but given my limited knowledge about ghc/core/stg/cmm I
was
> > > not sure what is going on. In fact I thought that the intermediate
tuple
> > > should get optimized out since it is required only because of the
realworld
> > > token which is not real. But it might be difficult to see that at this
> > > level?
> >
> > The token exists as far as the STG level is concerned, I think,
> > because that is the only thing ensuring that things happen in the
> > right order. And the closure must be built to have properly formed
> > STG. So optimizing away the closure building would have to happen at a
> > level lower than STG, and I guess there is no such optimization. I'm
> > not sure how easy it would be to do.
> >
> > > What you are saying may be true for the current implementation but in
theory
> > > can we eliminate the intermediate closure?
> >
> > I don't know. I'm a bit skeptical that building this one closure is
> > the only thing causing your code to be a factor of 5 slower. For
> > instance, another difference in the core is that the recursive call
> > corresponding to the result s2 happens before allocating the
> > additional closure. That is the case statement that unpacks the
> > unboxed tuple. And the whole loop happens this way, so it is actually
> > building a structure corresponding to the entire output list in memory
> > rather eagerly.
> >
> > By contrast, your pure function is able to act in a streaming fashion,
> > if consumed properly, where only enough of the result is built to keep
> > driving the rest of the program. It probably runs in constant space,
> > while your IO-based loop has a footprint linear in the size of the
> > input list (in addition to having slightly more overhead per character
> > because of the one extra thunk), because it is a more eager program.
> >
> > And having an asymptotically larger memory footprint is more likely to
> > explain a 5x slowdown than allocating one extra thunk for each list
> > concatenation, I think. (Of course, it could also be some other
> > factor, as well.)
> >
> > You should probably run with +RTS -s (compiling with -rtsopts), and
> > see if the IO version has a much larger maximum residency.
> >
> > Anyhow, this is fundamental to writing the algorithm using IO. It's an
> > algorithm that's a sequence of steps that happen in order, the number
> > of steps is proportional to the input list, and part of those steps is
> > putting stuff in memory for the results.
> >
> > > So why is that? Why can't we generate (++) inline similar to (:)? How
do we
> > > make this decision? Is there a theoretical reason for this or this is
just
> > > an implementation artifact?
> >
> > The difference between these two is that (++) is a function call, and
> > (:) is a constructor. Constructors are a special case, and don't need
> > to have all the provisions that functions in general do. The best way
> > to learn what the differences are is probably to read the paper about
> > the STG machine.
> >
> > Hope this is a more fruitful lead, but it may be that there's not a
> > lot that can be done, without doing some baroque things to your
> > algorithm, because of the necessity of its using IO.
> >
> > -- Dan
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/ghc-devs/attachments/20160515/64f3cf73/attachment.html>