Performance degradation when factoring out common code

Sat Sep 9 08:00:02 UTC 2017

Do you have the code?

On Sat, Sep 9, 2017 at 6:05 AM, Harendra Kumar <harendra.kumar at gmail.com> wrote:
> While trying to come up with a minimal example I discovered one more
> puzzling thing. runghc is fastest, ghc is slower, ghc with optimization is
> slowest. This is completely reverse of the expected order.
>
> ghc -O1 (-O2 is similar):
>
> time                 15.23 ms   (14.72 ms .. 15.73 ms)
>
> ghc -O0:
>
> time                 3.612 ms   (3.548 ms .. 3.728 ms)
>
> runghc:
>
> time                 2.250 ms   (2.156 ms .. 2.348 ms)
>
>
> I am grokking it further. Any pointers will be helpful. I understand that
> -O2 can sometimes be slower e.g. aggressive inlining can sometimes be
> counterproductive. But 4x variation is a lot and this is the case with -O1
> as well which should be relatively safer than -O2 in general. Worst of all
> runghc is significantly faster than ghc. What's going on?
>
> -harendra
>
>
> On 8 September 2017 at 18:49, Harendra Kumar <harendra.kumar at gmail.com>
> wrote:
>>
>> I will try creating a minimal example and open a ticket for the inlining
>> problem, the one I am sure about.
>>
>> -harendra
>>
>> On 8 September 2017 at 18:35, Simon Peyton Jones <simonpj at microsoft.com>
>> wrote:
>>>
>>> I know that this is not an easy request, but can either of you produce a
>>> small example that demonstrates your problem?   If so, please open a ticket.
>>>
>>>
>>>
>>> I don’t like hearing about people having to use trial and error  with
>>> INLINE or SPECIALISE pragmas.  But I can’t even begin to solve the problem
>>> unless I can reproduce it.
>>>
>>>
>>>
>>> Simon
>>>
>>>
>>>
>>> From: ghc-devs [mailto:ghc-devs-bounces at haskell.org] On Behalf Of
>>> Harendra Kumar
>>> Sent: 08 September 2017 13:50
>>> To: Mikolaj Konarski <mikolaj.konarski at gmail.com>
>>> Cc: ghc-devs at haskell.org
>>> Subject: Re: Performance degradation when factoring out common code
>>>
>>>
>>>
>>> I should also point out that I saw performance improvements by manually
>>> factoring out and propagating some common expressions to outer loops in
>>> performance sensitive paths. Now I have made this a habit to do this
>>> manually. Not sure if something like this has also been fixed with that
>>> ticket or some other ticket.
>>>
>>>
>>>
>>> -harendra
>>>
>>>
>>>
>>> On 8 September 2017 at 17:34, Harendra Kumar <harendra.kumar at gmail.com>
>>> wrote:
>>>
>>> Thanks Mikolaj! I have seen some surprising behavior quite a few times
>>> recently and I was wondering whether GHC should do better. In one case I had
>>> to use SPECIALIZE very aggressively, in another version of the same code it
>>> worked well without that. I have been doing a lot of trial and error with
>>> the INLINE/NOINLINE pragmas to figure out what the right combination is.
>>> Sometimes it just feels like black magic, because I cannot find a rationale
>>> to explain the behavior. I am not sure if there are any more such problems
>>> lurking in, perhaps this is an area where some improvement looks possible.
>>>
>>>
>>>
>>> -harendra
>>>
>>>
>>>
>>>
>>>
>>> On 8 September 2017 at 17:10, Mikolaj Konarski
>>> <mikolaj.konarski at gmail.com> wrote:
>>>
>>> Hello,
>>>
>>> I've had a similar problem that's been fixed in 8.2.1:
>>>
>>> https://ghc.haskell.org/trac/ghc/ticket/12603
>>>
>>> You can also use some extreme global flags, such as
>>>
>>> ghc-options: -fexpose-all-unfoldings -fspecialise-aggressively
>>>
>>> to get most the GHC subtlety and shyness out of the way
>>> when experimenting.
>>>
>>> Good luck
>>> Mikolaj
>>>
>>>
>>>
>>>
>>> On Fri, Sep 8, 2017 at 11:21 AM, Harendra Kumar
>>> <harendra.kumar at gmail.com> wrote:
>>> > Hi,
>>> >
>>> > I have this code snippet for the bind implementation of a Monad:
>>> >
>>> >     AsyncT m >>= f = AsyncT $ \_ stp yld ->
>>> >         let run x = (runAsyncT x) Nothing stp yld
>>> >             yield a _ Nothing  = run $ f a
>>> >             yield a _ (Just r) = run $ f a <> (r >>= f)
>>> >         in m Nothing stp yield
>>> >
>>> > I want to have multiple versions of this implementation parameterized
>>> > by a
>>> > function, like this:
>>> >
>>> > bindWith k (AsyncT m) f = AsyncT $ \_ stp yld ->
>>> >     let run x = (runAsyncT x) Nothing stp yld
>>> >         yield a _ Nothing  = run $ f a
>>> >         yield a _ (Just r) = run $ f a `k` (bindWith k r f)
>>> >     in m Nothing stp yield
>>> >
>>> > And then the bind function becomes:
>>> >
>>> > (>>=) = bindWith (<>)
>>> >
>>> > But this leads to a performance degradation of more than 10%. inlining
>>> > does
>>> > not help, I tried INLINE pragma as well as the "inline" GHC builtin. I
>>> > thought this should be a more or less straightforward replacement
>>> > making the
>>> > second version equivalent to the first one. But apparently there is
>>> > something going on here that makes it perform worse.
>>> >
>>> > I did not look at the core, stg or asm yet. Hoping someone can quickly
>>> > comment on it. Any ideas why is it so? Can this be worked around
>>> > somehow?
>>> >
>>> > Thanks,
>>> > Harendra
>>> >
>>>
>>> > _______________________________________________
>>> > ghc-devs mailing list
>>> > ghc-devs at haskell.org
>>> > http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
>>> >
>>>
>>>
>>>
>>>
>>
>>
>
>
> _______________________________________________
> ghc-devs mailing list
> ghc-devs at haskell.org
> http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
>