[GHC] #7258: Compiling DynFlags is jolly slow
GHC
ghc-devs at haskell.org
Mon Nov 13 15:00:08 UTC 2017
#7258: Compiling DynFlags is jolly slow
-------------------------------------+-------------------------------------
Reporter: simonpj | Owner: simonpj
Type: bug | Status: new
Priority: normal | Milestone:
Component: Compiler | Version: 7.6.1
Resolution: | Keywords: deriving-perf
Operating System: Unknown/Multiple | Architecture: Unknown/Multiple
Type of failure: Compile-time performance bug | Test Case:
Blocked By: | Blocking:
Related Tickets: | Differential Rev(s):
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by tdammers):
Replying to [comment:88 simonpj]:
> Hmm. The `Cmm` for `g10` is actually a lot bigger than that for `f10`,
> and looked to me as if it too would scale quadratically.
It would, but for different reasons, and I believe I can rewrite it to not
scale quadratically (but haven't gotten around to it yet).
> Are `f10` and `g10` accurate reflections of the "scale well" and "scale
> badly" versions of Read?
No; we don't actually have a "scale well" version of `Read` yet. All the
implementations of `Read` that I have produced so far are in the "scale
badly" category, although the degree to which they scale badly differs
somewhat. We have two kinds of "scale well" examples:
1. Trivial ones, where rather than nesting code such that chains of closures
appear, we just call `read` or something else N times, but in benign ways;
these examples do not have much explanatory value, and I merely included
them to establish a performance baseline.
2. Well-behaved nested / chained examples, such as the `getline-appl` one;
however, none of these do anything equivalent to `read`. (Rough sketches of
both shapes follow below.)
>
> I can see why things might scale badly, if we have a lot of
> {{{
> p1 >>= \x ->
> p2 >>= \y ->
> p3 >>= \z ->
> return (K x y z)
> }}}
> And `f10` has that structure, albeit with data constructors rather than
> functions. BUT: if that's it, why doesn't `(>>=)` inline, so we get a
> case expression instead??
I believe the `(>>=)` does in fact inline; `f10` is the kind of shape you
get when you inline the `(>>=)` from `ReadP`. `(>>=)` from other monadic
types does not introduce this kind of shape, which would explain why
`Read` misbehaves but `getLine` over `IO` does not.
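To illustrate the shape I mean: ReadP is a continuation-passing parser, and
inlining a CPS-style `(>>=)` is what produces the nested, ever-growing
closures of `f10`. Below is a minimal stand-in for that structure (not the
real `ReadP` code from base, just enough to show where the closure capture
comes from):
{{{
{-# LANGUAGE RankNTypes #-}

-- A tiny CPS parser in the spirit of ReadP.
newtype P a =
  P { runP :: forall b. (a -> String -> [(b, String)]) -> String -> [(b, String)] }

bindP :: P a -> (a -> P b) -> P b
bindP (P m) f = P (\k -> m (\a -> runP (f a) k))

returnP :: a -> P a
returnP a = P (\k -> k a)

item :: P Char
item = P (\k s -> case s of { (c:cs) -> k c cs; [] -> [] })

-- After inlining bindP, a three-field reader has this shape: each inner
-- lambda captures every previously parsed result plus the continuation,
-- so the closures grow with the number of fields, just like f10.
three :: P (Char, Char, Char)
three =
  P (\k ->
    runP item (\c1 ->
    runP item (\c2 ->                 -- captures c1 and k
    runP item (\c3 ->                 -- captures c1, c2 and k
    runP (returnP (c1, c2, c3)) k))))
}}}
A `(>>=)` that inlines to plain case expressions within one function body,
as the state-token-passing one for `IO` does, never builds such a chain of
closures, which fits `getline-appl` behaving well.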
>
> So if `f10` is reasonably accurate, I can see two ways forward:
>
> 1 Do something in the back end to generate less code.
>
> 2 Generate different code. I'm still intrigued (and a bit disbelieving)
> why a different exposition of Read generates code that scales better.
See above; `Read` always misbehaves here, so your disbelief is entirely
justified. The bad behavior seems to come from the particular
implementation of `(>>=)` rather than from inlining as such; or, more
accurately, inlining *this* `(>>=)` is the problem: if we didn't inline it,
then `read-appl` would compile to something equivalent to the generic
applicative example, which behaves nicely.
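A quick way to check that, which I have not tried yet, would be to route
the code through a wrapper that the simplifier is not allowed to inline
(the helper below is hypothetical):
{{{
import Text.ParserCombinators.ReadP (ReadP)

-- Hypothetical helper: using this instead of a bare (>>=) should keep the
-- CPS structure of ReadP's bind hidden from the simplifier.
bindOpaque :: ReadP a -> (a -> ReadP b) -> ReadP b
bindOpaque = (>>=)
{-# NOINLINE bindOpaque #-}
}}}
If compile times with such a wrapper matched the generic applicative
example, that would confirm that it is specifically the inlining of ReadP's
`(>>=)` that triggers the bad scaling.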
> Concerning (1) the kind of thing I had in mind was more like this
> {{{
> f10 =
>   A (\i0 ->
>   A (\i1 ->
>   A (\i2 ->
>   A (\i3 ->
>   A (\i4 -> let t = (i0,i1,i2,i3,i4) in
>   A (\i5 ->
>   A (\i6 ->
>   A (\i7 ->
>   A (\i8 ->
>   A (\i9 ->
>     case t of (i0,i1,i2,i3,i4) ->
>       N (i0, i1, i2, i3, i4, i5, i6, i7, i8, i9)))))))))))
> }}}
> Now no function closure gets more than 5 free variables. I imagine one
> could do this as an STG-to-STG transformation if we decided that this was
> really the core issue.
Further experimentation shows that this doesn't actually achieve anything;
the resulting Core is *exactly the same*, and if you put both functions
into the same compilation unit, they get optimized into expressing one in
terms of the other (I'm guessing here that CSE is smart enough to figure
out that they are equivalent). I have written 3 versions of `f10`, one as-
is, one (`g10`) using the suggested 5-tuple optimization, and one (`h10`)
using a nested-tuple optimization, and I end up with Core that has
literally these lines:
{{{
f10 = A f1
g10 = f10
h10 = f10
}}}
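For reference, a reconstruction of the kind of source involved (the data
type, field types, and module name are guesses for illustration; the exact
test modules differ):
{{{
-- Guessed data type, just enough to exhibit the effect.
data T = A (Int -> T) | N (Int, Int, Int, Int, Int, Int, Int, Int, Int, Int)

f10 :: T
f10 =
  A (\i0 -> A (\i1 -> A (\i2 -> A (\i3 -> A (\i4 ->
  A (\i5 -> A (\i6 -> A (\i7 -> A (\i8 -> A (\i9 ->
  N (i0, i1, i2, i3, i4, i5, i6, i7, i8, i9)))))))))))

-- g10: the same function, with the suggested 5-tuple grouping.
g10 :: T
g10 =
  A (\i0 -> A (\i1 -> A (\i2 -> A (\i3 -> A (\i4 ->
  let t = (i0, i1, i2, i3, i4) in
  A (\i5 -> A (\i6 -> A (\i7 -> A (\i8 -> A (\i9 ->
  case t of
    (j0, j1, j2, j3, j4) ->
      N (j0, j1, j2, j3, j4, i5, i6, i7, i8, i9)))))))))))

-- Comparing the optimised Core of the two, e.g. with
--   ghc -O2 -fforce-recomp -ddump-simpl -dsuppress-all F10.hs
-- should show whether they collapse to a single binding, as in the Core
-- quoted above.
}}}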
> A tantalising point is this. We have something like this:
> {{{
>   let f4 = [i0,i1,i2,i3] \i4 ->
>     let f5 = [i0,i1,i2,i3,i4] \i5 -> ...blah...
> }}}
> The list before the `\` are the free vars of the closure.
>
> Inside the code for `f4` we have `f4`'s function closure in hand, with
> `i0...` in it. Rather than capturing `i0, i1...` as the free vars of `f5`,
> we could just store a pointer to `f4`'s function closure (it's a kind of
> tuple, after all) and fetch its free vars from there. It's as if we
> wanted to say
> {{{
>   let f4 = [i0,i1,i2,i3] \i4 ->
>     let f5 = [f4, i4] \i5 -> ...blah...
> }}}
> with `f4` and `i4` being the free vars of `f5`'s closure. This would be
> a deeper change to the code generator, but it could make a lot of sense in
> cases where ''all'' of `f4`'s free variables are also free in `f5`. (If
> not all were, you might get a space leak by closing over `f4`.)
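At the source level, the difference between the two capture strategies can
be modelled roughly like this (purely an illustration of the idea, not what
the code generator would actually emit):
{{{
-- Flat capture: every closure copies all earlier variables, so over a
-- chain of n closures the total environment size grows quadratically.
flatCapture :: Int -> Int -> Int -> Int -> Int -> (Int -> Int)
flatCapture i0 i1 i2 i3 i4 =
  \i5 -> i0 + i1 + i2 + i3 + i4 + i5   -- free vars: i0..i4

-- Chained capture: the inner closure keeps one pointer to the outer
-- environment (a record standing in for f4's function closure) plus its
-- own new variable, so each closure stays constant-sized.
data Env4 = Env4 Int Int Int Int

chainedCapture :: Env4 -> Int -> (Int -> Int)
chainedCapture env i4 =
  \i5 -> case env of
           Env4 i0 i1 i2 i3 -> i0 + i1 + i2 + i3 + i4 + i5
  -- free vars: env and i4 only, i.e.  let f5 = [f4, i4] \i5 -> ...
}}}
If only some of `f4`'s free variables were also free in `f5`, the chained
version would keep the extras alive, which is exactly the space-leak caveat
above.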
--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/7258#comment:89>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler