Compiler optimizations questions for ghc 6.10...

Wed Feb 18 04:29:42 EST 2009

2009/2/18 Tyson Whitehead <twhitehead at gmail.com>:
> On February 17, 2009 19:24:44 Max Bolingbroke wrote:
>> 2009/2/17 Tyson Whitehead <twhitehead at gmail.com>:
>> > (compiled with ghc 6.10 with options -O2 -ddump-simpl)
>
> That should have been -ddump-stranal instead of -ddump-simpl.

Right. Mystery solved. In case you were wondering, the reason the
other two lambdas in "digit" are not worker/wrappered is because GHC
only performs strictness analysis on lambdas that occur /immediately/
in the binding. So increasing the arity of digit is CRUCIAL to getting
your program to compile well, as that will cause:
* Strictness analysis on the inner lambdas
* Inlining of lvl_s1mF
* Fewer updateable thunks to be created by callers (as we won't try to
cache the results of a partial application)

>> > I was wondering why lvl_s1mF is not being inlined into a_s1Gv in the core
>> > at the bottom of this email as that is the only place it is ever
>> > referenced.
>>
>> The relevant GHC code is SimplUtils.preInlineUnconditionally. It looks
>> like it dosen't get inlined for two reasons:
>> 1) It's not a manifest lambda (it's an application) so inlining inside
>> another lambda would change the number of times the FVs of lvl_s1mF
>> might occur
>
> I have to confess my ignorance here as my google fu failed and so I still
> don't know what a manifest lambda is (other than not a application).  : )

Sorry :-). Manifest lambdas are just lambdas that occur manifestly! So e.g.:

f = \x -> g x

HAS a manifest lambda, whereas:

f' = g

Does NOT have one. No lambda immediately starts the binding.

Lambdas, manifest or otherwise, are important in this context because
any value with a leading lambda is certainly very cheap and so can be
inlined inside other lambdas - which is what you are trying to achieve
with lvl_s1mF.

(snip code and explanation)

> I just finished adding the Parse q Int type to help with email line wrapping.
> As I alluded to in my original email, if I don't have the Int overflow check
> in digit, it is not chosen as the loop breaker, all the StateT stuff is
> compiled away, and you get a really nice efficient assembler loop (which is
> important because the final FSM has to actually chew through GBs of data).
>
> The part of the code under the first lambda in digit is as follows (I didn't
> keep the original dump, so the uniques have changed here).  It's the second
> part of the Int overflow bounds check (i.e., y <= (maxBound-x)`div`10), and,
> indeed, something you don't want to compute unless the easy check fails.

Yes - GHC wants to share the work of (maxBound-x)`div`10 between
several partial applications of "digit". This is usually a good idea,
but in this case it sucks because it's resulted in a massively
increased arity. IMHO GHC should fix this by:
* Marking divInt# INLINE in the base library. This would result in
your code would just containing uses of quotInt#
* Making some operations cheap even if they may fail
(PrimOp.primpOpIsCheap should change). Though this might mean that we
turn non-terminating programs into terminating ones (such operations
get pushed inside lambdas) but this is consistent with our treatment
of lambdas generally.

Actually, your divInt# call wouldn't even usually be floated out to
between two lambdas, but at the time FloatOut runs there is something
in between the \x lambda and the lambdas from the state monad - the
monadic bind operator! So FloatOut feels free to move the computation
for "x" up even though that >>= will go away as soon as we run the
simplifier. What a disaster!

For me, making digit INLINE fixes all this, but that's probably
because the context of my code is not the same as yours (I had to
invent parts of a program to bind the bs, q and n variables). For your
immediate problem, I would suggest this bit of GHC witchdoctory:

                 where digit :: Int -> ParseInt Int Int
                       digit !x = do !y <- lift get
                                     ( if y <= (maxBound-9)`quot`10 ||
y <= ghc_hacks
                                       then let !y' = y*10+x in (lift
$ put y') >> s2
                                       else throwError "integer overflow" )
                          where {-# INLINE ghc_hacks #-}
                                ghc_hacks = (maxBound-x)`div`10

With luck, this should make the loop-invariant cheap in GHC's eyes,
preventing it from trying to share it. It works for me - but like I
say things may differ in your program.

>> In general, the -ddump-inlinings flag is useful for working out why
>> something wasn't inlined - but it wouldn't have helped you in this
>> case, because it only dumps information about inlining at call sites,
>> and you actually want an unconditional inlining to occur.
>
> I also tried that, and didn't have much luck with it.  I didn't understand the
> output, which there was 48k lines worth of, and the uniques kept changing
> which made it hard to grep for names from previous -ddump-simpl runs.

Right. I tend to use -ddump-inilnings and -dverbose-core2core (or
-ddump-simpl) together, and I have a little script that partitions the
output up by the "=====" seperators. Then I can look at the output  of
a phase (which is split into it's own file) and if I see something
that should have been inlined I can go to the output of the previous
simplifier run (also in it's own file) and grep for it's name in the
-ddump-inlinings output of that phase. This means you don't have to
match uniques between two entierly seperate runs, although annoyingly
the output format for identifier in -ddump-inlinings is slightly
different from that in Core output so grepping is not as easy as it
could be.

If you are interested in trying my script, it's available via Git at
http://github.com/batterseapower/scripts/blob/da1f24ba16c27e3994aa66f9db352ec1102c39d2/ghc-dump-split
and is called as "ghc-dump-split ghc-dump-file", where ghc-dump-file
results from redirection of GHC stdout+stderr to a file.

Hope all that helps,
Max