Where do I start if I would like to help improve GHC compilation times?

Bartosz Nitka niteria at gmail.com
Tue Apr 18 11:01:20 UTC 2017


Hey Alfredo,

Thanks for taking a look at this. I've looked into this before myself and
collected some notes here:
https://github.com/niteria/notes/blob/master/PrettyPrintingGHC.md#problems
Based on my investigations I concluded that there are places where
pretty-printing Asm and Cmm gets accidentally quadratic.
If I remember correctly, the core of the problem was that even in the "fast"
mode the pretty printer was computing the lengths of subtrees in
`reduceDoc`, which in some cases made the operations quadratic.
I tried implementing a "just append bytes" mode that bypasses `reduceDoc`, but
what stopped me was my lack of understanding of the different semantics of the
three kinds of empty Doc's.
I think if you took the time to understand it, you could implement it in a
way that's not quadratic, reaping a substantial performance win.
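To make this concrete, here is a minimal sketch (toy types of my own, not
GHC's actual Pretty.hs) of what a "just append" fast mode could look like:
rendering becomes a single pass that only ever appends and never measures
sub-documents the way `reduceDoc` does:

```
-- Sketch only: a simplified Doc and a one-pass renderer that never
-- computes the width of a sub-document.
data Doc
  = Empty
  | Text String
  | Doc :<>: Doc   -- horizontal composition
  | Doc :$$: Doc   -- vertical composition

infixr 6 :<>:
infixr 5 :$$:

-- Thread a continuation string through the tree; every step does a
-- constant amount of work plus the cost of the text itself, so the
-- whole render is linear in the size of the output.
renderFast :: Doc -> String
renderFast d = go d ""
  where
    go Empty       k = k
    go (Text s)    k = s ++ k
    go (a :<>: b)  k = go a (go b k)
    go (a :$$: b)  k = go a ('\n' : go b k)
```

The real Doc is of course richer (nesting, unions for the backtracking
mode, and the several empty-document cases I mentioned), but the
append-only skeleton is the part that avoids the quadratic behaviour.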

Cheers,
Bartosz

2017-04-18 11:17 GMT+01:00 Alfredo Di Napoli <alfredo.dinapoli at gmail.com>:

> Hey Simon,
>
> thanks for chiming in. I had the same suspicion, and in fact my first
> temptation was to use dlists to avoid costly concatenation (I guess a
> Builder is built around pretty much the same idea, namely making appends
> cheap). I eventually bailed out because the Pretty module has some
> functions like sepX or fillNBE (I might be wrong, going by memory only
> here) which need to *inspect* the current [Doc] we carry around; dlists
> can’t do that in O(1), so forcing the dlist back into a plain list was
> necessary. Am I right in thinking that a Builder here would suffer from
> the same malady? Ideally I would like constant time for appends, left
> concatenation & inspection (as in pattern-matching inspection), but I
> don’t think what I’m asking for exists, at least not in a purely
> functional incarnation anyway ;)
>
> That’s why I was thinking of giving Deques (or more generally, Sequences)
> a go: they don’t have O(1) appends, but they can at least be inspected in
> O(1). It remains to be seen whether this will yield an improvement, which
> I guess depends on the ratio of inspection to appending we do in the
> pretty printing; if appending dominates, dlists or builders might be the
> better fit.
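> To make the trade-off concrete, here is a toy sketch of mine (nothing from
> Pretty.hs; the Chunk type is made up purely for illustration):
> Data.Sequence gives amortised O(1) snoc, O(1) inspection at either end,
> and O(log(min(n, m))) concatenation:
>
> ```
> import Data.Sequence (Seq, ViewL (..), viewl, (|>), (><))
>
> -- Hypothetical flattened fragment type, just for the example.
> data Chunk = Str String | Newline
>
> snocChunk :: Seq Chunk -> Chunk -> Seq Chunk
> snocChunk = (|>)                -- amortised O(1) append on the right
>
> catChunks :: Seq Chunk -> Seq Chunk -> Seq Chunk
> catChunks = (><)                -- O(log(min(n, m))) concatenation
>
> firstChunk :: Seq Chunk -> Maybe Chunk
> firstChunk s = case viewl s of  -- O(1) inspection, pattern-match style
>   EmptyL -> Nothing
>   c :< _ -> Just c
> ```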
> Last but not least, although the current implementation doesn’t backtrack
> anyway, I’m wondering whether switching to a different representation for
> a Doc (namely a huge composition of functions over Token(s), as described
> in the paper) could be beneficial as well.
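> To sketch the kind of representation I mean (my toy reading of the idea,
> not the paper’s full machinery): a Doc becomes a function over a token
> stream, so composition is just (.) and rendering is a single
> left-to-right pass:
>
> ```
> -- Sketch only: tokens plus a difference list of tokens as the Doc.
> data Token = TText String | TLine
>
> type TDoc = [Token] -> [Token]
>
> text :: String -> TDoc
> text s = (TText s :)
>
> line :: TDoc
> line = (TLine :)
>
> catDoc :: TDoc -> TDoc -> TDoc
> catDoc = (.)                     -- O(1) composition
>
> -- One pass over the emitted tokens; no backtracking, no measuring.
> render :: TDoc -> String
> render d = concatMap tok (d [])
>   where
>     tok (TText s) = s
>     tok TLine     = "\n"
> ```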
>
> Do you guys have any opinions? Yesterday I extracted Pretty.hs from the
> source tree and I’m now planning to produce a criterion benchmark and
> compare different implementations, although it’s a bit hard to predict
> real-world usage, as I don’t have access to a representative Doc as
> produced by GHC, so my benchmarks could all be ill-founded.
>
> Alfredo
>
> On 18 April 2017 at 12:01, Simon Marlow <marlowsd at gmail.com> wrote:
>
>> Pretty-printing the asm is a likely contender for optimisation; however,
>> the problem is not the pretty-printing per se.  We don't actually use any
>> of the backtracking stuff when printing asm, since there's no point nicely
>> indenting things or wrapping lines.  The overhead is probably all in the
>> layer of data structure that we generate in Pretty before it gets dumped
>> into raw bytes.  Using a ByteString Builder instead might yield some
>> improvement.
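>> To illustrate (a rough sketch with made-up types, not the actual
>> Pretty/Outputable API): a flat document for code generation could be
>> emitted straight into a ByteString Builder, so every append is cheap and
>> the bytes are materialised only once at the end.
>>
>> ```
>> import qualified Data.ByteString.Builder as B
>> import qualified Data.ByteString.Lazy as BL
>>
>> -- Hypothetical flat document for code generation: no indentation or
>> -- line fitting, just text, newlines and concatenation.
>> data AsmDoc = AsmText String | AsmLine | AsmCat AsmDoc AsmDoc
>>
>> toBuilder :: AsmDoc -> B.Builder
>> toBuilder (AsmText s)  = B.stringUtf8 s
>> toBuilder AsmLine      = B.charUtf8 '\n'
>> toBuilder (AsmCat a b) = toBuilder a <> toBuilder b
>>
>> renderAsm :: AsmDoc -> BL.ByteString
>> renderAsm = B.toLazyByteString . toBuilder
>> ```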
>>
>> Cheers
>> Simon
>>
>> On 17 April 2017 at 18:44, Alfredo Di Napoli <alfredo.dinapoli at gmail.com>
>> wrote:
>>
>>> Dear all,
>>>
>>> after sprinkling (ehm, littering) GHC source code with cost centres, I
>>> was not surprised to see that roughly 20% of the compilation time (as in
>>> .prof) was spent in the core gen/simplification process (10% of the total
>>> time) and on the asm code gen (another 10%).
>>>
>>> I almost immediately abandoned the idea of trying to optimise some
>>> modules in simplCore (considering my zero knowledge of GHC internals
>>> anyway...), but I have been dwelling on the following: Outputable.hs and
>>> Pretty.hs seem to have been implemented with deliberate use of lists and
>>> concatenations between them, which left me wondering if there was room
>>> for optimisation there. I have found this interesting paper on the topic:
>>>
>>> https://www.cs.kent.ac.uk/pubs/2005/2062/content.pdf
>>>
>>> Now, it’s totally possible that this has already been tried (with no
>>> success), but judging from the original copyright of Pretty.hs (dated
>>> 2001), it seems it was written prior to the work of Olaf Chitil (the
>>> author of the paper).
>>>
>>> TL;DR: I was thinking (even just as a fun exercise to learn more about
>>> GHC internals) of leveraging the ideas of that paper and switching to a
>>> different implementation of `Doc`, coupled with the use of lazy
>>> dequeues, which *might* improve the performance of the codegen and thus
>>> of the compiler overall. Am I fighting a strawman (or flogging a dead
>>> horse, pick your rhetorical figure :D), or is there even a tiny chance
>>> of this being actually useful?
>>>
>>> Have a nice evening,
>>>
>>> Alfredo
>>>
>>> On 11 April 2017 at 00:47, Ben Gamari <ben at smart-cactus.org> wrote:
>>>
>>>> Alfredo Di Napoli <alfredo.dinapoli at gmail.com> writes:
>>>>
>>>> > Hey Ben,
>>>> >
>>>> Hi Alfredo,
>>>>
>>>> Sorry for the late response! The email queue from the weekend was a bit
>>>> longer than I would like.
>>>>
>>>> > as promised I’m back to you with something more articulated and
>>>> > hopefully meaningful. I do hear you perfectly: probably trying to dive
>>>> > head-first into this without at least a rough understanding of the
>>>> > performance hotspots or the overall GHC architecture is going to do me
>>>> > more harm than good (I get the overall picture and I’m aware of the
>>>> > different stages of the GHC compilation pipeline, but that’s far from
>>>> > saying I’m proficient with the architecture as a whole). A couple of
>>>> > years ago I also read the GHC chapter of the “Architecture of Open
>>>> > Source Applications” book, but I don’t know how much of that is still
>>>> > relevant. If it is, I guess I should refresh my memory.
>>>> >
>>>> It sounds like you have done a good amount of reading. That's great.
>>>> Perhaps skimming the AOSA chapter again wouldn't hurt, but otherwise
>>>> it's likely worthwhile diving in.
>>>>
>>>> > I’m currently trying to move on two fronts; please advise if I’m a
>>>> > fool flogging a dead horse or if I have any hope of getting anything
>>>> > done ;)
>>>> >
>>>> > 1. I’m indeed trying to treat the compiler as a black box (as you
>>>> > advised), trying to build a sufficiently large program where GHC is
>>>> > not “as fast as I would like” (I know that’s a very lame definition of
>>>> > “slow”, hehe). In particular, I have built the stage2 compiler with
>>>> > the “prof” flavour as you suggested, and I have chosen two examples as
>>>> > a reference “benchmark” for performance: DynFlags.hs (which seems to
>>>> > have been mentioned multiple times as a GHC perf killer) and the
>>>> > highlighting-kate package as posted here:
>>>> > https://ghc.haskell.org/trac/ghc/ticket/9221
>>>>
>>>> Indeed, #9221 would be a very interesting ticket to look at. The
>>>> highlighting-kate package is interesting in the context of that ticket
>>>> as it has a very large amount of parallelism available.
>>>>
>>>> If you do want to look at #9221, note that the cost centre profiler may
>>>> not provide the whole story. In particular, it has been speculated that
>>>> the scaling issues may be due to either,
>>>>
>>>>  * threads hitting a blackhole, resulting in blocking
>>>>
>>>>  * the usual scaling limitations of GHC's stop-the-world GC
>>>>
>>>> The eventlog may be quite useful for characterising these.
>>>>
>>>> > The idea would be to compile those with -v +RTS -p -hc -RTS enabled,
>>>> > look at the output from the .prof file AND the `-v` flag, find any
>>>> > hotspot, try to change something, recompile, observe diff, rinse and
>>>> > repeat. Do you think I have any hope of making progress this way? In
>>>> > particular, I think compiling DynFlags.hs is a bit of a dead end; I
>>>> > whipped up this buggy script, which escalated into a behemoth that
>>>> > compiles pretty much half of the compiler all over again :D
>>>> >
>>>> > ```
>>>> > #!/usr/bin/env bash
>>>> >
>>>> > ../ghc/inplace/bin/ghc-stage2 --make -j8 -v +RTS -A256M -qb0 -p -h \
>>>> >   -RTS -DSTAGE=2 -I../ghc/includes -I../ghc/compiler -I../ghc/compiler/stage2 \
>>>> >   -I../ghc/compiler/stage2/build \
>>>> >   -i../ghc/compiler/utils:../ghc/compiler/types:../ghc/compiler/typecheck:../ghc/compiler/basicTypes \
>>>> >   -i../ghc/compiler/main:../ghc/compiler/profiling:../ghc/compiler/coreSyn:../ghc/compiler/iface:../ghc/compiler/prelude \
>>>> >   -i../ghc/compiler/stage2/build:../ghc/compiler/simplStg:../ghc/compiler/cmm:../ghc/compiler/parser:../ghc/compiler/hsSyn \
>>>> >   -i../ghc/compiler/ghci:../ghc/compiler/deSugar:../ghc/compiler/simplCore:../ghc/compiler/specialise \
>>>> >   -fforce-recomp -c $@
>>>> > ```
>>>> >
>>>> > I’m running it with `./dynflags.sh ../ghc/compiler/main/DynFlags.hs`,
>>>> > but it’s taking a long time to compile (20+ minutes on my 2014 Mac
>>>> > Pro) because it’s pulling in half of the compiler anyway :D I tried to
>>>> > reuse the .hi files from my stage2 compilation but failed (GHC was
>>>> > complaining about an interface file mismatch). Long story short, I
>>>> > don’t think this will be a very agile way to proceed. Am I right? Do
>>>> > you have any recommendations here? Do I have any hope of compiling
>>>> > DynFlags.hs in a way that would make this perf investigation feasible?
>>>> >
>>>> What I usually do in this case is just take the relevant `ghc` command
>>>> line directly from the `make` output and execute it manually. I would
>>>> imagine your debug cycle would look something like,
>>>>
>>>>  * instrument the compiler
>>>>  * build stage1
>>>>  * build the stage2 DynFlags using the stage1 compiler (with a saved
>>>>    command line)
>>>>  * think
>>>>  * repeat
>>>>
>>>> This should only take a few minutes per iteration.
>>>>
>>>> > The second example (the highlighting-kate package) seems much more
>>>> > promising. It takes maybe 1-2 minutes on my machine, which is enough
>>>> > to take a look at the perf output. Do you think I should follow this
>>>> > second lead? In principle I think any 50+ module package would do
>>>> > (better if it uses a lot of TH ;) ), but this seems like a
>>>> > low-barrier-to-entry start.
>>>> >
>>>> > 2. The second path I’m exploring is simply to take a less holistic
>>>> > approach and try to dive into a performance ticket like the ones
>>>> > listed here:
>>>> > https://www.reddit.com/r/haskell/comments/45q90s/is_anything_being_done_to_remedy_the_soul/czzq6an/
>>>> > Maybe some are very specific, but it seems like fixing small things
>>>> > and moving forward could help give me an understanding of different
>>>> > sub-parts of GHC, which seems less intimidating than the black-box
>>>> > approach.
>>>> >
>>>> Do you have any specific tickets from these lists that you found
>>>> interesting?
>>>>
>>>> > In conclusion, what do you think is the best approach, 1 or 2, both or
>>>> > none? ;)
>>>>
>>>> I would say that it largely depends upon what you feel most comfortable
>>>> with. If you feel up for it, I think #9221 would be a nice, fairly
>>>> self-contained, yet high-impact ticket which would be worth spending a
>>>> few days diving further into.
>>>>
>>>> Cheers,
>>>>
>>>> - Ben
>>>>
>>>>
>>>

