Where do I start if I would like to help improve GHC compilation times?

Alfredo Di Napoli alfredo.dinapoli at gmail.com
Tue Apr 18 11:12:46 UTC 2017


Hey Bartosz,

thanks for that, looks like I won the jackpot today ;)

Your notes sound like an excellent starting point. I will try to get a
better understanding of that module myself, to see if we can reap some perf
win here, and the fact that you have "been there before" suggests this is a
promising path to take.

Cheers!

Alfredo

On 18 April 2017 at 13:01, Bartosz Nitka <niteria at gmail.com> wrote:

> Hey Alfredo,
>
> Thanks for taking a look at this. I've taken a look at this before and
> collected some notes here:
> https://github.com/niteria/notes/blob/master/PrettyPrintingGHC.md#problems
> Based on my investigations I concluded that there are places where
> pretty-printing Asm and Cmm gets accidentally quadratic.
> If I remember correctly, the core of the problem was that even in the
> "fast" mode the pretty-printer was computing the lengths of subtrees in
> `reduceDoc`, which in some cases made the operations quadratic.
> I tried implementing a "just append bytes" mode without `reduceDoc`,
> but what stopped me was my lack of understanding of the different
> semantics of the three kinds of empty Docs.
> I think if you took the time to understand it, you could implement it in a
> way that's not quadratic, reaping a substantial performance win.
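>
> To give a flavour of what I mean, here is a rough, untested sketch over a
> made-up, simplified Doc type (not the real Pretty.Doc, which has more
> constructors and the three empty variants mentioned above):
>
> ```
> -- Hypothetical, simplified Doc type, only to illustrate the idea.
> data Doc
>   = Empty
>   | Text String
>   | Doc :<>: Doc          -- horizontal composition
>   | Doc :$$: Doc          -- vertical composition (newline in between)
>
> -- "Just append bytes": a single right-to-left pass that never measures
> -- subtree widths, so rendering stays linear in the size of the Doc.
> render :: Doc -> String
> render d = go d ""
>   where
>     go Empty      rest = rest
>     go (Text s)   rest = s ++ rest
>     go (a :<>: b) rest = go a (go b rest)
>     go (a :$$: b) rest = go a ('\n' : go b rest)
> ```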
>
> Cheers,
> Bartosz
>
> 2017-04-18 11:17 GMT+01:00 Alfredo Di Napoli <alfredo.dinapoli at gmail.com>:
>
>> Hey Simon,
>>
>> thanks for chiming in. I had the same suspicion, and in fact my first
>> temptation was to use dlists to avoid costly concatenation (I guess a
>> Builder shares pretty much the same idea, namely avoiding expensive nested
>> appends). I eventually bailed out because the Pretty module has some
>> functions like sepX or fillNBE (I might be wrong, going by memory only
>> here) which need to *inspect* the current [Doc] we are carrying around;
>> dlists can't do that in O(1), so forcing them back to a plain list was
>> necessary. Am I right in thinking that a Builder here would suffer the
>> same malady? Ideally I would like constant time for appends, left concat
>> and inspection (as in pattern-matching inspection), but I don't think
>> what I'm asking for exists, at least not in a purely functional form
>> anyway ;)
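>>
>> To make that trade-off concrete, here is a tiny, untested sketch of the
>> difference-list idea (same shape as the dlist package) and of why
>> inspection forces the whole structure:
>>
>> ```
>> -- A minimal difference list, only to illustrate the trade-off.
>> newtype DList a = DList ([a] -> [a])
>>
>> append :: DList a -> DList a -> DList a
>> append (DList f) (DList g) = DList (f . g)   -- O(1)
>>
>> toList :: DList a -> [a]
>> toList (DList f) = f []                      -- O(n): builds the whole list
>>
>> -- Pattern-matching on the front therefore means converting back to a
>> -- plain list first, which is exactly where the cheap appends stop paying.
>> uncons :: DList a -> Maybe (a, [a])
>> uncons d = case toList d of
>>   []     -> Nothing
>>   (x:xs) -> Just (x, xs)
>> ```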
>>
>> That's why I was thinking of giving deques (or, more generally, Sequences)
>> a go: they don't have O(1) appends, but at least they can be inspected in
>> O(1). Whether this will yield an improvement remains to be seen; I guess it
>> depends on the ratio of inspection to appending we do in the pretty
>> printing, and if appending dominates, dlists or builders might be better.
>> Last but not least, although the current implementation doesn't backtrack
>> anyway, I'm wondering whether switching to a different representation for a
>> Doc (namely a huge function composition of Token(s), as described in the
>> paper) could be beneficial as well.
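>>
>> For reference, a quick (untested) sketch of what Data.Sequence from the
>> containers package would buy us; the complexities are the ones its
>> documentation advertises:
>>
>> ```
>> import qualified Data.Sequence as Seq
>> import Data.Sequence (Seq, ViewL (..), (|>), (><))
>>
>> -- O(1) amortised snoc, O(log(min(m,n))) concatenation.
>> docs :: Seq String
>> docs = (Seq.fromList ["foo", "bar"] |> "baz") >< Seq.fromList ["qux"]
>>
>> -- O(1) inspection of the front, unlike a dlist.
>> inspectFront :: Seq String -> Maybe String
>> inspectFront s = case Seq.viewl s of
>>   EmptyL -> Nothing
>>   x :< _ -> Just x
>> ```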
>>
>> Do you guys have any opinions? Yesterday I extracted Pretty.hs from the
>> source tree and I'm now planning to produce a criterion benchmark and
>> compare different implementations, although it's a bit hard to predict the
>> real-world usage, as I don't have access to a representative Doc as
>> produced by GHC, so my benchmarks could all be ill-founded.
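>>
>> (For the record, what I have in mind is just the standard criterion setup,
>> with placeholder render functions standing in for the real implementations
>> to be compared; nothing more exotic than this sketch:)
>>
>> ```
>> import Criterion.Main
>>
>> -- Placeholders: the real benchmark would call the extracted Pretty.hs
>> -- renderer and the candidate reimplementations instead.
>> renderList, renderDList, renderSeq :: Int -> String
>> renderList  n = concat (replicate n "x")
>> renderDList n = concat (replicate n "x")
>> renderSeq   n = concat (replicate n "x")
>>
>> main :: IO ()
>> main = defaultMain
>>   [ bgroup "render"
>>       [ bench "list"  $ nf renderList  10000
>>       , bench "dlist" $ nf renderDList 10000
>>       , bench "seq"   $ nf renderSeq   10000
>>       ]
>>   ]
>> ```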
>>
>> Alfredo
>>
>> On 18 April 2017 at 12:01, Simon Marlow <marlowsd at gmail.com> wrote:
>>
>>> Pretty-printing the asm is a likely contender for optimisation; however,
>>> the problem is not the pretty-printing per se.  We don't actually use any
>>> of the backtracking stuff when printing asm, since there's no point in
>>> nicely indenting things or wrapping lines.  The overhead is probably all
>>> in the layer of data structure that we generate in Pretty before it gets
>>> dumped into raw bytes.  Using a ByteString Builder instead might yield
>>> some improvement.
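>>>
>>> Something along these lines, say (an untested sketch; the real change
>>> would of course have to happen inside Pretty and the asm printer rather
>>> than in a standalone program):
>>>
>>> ```
>>> import qualified Data.ByteString.Builder as B
>>> import Data.Monoid ((<>))
>>> import System.IO (stdout)
>>>
>>> -- Builders have cheap mappend and can be written straight to a Handle,
>>> -- without materialising an intermediate String or Doc layer.
>>> asmLine :: B.Builder
>>> asmLine = B.string7 "\tmovq " <> B.string7 "%rax, %rbx\n"
>>>
>>> main :: IO ()
>>> main = B.hPutBuilder stdout (asmLine <> asmLine)
>>> ```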
>>>
>>> Cheers
>>> Simon
>>>
>>> On 17 April 2017 at 18:44, Alfredo Di Napoli <alfredo.dinapoli at gmail.com> wrote:
>>>
>>>> Dear all,
>>>>
>>>> after sprinkling (ehm, littering) GHC's source code with cost centres, I
>>>> was not surprised to see that roughly 20% of the compilation time (as per
>>>> the .prof output) was spent in the Core generation/simplification process
>>>> (10% of the total time) and in the asm code gen (another 10%).
>>>>
>>>> I almost immediately abandoned the idea of trying to optimise some
>>>> modules in simplCore (considering my zero knowledge of GHC internals
>>>> anyway...), but I have been dwelling on the following: Outputable.hs and
>>>> Pretty.hs seem to have been implemented making deliberate use of lists
>>>> and concatenations between them, which left me wondering whether there
>>>> was room for optimisation there. I found this interesting paper on the
>>>> topic:
>>>>
>>>> https://www.cs.kent.ac.uk/pubs/2005/2062/content.pdf
>>>>
>>>> Now, it's totally possible that this has already been tried (with no
>>>> success), but judging from the original copyright of Pretty.hs (dated
>>>> 2001), it seems it was written prior to the work of Olaf Chitil (the
>>>> author of the paper).
>>>>
>>>> TL;DR I was thinking (even just as a fun exercise to learn more about
>>>> GHC internals) of leveraging the ideas of that paper and switching to a
>>>> different implementation of `Doc`, coupled with the use of lazy deques,
>>>> which *might* improve the performance of the codegen and thus of the
>>>> compiler overall. Am I fighting a strawman (or flogging a dead horse, pick
>>>> your rhetorical figure :D ), or is there even a tiny chance of this being
>>>> actually useful?
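>>>>
>>>> As a strawman, the representation I have in mind is roughly the following
>>>> toy sketch of the paper's idea (untested, and ignoring grouping and the
>>>> lazy-deque machinery that make the real algorithm interesting):
>>>>
>>>> ```
>>>> data Token = TText String | TLine
>>>>
>>>> -- A document is a function over the stream of tokens that follows it.
>>>> newtype Doc = Doc ([Token] -> [Token])
>>>>
>>>> text :: String -> Doc
>>>> text s = Doc (TText s :)
>>>>
>>>> line :: Doc
>>>> line = Doc (TLine :)
>>>>
>>>> (<+>) :: Doc -> Doc -> Doc          -- composition is O(1)
>>>> Doc f <+> Doc g = Doc (f . g)
>>>>
>>>> layout :: Doc -> String
>>>> layout (Doc f) = concatMap renderTok (f [])
>>>>   where
>>>>     renderTok (TText s) = s
>>>>     renderTok TLine     = "\n"
>>>> ```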
>>>>
>>>> Have a nice evening,
>>>>
>>>> Alfredo
>>>>
>>>> On 11 April 2017 at 00:47, Ben Gamari <ben at smart-cactus.org> wrote:
>>>>
>>>>> Alfredo Di Napoli <alfredo.dinapoli at gmail.com> writes:
>>>>>
>>>>> > Hey Ben,
>>>>> >
>>>>> Hi Alfredo,
>>>>>
>>>>> Sorry for the late response! The email queue from the weekend was a bit
>>>>> longer than I would like.
>>>>>
>>>>> > as promised, I'm back to you with something more articulated and
>>>>> > hopefully meaningful. I do hear you perfectly: probably trying to dive
>>>>> > head-first into this without at least a rough understanding of the
>>>>> > performance hotspots or of GHC's overall architecture is going to do me
>>>>> > more harm than good (I get the overall picture and I'm aware of the
>>>>> > different stages of the GHC compilation pipeline, but that's far from
>>>>> > saying I'm proficient with the architecture as a whole). A couple of
>>>>> > years ago I also read the GHC chapter in the "Architecture of Open
>>>>> > Source Applications" book, but I don't know how much of that is still
>>>>> > relevant. If it is, I guess I should refresh my memory.
>>>>> >
>>>>> It sounds like you have done a good amount of reading. That's great.
>>>>> Perhaps skimming the AOSA chapter again wouldn't hurt, but otherwise
>>>>> it's likely worthwhile diving in.
>>>>>
>>>>> > I'm currently trying to move on two fronts; please advise if I'm a
>>>>> > fool flogging a dead horse or if I have any hope of getting anything
>>>>> > done ;)
>>>>> >
>>>>> > 1. I'm trying to indeed treat the compiler as a black box (as you
>>>>> > advised), building a sufficiently large program where GHC is not "as
>>>>> > fast as I would like" (I know that's a very lame definition of "slow",
>>>>> > hehe). In particular, I have built the stage2 compiler with the "prof"
>>>>> > flavour as you suggested, and I have chosen two examples as a reference
>>>>> > "benchmark" for performance: DynFlags.hs (which seems to have been
>>>>> > mentioned multiple times as a GHC perf killer) and the
>>>>> > highlighting-kate package, as posted here:
>>>>> > https://ghc.haskell.org/trac/ghc/ticket/9221.
>>>>>
>>>>> Indeed, #9221 would be a very interesting ticket to look at. The
>>>>> highlighting-kate package is interesting in the context of that ticket
>>>>> as it has a very large amount of parallelism available.
>>>>>
>>>>> If you do want to look at #9221, note that the cost centre profiler may
>>>>> not provide the whole story. In particular, it has been speculated that
>>>>> the scaling issues may be due to either,
>>>>>
>>>>>  * threads hitting a blackhole, resulting in blocking
>>>>>
>>>>>  * the usual scaling limitations of GHC's stop-the-world GC
>>>>>
>>>>> The eventlog may be quite useful for characterising these.
>>>>>
>>>>> > The idea would be to compile those with -v +RTS -p -hc -RTS enabled,
>>>>> > look at the output from the .prof file AND the `-v` flag, find any
>>>>> > hotspot, try to change something, recompile, observe diff, rinse and
>>>>> > repeat. Do you think I have any hope of making progress this way? In
>>>>> > particular, I think compiling DynFlags.hs is a bit of a dead end; I
>>>>> > whipped up this buggy script, which escalated into a behemoth that
>>>>> > compiles pretty much half of the compiler all over again :D
>>>>> >
>>>>> > ```
>>>>> > #!/usr/bin/env bash
>>>>> >
>>>>> > ../ghc/inplace/bin/ghc-stage2 --make -j8 -v +RTS -A256M -qb0 -p -h \
>>>>> >   -RTS -DSTAGE=2 -I../ghc/includes -I../ghc/compiler -I../ghc/compiler/stage2 \
>>>>> >   -I../ghc/compiler/stage2/build \
>>>>> >   -i../ghc/compiler/utils:../ghc/compiler/types:../ghc/compiler/typecheck:../ghc/compiler/basicTypes \
>>>>> >   -i../ghc/compiler/main:../ghc/compiler/profiling:../ghc/compiler/coreSyn:../ghc/compiler/iface:../ghc/compiler/prelude \
>>>>> >   -i../ghc/compiler/stage2/build:../ghc/compiler/simplStg:../ghc/compiler/cmm:../ghc/compiler/parser:../ghc/compiler/hsSyn \
>>>>> >   -i../ghc/compiler/ghci:../ghc/compiler/deSugar:../ghc/compiler/simplCore:../ghc/compile/specialise \
>>>>> >   -fforce-recomp -c $@
>>>>> > ```
>>>>> >
>>>>> > I'm running it with `./dynflags.sh ../ghc/compiler/main/DynFlags.hs`,
>>>>> > but it's taking a long time to compile (20+ minutes on my 2014 Mac Pro)
>>>>> > because it's pulling in half of the compiler anyway :D I tried to reuse
>>>>> > the .hi files from my stage2 compilation, but I failed (GHC was
>>>>> > complaining about an interface file mismatch). Long story short, I
>>>>> > don't think this will be a very agile way to proceed. Am I right? Do
>>>>> > you have any recommendations here? Do I have any hope of compiling
>>>>> > DynFlags.hs in a way that would make this perf investigation feasible?
>>>>> >
>>>>> What I usually do in this case is just take the relevant `ghc` command
>>>>> line directly from the `make` output and execute it manually. I would
>>>>> imagine your debug cycle would look something like,
>>>>>
>>>>>  * instrument the compiler
>>>>>  * build stage1
>>>>>  * use the stage1 compiler to build DynFlags (as part of stage2),
>>>>>    using a saved command line
>>>>>  * think
>>>>>  * repeat
>>>>>
>>>>> This should only take a few minutes per iteration.
>>>>>
>>>>> > The second example (the highlighting-kate package) seems much more
>>>>> > promising. It takes maybe 1-2 minutes on my machine, which is short
>>>>> > enough to take a look at the perf output. Do you think I should follow
>>>>> > this second lead? In principle I think any package of 50+ modules would
>>>>> > do (better if it uses a lot of TH ;) ), but this seems like a
>>>>> > low-barrier place to start.
>>>>> >
>>>>> > 2. The second path I'm exploring is simply to take a less holistic
>>>>> > approach and try to dive into a performance ticket like the ones
>>>>> > listed here:
>>>>> > https://www.reddit.com/r/haskell/comments/45q90s/is_anything_being_done_to_remedy_the_soul/czzq6an/
>>>>> > Maybe some are very specific, but it seems like fixing small things and
>>>>> > moving forward could help give me an understanding of different
>>>>> > sub-parts of GHC, which seems less intimidating than the black-box
>>>>> > approach.
>>>>> >
>>>>> Do you have any specific tickets from these lists that you found
>>>>> interesting?
>>>>>
>>>>> > In conclusion, what do you think is the best approach: 1 or 2, both,
>>>>> > or none? ;)
>>>>>
>>>>> I would say that it largely depends upon what you feel most comfortable
>>>>> with. If you feel up for it, I think #9221 would be a nice, fairly
>>>>> self-contained, yet high-impact ticket which would be worth spending a
>>>>> few days diving further into.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> - Ben
>>>>>
>>>>>
>>>>
>>>
>>
>