<div dir="ltr">Hey Bartosz,<div><br></div><div>thanks for that, looks like I won the jackpot today ;)</div><div><br></div><div>Sounds like your notes are an excellent starting point. Will try to get a better understanding of that module myself, to see if we can reap some perf win here, but the fact that you’ve “been there before” suggests this is a promising path to take.</div><div><br></div><div>Cheers!</div><div><br></div><div>Alfredo</div></div><div class="gmail_extra"><br><div class="gmail_quote">On 18 April 2017 at 13:01, Bartosz Nitka <span dir="ltr"><<a href="mailto:niteria@gmail.com" target="_blank">niteria@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Hey Alfredo,<div><br></div><div>Thanks for taking a look at this. I looked into this before and collected some notes here: <a href="https://github.com/niteria/notes/blob/master/PrettyPrintingGHC.md#problems" target="_blank">https://github.com/<wbr>niteria/notes/blob/master/<wbr>PrettyPrintingGHC.md#problems</a><br>Based on my investigations I concluded that there are places where pretty-printing Asm and Cmm gets accidentally quadratic.</div><div>If I remember correctly, the core of the problem was that even in the "fast" mode the pretty printer was computing the lengths of subtrees in `reduceDoc`, which in some cases made the operations quadratic.<br>I've tried implementing a "just append bytes" mode without `reduceDoc`, but what stopped me was my lack of understanding of the different semantics of the 3 kinds of empty Doc's.<br>I think if you took the time to understand it, you could implement it in a way that's not quadratic, reaping a substantial performance win.<br><br>Cheers,</div><div>Bartosz</div></div><div class="HOEnZb"><div class="h5"><div class="gmail_extra"><br><div class="gmail_quote">2017-04-18 11:17 GMT+01:00 Alfredo Di Napoli <span dir="ltr"><<a href="mailto:alfredo.dinapoli@gmail.com" 
target="_blank">alfredo.dinapoli@gmail.com</a>></span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Hey Simon,<div><br></div><div>thanks for chiming in. I had the same suspicion as well, and in fact my first temptation was to use dlists to avoid costly concatenation (I guess a Builder shares pretty much the same idea, which is avoiding left-nested appends). I eventually bailed out because the Pretty module had some functions like sepX or fillNBE (I might be wrong, going by memory only here) which needed to *inspect* the current [Doc] we were carrying around; dlists couldn’t accomplish this in O(1), so forcing them back to a list was necessary. Am I right in thinking that using a Builder here will suffer the same malady? Ideally I would like constant time for appends, left concatenation & inspection (as in pattern-matching inspection), but I don’t think what I’m asking for exists, at least not in a purely functional setting anyway ;)</div><div><br></div><div>That’s why I was thinking of giving Deques (or more generally, Sequences) a go: they don’t have O(1) appends but at least can be inspected in O(1). The question to answer is whether this will yield an improvement or not, which I guess depends upon the ratio of inspection to appending we do in the pretty printing. In that case using dlists or builders might be better. Last but not least, although the current implementation doesn’t backtrack anyway, I’m wondering whether switching to a different representation for a Doc (namely a huge function composition of Token(s), as described in the paper) could be beneficial as well.</div><div><br></div><div>Do you guys have any opinions? 
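The dlist trade-off described above can be sketched as follows (a toy illustration only, not GHC's actual Pretty/Doc types): a difference list represents a list as a function, giving O(1) append, but any pattern-matching inspection first requires forcing it back to an ordinary list.

```haskell
-- Toy sketch, not GHC's Pretty types: difference lists trade O(1) append
-- for O(n) inspection.
type DList a = [a] -> [a]

fromList :: [a] -> DList a
fromList = (++)

-- O(1) append: just function composition.
append :: DList a -> DList a -> DList a
append = (.)

-- O(n) conversion, required before any pattern matching.
toList :: DList a -> [a]
toList d = d []

main :: IO ()
main = print (toList (fromList "Hello, " `append` fromList "Doc!"))
```

This is exactly why a Builder (same idea, specialised to bytes) also cannot offer O(1) inspection: the structure is deferred inside a function or buffer until it is forced.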
Yesterday I extracted Pretty.hs from the source tree and I’m now planning to produce a criterion benchmark and compare different implementations, although it’s a bit hard to predict real-world usage as I don’t have access to a representative Doc as produced by GHC, so my benchmarks could all be ill-founded.</div><span class="m_1143046694224870326HOEnZb"><font color="#888888"><div><br></div><div>Alfredo </div></font></span></div><div class="m_1143046694224870326HOEnZb"><div class="m_1143046694224870326h5"><div class="gmail_extra"><br><div class="gmail_quote">On 18 April 2017 at 12:01, Simon Marlow <span dir="ltr"><<a href="mailto:marlowsd@gmail.com" target="_blank">marlowsd@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">Pretty-printing the asm is a likely contender for optimisation; however, the problem is not the pretty-printing per se. We don't actually use any of the backtracking stuff when printing asm, since there's no point nicely indenting things or wrapping lines. The overhead is probably all in the layer of data structure that we generate in Pretty before it gets dumped into raw bytes. 
Using a ByteString Builder instead might yield some improvement.</div><div class="gmail_quote"><br></div><div class="gmail_quote">Cheers</div><span class="m_1143046694224870326m_-2252503471149853309HOEnZb"><font color="#888888"><div class="gmail_quote">Simon</div><div class="gmail_quote"><br></div></font></span><div class="gmail_quote"><div><div class="m_1143046694224870326m_-2252503471149853309h5">On 17 April 2017 at 18:44, Alfredo Di Napoli <span dir="ltr"><<a href="mailto:alfredo.dinapoli@gmail.com" target="_blank">alfredo.dinapoli@gmail.com</a>></span> wrote:<br></div></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div class="m_1143046694224870326m_-2252503471149853309h5"><div dir="ltr">Dear all,<div><br></div><div>after sprinkling (ehm, littering) GHC source code with cost centres, I was not surprised to see that roughly 20% of the compilation time (as in .prof) was spent in the core gen/simplification process (10% of the total time) and on the asm code gen (another 10%).</div><div><br></div><div>I almost immediately abandoned the idea of trying to optimise some modules in simplCore (considering my zero knowledge of GHC internals anyway..) but I have been dwelling on the following: Outputable.hs and Pretty.hs seem to have been implemented making deliberate use of lists and concatenations between them, which left me wondering if there was room for optimisation there. 
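The concern about list concatenation can be made concrete with a toy sketch (illustrative only, not GHC code): appending to the *left* of an accumulated list re-traverses the growing prefix on every step, which is precisely where such code goes accidentally quadratic.

```haskell
-- Toy sketch (not GHC code): left-nested (++) re-walks the growing prefix
-- on every append, so building n pieces costs O(n^2); right-nested appends
-- cost O(n) overall.
leftNested, rightNested :: Int -> String
leftNested  n = foldl (++) "" (replicate n "x")  -- quadratic
rightNested n = foldr (++) "" (replicate n "x")  -- linear

main :: IO ()
main = print (leftNested 1000 == rightNested 1000)  -- same result, very different cost
```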
I have found this interesting paper on the topic:</div><div><br></div><div><a href="https://www.cs.kent.ac.uk/pubs/2005/2062/content.pdf" target="_blank">https://www.cs.kent.ac.uk/pubs<wbr>/2005/2062/content.pdf</a><br></div><div><br></div><div>Now, it’s totally possible that this has already been tried (with no success), but judging from the original copyright of Pretty.hs (dated 2001), it seems it was written prior to the work of Olaf Chitil (the author of the paper).</div><div><br></div><div>TL;DR I was thinking (even just as a fun exercise to learn more about GHC internals) of leveraging the ideas of that paper and switching to a different implementation for `Doc`, coupled with the use of lazy dequeues, which *might* improve the performance of the codegen and thus of the compiler overall. Am I fighting a strawman (or flogging a dead horse, pick your rhetorical figure :D ) or is there even a tiny chance of this being actually useful?</div><div><br></div><div>Have a nice evening,</div><div><br></div><div>Alfredo</div></div><div class="m_1143046694224870326m_-2252503471149853309m_6374604320592718687HOEnZb"><div class="m_1143046694224870326m_-2252503471149853309m_6374604320592718687h5"><div class="gmail_extra"><br><div class="gmail_quote">On 11 April 2017 at 00:47, Ben Gamari <span dir="ltr"><<a href="mailto:ben@smart-cactus.org" target="_blank">ben@smart-cactus.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span>Alfredo Di Napoli <<a href="mailto:alfredo.dinapoli@gmail.com" target="_blank">alfredo.dinapoli@gmail.com</a>> writes:<br>
<br>
</span>> Hey Ben,<br>
><br>
Hi Alfredo,<br>
<br>
Sorry for the late response! The email queue from the weekend was a bit<br>
longer than I would like.<br>
<span><br>
> as promised I’m back to you with something more articulated and hopefully<br>
> meaningful. I do hear you perfectly — probably trying to dive head-first<br>
> into this without at least a rough understanding of the performance<br>
> hotspots or the GHC overall architecture is going to do me more harm than<br>
> good (I get the overall picture and I’m aware of the different stages of<br>
> the GHC compilation pipeline, but it’s far from saying I’m proficient with<br>
> the architecture as a whole). A couple of years ago I also read the GHC<br>
> chapter in the “Architecture of Open Source Applications” book, but I don’t<br>
> know how much that is still relevant. If it is, I guess I should refresh my<br>
> memory.<br>
><br>
</span>It sounds like you have done a good amount of reading. That's great.<br>
Perhaps skimming the AOSA chapter again wouldn't hurt, but otherwise<br>
it's likely worthwhile diving in.<br>
<span><br>
> I’m currently trying to move on 2 fronts — please advise if I’m a fool<br>
> flogging a dead horse or if I have any hope of getting anything done ;)<br>
><br>
> 1. I’m indeed trying to treat the compiler as a black box (as you<br>
> advised), trying to build a sufficiently large program where GHC is not “as<br>
> fast as I would like” (I know that’s a very lame definition of “slow”,<br>
> hehe). In particular, I have built the stage2 compiler with the “prof”<br>
> flavour as you suggested, and I have chosen 2 examples as a reference<br>
> “benchmark” for performance; DynFlags.hs (which seems to have been<br>
> mentioned multiple times as a GHC perf killer) and the highlighting-kate<br>
> package as posted here: <a href="https://ghc.haskell.org/trac/ghc/ticket/9221" rel="noreferrer" target="_blank">https://ghc.haskell.org/trac/g<wbr>hc/ticket/9221</a> .<br>
<br>
</span>Indeed, #9221 would be a very interesting ticket to look at. The<br>
highlighting-kate package is interesting in the context of that ticket<br>
as it has a very large amount of parallelism available.<br>
<br>
If you do want to look at #9221, note that the cost centre profiler may<br>
not provide the whole story. In particular, it has been speculated that<br>
the scaling issues may be due to either,<br>
<br>
* threads hitting a blackhole, resulting in blocking<br>
<br>
* the usual scaling limitations of GHC's stop-the-world GC<br>
<br>
The eventlog may be quite useful for characterising these.<br>
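For reference, collecting an eventlog roughly means compiling with -threaded -eventlog -rtsopts and running with +RTS -N -l, then opening the resulting .eventlog in ThreadScope (flag names from memory; double-check against the GHC User's Guide). A toy parallel workload to try this on, purely as a sketch:

```haskell
-- Toy workload (illustrative): two threads racing on independent
-- computations, the kind of program whose GC pauses and thread blocking
-- the eventlog makes visible.
-- Build: ghc -O2 -threaded -eventlog -rtsopts EventlogDemo.hs
-- Run:   ./EventlogDemo +RTS -N2 -l    (writes EventlogDemo.eventlog)
import Control.Concurrent (forkIO, newEmptyMVar, putMVar, takeMVar)

-- A deliberately naive, allocation-heavy computation.
naiveFib :: Int -> Integer
naiveFib n | n < 2     = fromIntegral n
           | otherwise = naiveFib (n - 1) + naiveFib (n - 2)

main :: IO ()
main = do
  a <- newEmptyMVar
  b <- newEmptyMVar
  _ <- forkIO (putMVar a $! naiveFib 25)
  _ <- forkIO (putMVar b $! naiveFib 25)
  x <- takeMVar a
  y <- takeMVar b
  print (x + y)
```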
<div><div class="m_1143046694224870326m_-2252503471149853309m_6374604320592718687m_3179564585870124614h5"><br>
> The idea would be to compile those with -v +RTS -p -hc -RTS enabled,<br>
> look at the output from the .prof file AND the `-v` flag, find any<br>
> hotspot, try to change something, recompile, observe diff, rinse and<br>
> repeat. Do you think I have any hope of making progress this way? In<br>
> particular, I think compiling DynFlags.hs is a bit of a dead-end; I<br>
> whipped up this buggy script which<br>
> escalated into a behemoth that compiles pretty much half of the<br>
> compiler once again :D<br>
><br>
> ```<br>
> #!/usr/bin/env bash<br>
><br>
> ../ghc/inplace/bin/ghc-stage2 --make -j8 -v +RTS -A256M -qb0 -p -h \<br>
> -RTS -DSTAGE=2 -I../ghc/includes -I../ghc/compiler -I../ghc/compiler/stage2<br>
> \<br>
> -I../ghc/compiler/stage2/build \<br>
> -i../ghc/compiler/utils:../ghc<wbr>/compiler/types:../ghc/compile<wbr>r/typecheck:../ghc/compiler/ba<wbr>sicTypes<br>
> \<br>
> -i../ghc/compiler/main:../ghc/<wbr>compiler/profiling:../ghc/comp<wbr>iler/coreSyn:../ghc/compiler/i<wbr>face:../ghc/compiler/prelude<br>
> \<br>
> -i../ghc/compiler/stage2/build<wbr>:../ghc/compiler/simplStg:../g<wbr>hc/compiler/cmm:../ghc/compile<wbr>r/parser:../ghc/compiler/hsSyn<br>
> \<br>
> -i../ghc/compiler/ghci:../ghc/<wbr>compiler/deSugar:../ghc/compil<wbr>er/simplCore:../ghc/compiler/<wbr>specialise<br>
> \<br>
> -fforce-recomp -c $@<br>
> ```<br>
><br>
> I’m running it with `./dynflags.sh ../ghc/compiler/main/DynFlags.<wbr>hs` but<br>
> it’s taking a lot to compile (20+ mins on my 2014 mac Pro) because it’s<br>
> pulling in half of the compiler anyway :D I tried to reuse the .hi files<br>
> from my stage2 compilation but I failed (GHC was complaining about<br>
> interface file mismatch). Long story short, I don’t think it will be a<br>
> very agile way to proceed. Am I right? Do you have any recommendation in<br>
> such sense? Do I have any hope to compile DynFlags.hs in a way which would<br>
> make this perf investigation feasible?<br>
><br>
</div></div>What I usually do in this case is just take the relevant `ghc` command<br>
line directly from the `make` output and execute it manually. I would<br>
imagine your debug cycle would look something like,<br>
<br>
* instrument the compiler<br>
* build stage1<br>
* use the stage1 compiler to build DynFlags (using a saved command line)<br>
* think<br>
* repeat<br>
<br>
This should only take a few minutes per iteration.<br>
<span><br>
> The second example (the highlighting-kate package) seems much more<br>
> promising. It takes maybe 1-2 mins on my machine, which is enough to take a<br>
> look at the perf output. Do you think I should follow this second lead? In<br>
> principle any 50+ module package I think would do (better if with a lot of<br>
> TH ;) ) but this seems like a low-entry barrier start.<br>
><br>
> 2. The second path I’m exploring is simply to take a less holistic approach<br>
> and try to dive into a performance ticket like the ones listed here:<br>
> <a href="https://www.reddit.com/r/haskell/comments/45q90s/is_anything_being_done_to_remedy_the_soul/czzq6an/" rel="noreferrer" target="_blank">https://www.reddit.com/r/haske<wbr>ll/comments/45q90s/is_anything<wbr>_being_done_to_remedy_the_soul<wbr>/czzq6an/</a><br>
> Maybe some are very specific, but it seems like fixing small things and<br>
> moving forward could help give me an understanding of different sub-parts of<br>
> GHC, which seems less intimidating than the black-box approach.<br>
><br>
</span>Do you have any specific tickets from these lists that you found<br>
interesting?<br>
<span><br>
> In conclusion, what do you think is the best approach, 1 or 2, both or<br>
> none? ;)<br>
<br>
</span>I would say that it largely depends upon what you feel most comfortable<br>
with. If you feel up for it, I think #9221 would be a nice, fairly<br>
self-contained, yet high-impact ticket which would be worth spending a<br>
few days diving further into.<br>
<br>
Cheers,<br>
<br>
- Ben<br>
<br>
</blockquote></div><br></div>
</div></div><br></div></div><span>______________________________<wbr>_________________<br>
ghc-devs mailing list<br>
<a href="mailto:ghc-devs@haskell.org" target="_blank">ghc-devs@haskell.org</a><br>
<a href="http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs" rel="noreferrer" target="_blank">http://mail.haskell.org/cgi-bi<wbr>n/mailman/listinfo/ghc-devs</a><br>
<br></span></blockquote></div><br></div></div>
</blockquote></div><br></div>
</div></div><br>
<br></blockquote></div><br></div>
</div></div></blockquote></div><br></div>