<div><br></div><div><br></div>Agreed. There are also some other mismatches between GHC and LLVM, in a few fun and interesting ways! <div><br></div><div><br></div><div><br></div><div>There's a lot of room for improvement in both code generators, but there's also a lot of room to make it easier to experiment with improvements. E.g. we don't have a peephole pass per target, so those get hacked into the pretty-printing code, last time I checked.<br><br>On Thursday, June 16, 2016, Ben Gamari <<a href="mailto:ben@smart-cactus.org" target="_blank">ben@smart-cactus.org</a>> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>

Ccing David Spitzenberg, who has thought about proc-point splitting, which<br>

is relevant for reasons that we will see below.<br>

<br>

<br>

Harendra Kumar <<a>harendra.kumar@gmail.com</a>> writes:<br>

<br>

> On 16 June 2016 at 13:59, Ben Gamari <<a>ben@smart-cactus.org</a>> wrote:<br>

>><br>

>> It actually came to my attention while researching this that the<br>

>> -fregs-graph flag is currently silently ignored [2]. Unfortunately this<br>

>> means you'll need to build a new compiler if you want to try using it.<br>

><br>

> Yes I did try -fregs-graph and -fregs-iterative both. To debug why nothing<br>

> changed I had to compare the executables produced with and without the<br>

> flags and found them identical.  A note in the manual could have saved me<br>

> some time since that's the first place to go for help. I was wondering if I<br>

> am making a mistake in the build and if it is not being rebuilt<br>

> properly. Your note confirms my observation, it indeed does not change<br>

> anything.<br>

><br>

Indeed; I've opened D2335 [1] to reenable -fregs-graph and add an<br>

appropriate note to the user's guide.<br>

<br>

>> All-in-all, the graph coloring allocator is in great need of some love;<br>

>> Harendra, perhaps you'd like to have a try at dusting it off and perhaps<br>

>> look into why it regresses in compiler performance? It would be great if<br>

>> we could use it by default.<br>

><br>

> Yes, I can try that. In fact I was going in that direction and then stopped<br>

> to look at what llvm does. llvm gave me impressive results in some cases<br>

> though not so great in others. I compared the code generated by llvm and it<br>

> perhaps did a better job in theory (used fewer instructions) but due to<br>

> more spilling the end result was pretty similar.<br>

><br>

For the record, I have also struggled with register spilling issues in<br>

the past. See, for instance, #10012, which describes a behavior which<br>

arises from the C-- sinking pass's unwillingness to duplicate code<br>

across branches. While in general it's good to avoid the code bloat that<br>

this duplication implies, in the case shown in that ticket duplicating<br>

the computation would be significantly less code than the bloat from<br>

spilling the needed results.<br>

<br>

> But I found a few interesting optimizations that llvm did. For example,<br>

> there was a heap adjustment and check in the looping path which was<br>

> redundant and was readjusted in the loop itself without use. LLVM either<br>

> removed the redundant _adjustments_ in the loop or moved them out of the<br>

> loop. But it did not remove the corresponding heap _checks_. That makes me<br>

> wonder if the redundant heap checks can also be moved or removed. Perhaps we can<br>

> do some sort of loop analysis at the CMM level itself and avoid or remove<br>

> the redundant heap adjustments as well as checks, or at least float them out<br>

> of the loop wherever possible. That sort of optimization can make a<br>

> significant difference to my case at least. Since we are explicitly aware<br>

> of the heap at the CMM level there may be an opportunity to do better than<br>

> llvm if we optimize the generated CMM or the generation of CMM itself.<br>

><br>

Very interesting, thanks for writing this down! Indeed if these checks<br>

really are redundant then we should try to avoid them. Do you have any<br>

code you could share that demonstrates this?<br>

<br>
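To make the idea concrete, here is a toy model of hoisting a heap check out of a loop. It is only a sketch under stated assumptions, not code from the program under discussion and not what GHC or LLVM emit: the Heap class, the fixed CELL size, and both function names are hypothetical, and the hoisted form is only valid when the trip count and the allocation per iteration are known before the loop starts.<br>

```python
from dataclasses import dataclass

CELL = 16  # bytes allocated per iteration (an assumed, fixed size)


@dataclass
class Heap:
    """Toy bump allocator; hp and hp_lim loosely mimic Hp and HpLim."""
    hp: int          # next free byte
    hp_lim: int      # first byte past the usable heap
    checks: int = 0  # number of heap checks executed so far


def alloc_check_each(heap: Heap, n: int) -> int:
    """Check the limit on every iteration; return the cells allocated."""
    done = 0
    for _ in range(n):
        heap.checks += 1
        if heap.hp + CELL > heap.hp_lim:
            break  # a real runtime would enter the garbage collector here
        heap.hp += CELL
        done += 1
    return done


def alloc_check_once(heap: Heap, n: int) -> int:
    """Hoisted variant: a single check covers all n iterations."""
    heap.checks += 1
    if heap.hp + n * CELL > heap.hp_lim:
        return 0  # a real runtime would collect and retry
    for _ in range(n):
        heap.hp += CELL  # no per-iteration check needed
    return n
```

With enough heap, both variants leave hp in the same place, but the first executes n checks and the second executes one. When the heap is too small they differ: the per-iteration variant allocates as much as fits, while the hoisted variant allocates nothing.<br>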

It would be great to open Trac tickets to track some of the optimization<br>

opportunities that you noted we may be missing. Trac tickets are far<br>

easier to track over longer durations than mailing list conversations,<br>

which tend to get lost in the noise after a few weeks pass.<br>

<br>

> A thought that came to my mind was whether we should focus on getting<br>

> better code out of the llvm backend or the native code generator. LLVM<br>

> seems pretty good at the specialized task of code generation and low level<br>

> optimization, it is well funded, widely used and has big community<br>

> support. That allows us to leverage that huge effort and take advantage of<br>

> the new developments. Does it make sense to outsource the code generation<br>

> and low level optimization tasks to llvm and have ghc focus on higher level<br>

> optimizations which are harder to do at the llvm level? What are the<br>

> downsides of using llvm exclusively in future?<br>

><br>

<br>

There is indeed a question of where we wish to focus our optimization<br>

efforts. However, I think using LLVM exclusively would be a mistake.<br>

LLVM is a rather large dependency that has in the past been rather<br>

difficult to track (this is why we now only target one LLVM release in a<br>

given GHC release). Moreover, it's significantly slower than our<br>

existing native code generator. There are a number of reasons for this,<br>

some of which are fixable. For instance, we currently make no effort to tell<br>

LLVM which passes are worth running and which we've handled; this is<br>

something which should be fixed but will require a rather significant<br>

investment by someone to determine how GHC's and LLVM's passes overlap,<br>

how they interact, and generally which are helpful (see GHC #11295).<br>

<br>

Furthermore, there are a few annoying impedance mismatches between Cmm<br>

and LLVM's representation. This can be seen in our treatment of proc<br>

points: when we need to take the address of a block within a function,<br>

LLVM requires that we break the block into a separate procedure, hiding<br>

many potential optimizations from the optimizer. This was discussed<br>

further on this list earlier this year [2]. It would be great to<br>

eliminate proc-point splitting but doing so will almost certainly<br>

require cooperation from LLVM.<br>

<br>

Cheers,<br>

<br>

- Ben<br>

<br>

<br>

[1] <a href="https://phabricator.haskell.org/D2335" target="_blank">https://phabricator.haskell.org/D2335</a><br>

[2] <a href="https://mail.haskell.org/pipermail/ghc-devs/2015-November/010535.html" target="_blank">https://mail.haskell.org/pipermail/ghc-devs/2015-November/010535.html</a><br>

</blockquote>

</div>