[Haskell-cafe] Compiler backend question

Tue Jan 1 14:43:30 EST 2008

Peter Verswyvelen wrote:
...
> I was wondering, why doesn't GHC use the GCC (or any other standard 
> compiler) backend intermediate code? The backend of GCC generates highly 
> optimized code no? Or is the intermediate code format of GCC (or other 
> compilers) not suitable for Haskell?
...

My guess is that there is scope for producing assembly code that
out-performs GCC. I speak as someone who has no actual knowledge
of how GCC works :) I have a theory that the back-end of most compilers,
including GCC, are an accretion of many hand-coded hard-coded functions
to perform very specific optimisations for particular instructions for
particular processors. These have been developed over the years by
relatively ad-hoc timings of the precise assembly code the individual
who wrote the function wanted to improve.

So what's wrong with this approach?

When a new processor is introduced it needs a different set of
optimisations, which take time to develop, which means the compiler
is sub-optimal for a while for new processors.

Also, the number of processors we would like the compiler optimised
for is large if we're really trying to get the last few percent from
every processor. Modern processors aren't simple things like the old
6502 or Z80 which had definite, easily predictable, timings. Think
different cache sizes, multiple pipelines, branch prediction etc.

"The microarchitecture of Intel and AMD CPU’s: An optimization guide
for assembly programmers and compiler makers" here
http://www.agner.org/optimize/

It seems unlikely therefore that the code generation for every modern
processor variant will be hand-optimised to account for all the small
details. Even in GCC, which no doubt has a huge amount of continuous
effort applied to it by many clever people, they are never going to
hand-code every complexity of every processor.

Take a small example : integer multiplication. Looking at the table
in the conclusion of the microarchitecture PDF we see that different
x86 processors have different relative speed of adding and multiplying.
Something like this

Execution latencies, typical   AMD64   P4E    PM   Core2
integer addition                 1      1      1      1
integer multiplication           3     10      4      3

We can compute
4 * 371
faster on P4E by doing
371 + 371 + 371 + 371
but on Core2 and AMD64 we should do
4 * 371
and on PM it might depend on other factors, I don't know. Now supposing
we allow powers of two by shifting as well as integer addition, can we
immediately say the best way of multiplying two integers on all these
processors? Maybe roughly, but do we always get the absolute fastest
code? For example it will depend on the values of the integers.
If we know one of them in advance as I've assumed above, or if we
know they frequently have particular values we might handle them
first etc. This is just one small optimisation out of many. And the
optimisations might interact.

So the inevitable conclusion is that the precise assembly code
optimisations must not be hand-coded directly into the back-end of
the compiler, as I expect they are currently in most compilers, but
that they need to be expressed at a higher level "repeated addition
is equivalent to multiplication" and adapted down to particular
processors and situations based on extensive, genetic algorithm
guided, timing runs.

Anyway, that's my wild idea thrown in the pot. Might be a nice research
paper in there for someone :)

Richard.