GHC code generation micro-optimisation patch

Mon Mar 3 04:18:04 EST 2008

Hi,

during the past semester I followed a seminar on the "Efficient
implementation of functional languages" by Jeroen Fokker at the
University Utrecht. During that course we worked on a feedback
directed GHC optimisation, but that got me interested in another
possible GHC backend micro-optimisation:

The short story is this:

An 8 line patch to GHC, tested with ghc 6.8.2 on nofib, ignoring all
results with a < 0.5s runtime, yields an average runtime and
compile time improvement of about 0.6%.
The worst nofib slowdown is 5%, and the best speedup 8%
Whether this is acceptable/enough for inclusion, is of course not
up to me.

The long story is that in

compiler/codeGen/CgUtils.hs:mk_switch, an extra case is added to
detect and treat specially the case analysis of 2 constructor
datatypes.[1]

Previously, case analysis of 2 constructor data types looked as
follows in C--:

	_tmp = R1 & 3;
	if (_tmp >= 2) goto snd_con_lbl;

With this patch, it generates the following:

	if (R1 & 2 != 0) goto snd_con_lbl;

At least on x86 machines, the resulting assembly is more
interesting. The original:

	movl %esi,%eax
	andl $3,%eax
	cmpl $2,%eax
	jae .snd_con_lbl

After the patch:

	testl $2,%esi
	jne .snd_con_lbl

The following table contains a summary of the nofib results:

Machine                  Athlon32*   Athlon64*  Pentium4   Pentium4-dual   PentiumD
---------------------------------------------------------------------------------------
Vendor                   AMD         AMD        Intel      Intel           Intel
Word size                32          64         32**       32**            32**
OS                       Linux       Linux      Linux      Linux           Linux
# CPU's                  1           2          1          2               2
Mhz                      950         2600       2800       2800            3000
L1 code/uops (Kb)        64          64         12         12              12
L1 data (Kb)             64          64         8          8               16
L2 (Kb)                  256         1024       512        512             2048

-1 s.d. runtime (%)      -1.4        -1.2       -2.4       -2.2            -5.1
+1 s.d. runtime (%)      +1.1        +0.4       +1.5       +2.1            +1.3
Avg runtime (%)          -0.2        -0.4       -0.4       -0.1            -1.9

Worst runtime (%)        +2.5        +1.0       +3.1       +2.5            +5.0
Worst program            cacheprof   power      simple     wheel-sieve-1   integrate
Best runtime (%)         -5.1        -2.9       -5.0       -5.5            -8.1
Best program             parstof     hidden     para       life            cryptarithm1

-1 s.d. mutator (%)      -1.5        -1.6       -3.0       -3.0            -4.8
+1 s.d. mutator (%)      +1.1        +0.8       +1.5       +2.2            +1.8
Avg mutator (%)          -0.2        -0.4       -0.8       -0.5            -1.6

Avg bin size (%)         -0.1        -0.2       -0.1       -0.1            -0.1
Avg mod size (%)         -0.3        -0.5       -0.3       -0.3            -0.3
Avg comp time (%)        -0.6        -0.7       -1.1       -0.6            -0.2
---------------------------------------------------------------------------------------
Cachegrind results

-1 s.d. runtime (%)      -2.0        -2.4       -3.2       -3.8            -3.3
+1 s.d. runtime (%)      +0.5        +0.8       +1.2       +1.7            +0.6
Avg runtime (%)          -0.7        -0.8       -1.0       -1.1            -1.3

-1 s.d. instr. (%)       -4.2        -4.1       -4.3       -4.4            -4.4
+1 s.d. instr. (%)       -0.6        -0.5       -0.6       -0.7            -0.7
Avg instr (%)            -2.4        -2.3       -2.5       -2.5            -2.5

-1 s.d. cache misses (%) -0.6        -1.8       -0.4       -2.0            -2.8
+1 s.d. cache misses (%) +0.7        +2.2***    +0.7       +0.9            +1.1
Avg cache misses (%)     +0.0        +0.2       +0.1       -0.6            -0.9

Avg comp time (%)        -0.6        +0.3       -0.7       -0.4            -0.1

*   On all Athlon results, but not on those of the Pentiums,
    nofib-analyse gave lots of "spurious result" warnings.
    On the Athlon32, nofib even failed with "output does not
    match" errors, which I could not verify when running diff on
    the output manually. I "fixed" the first problem by ignoring
    it and the second one by changing runstdtest to pretend the
    actual output always matches the expected output.
**  CPU may actually be 64bits, but all software (including kernel) is 32bits
*** cacheprof here has a whopping 21.0% more cache misses.
    The second biggest increase is 1.0%

The attached patch is against 6.8.2, but currently applies without problem against the HEAD.

The full nofib results can be found at
http://www.students.cs.uu.nl/~rturk/fast2case-nofib-results/

The normal runs were done with NoFibRuns = 5, and for the
cachegrind runs it was set to 2. On most machines I ran nofib
multiple times and, both for normal and for "fast2case", took the
fastest run.

Thanks to Jeroen Fokker for giving the seminar that made me think
of this in the first place, to Mark Stobbe and Eelco Lempsink for
being great guys to work with during the seminar, and to Alexey
Rodriguez for helping me keep my sanity in the midst of weird
cache- and pipeline effects. And finally to the University
Utrecht and Chris Regenboog for enabling me to run nofib on more
computers than only my own poor old 950Mhz machine.

Groeten, Remi

[1] Technically, it will work for 4 constructor datatypes on 64
    bit architectures too, but who cares.
-------------- next part --------------

--- ghc-6.8.2-normal/compiler/codeGen/CgUtils.hs	2007-12-10 19:11:32.000000000 +0100
+++ ghc-6.8.2-fast2case/compiler/codeGen/CgUtils.hs	2008-02-22 14:06:14.000000000 +0100
@@ -696,6 +696,15 @@
 -- the branches is the tag 0, because comparing '== 0' is likely to be
 -- more efficient than other kinds of comparison.
 
+-- optimize 2 constructor datatype case
+mk_switch (CmmMachOp (MO_And rep) [arg, CmmLit (CmmInt tAG_MASK' rep')]) [(1, stmts1), (2, stmts2)] Nothing _ _ via_C
+    | tAG_MASK' == fromIntegral tAG_MASK = do
+    id <- forkCgStmts stmts2
+    return $ CmmComment (mkFastString "Fast2Case") `consCgStmt` (CmmCondBranch cond id `consCgStmt` stmts1)
+    where
+        tag_expr'   = CmmMachOp (MO_And rep) [arg, CmmLit (CmmInt 2 rep')]
+        cond        = cmmNeWord tag_expr' (CmmLit (mkIntCLit 0))
+
 -- DENSE TAG RANGE: use a switch statment.
 --
 -- We also use a switch uncoditionally when compiling via C, because