GHC code generation micro-optimisation patch
Remi Turk
rturk at science.uva.nl
Mon Mar 3 04:18:04 EST 2008
Hi,
during the past semester I followed a seminar on the "Efficient
implementation of functional languages" by Jeroen Fokker at the
University Utrecht. During that course we worked on a feedback
directed GHC optimisation, but that got me interested in another
possible GHC backend micro-optimisation:
The short story is this:
An 8 line patch to GHC, tested with ghc 6.8.2 on nofib, ignoring all
results with a < 0.5s runtime, yields an average runtime and
compile time improvement of about 0.6%.
The worst nofib slowdown is 5%, and the best speedup 8%
Whether this is acceptable/enough for inclusion, is of course not
up to me.
The long story is that in
compiler/codeGen/CgUtils.hs:mk_switch, an extra case is added to
detect and treat specially the case analysis of 2 constructor
datatypes.[1]
Previously, case analysis of 2 constructor data types looked as
follows in C--:
_tmp = R1 & 3;
if (_tmp >= 2) goto snd_con_lbl;
With this patch, it generates the following:
if (R1 & 2 != 0) goto snd_con_lbl;
At least on x86 machines, the resulting assembly is more
interesting. The original:
movl %esi,%eax
andl $3,%eax
cmpl $2,%eax
jae .snd_con_lbl
After the patch:
testl $2,%esi
jne .snd_con_lbl
The following table contains a summary of the nofib results:
Machine Athlon32* Athlon64* Pentium4 Pentium4-dual PentiumD
---------------------------------------------------------------------------------------
Vendor AMD AMD Intel Intel Intel
Word size 32 64 32** 32** 32**
OS Linux Linux Linux Linux Linux
# CPU's 1 2 1 2 2
Mhz 950 2600 2800 2800 3000
L1 code/uops (Kb) 64 64 12 12 12
L1 data (Kb) 64 64 8 8 16
L2 (Kb) 256 1024 512 512 2048
-1 s.d. runtime (%) -1.4 -1.2 -2.4 -2.2 -5.1
+1 s.d. runtime (%) +1.1 +0.4 +1.5 +2.1 +1.3
Avg runtime (%) -0.2 -0.4 -0.4 -0.1 -1.9
Worst runtime (%) +2.5 +1.0 +3.1 +2.5 +5.0
Worst program cacheprof power simple wheel-sieve-1 integrate
Best runtime (%) -5.1 -2.9 -5.0 -5.5 -8.1
Best program parstof hidden para life cryptarithm1
-1 s.d. mutator (%) -1.5 -1.6 -3.0 -3.0 -4.8
+1 s.d. mutator (%) +1.1 +0.8 +1.5 +2.2 +1.8
Avg mutator (%) -0.2 -0.4 -0.8 -0.5 -1.6
Avg bin size (%) -0.1 -0.2 -0.1 -0.1 -0.1
Avg mod size (%) -0.3 -0.5 -0.3 -0.3 -0.3
Avg comp time (%) -0.6 -0.7 -1.1 -0.6 -0.2
---------------------------------------------------------------------------------------
Cachegrind results
-1 s.d. runtime (%) -2.0 -2.4 -3.2 -3.8 -3.3
+1 s.d. runtime (%) +0.5 +0.8 +1.2 +1.7 +0.6
Avg runtime (%) -0.7 -0.8 -1.0 -1.1 -1.3
-1 s.d. instr. (%) -4.2 -4.1 -4.3 -4.4 -4.4
+1 s.d. instr. (%) -0.6 -0.5 -0.6 -0.7 -0.7
Avg instr (%) -2.4 -2.3 -2.5 -2.5 -2.5
-1 s.d. cache misses (%) -0.6 -1.8 -0.4 -2.0 -2.8
+1 s.d. cache misses (%) +0.7 +2.2*** +0.7 +0.9 +1.1
Avg cache misses (%) +0.0 +0.2 +0.1 -0.6 -0.9
Avg comp time (%) -0.6 +0.3 -0.7 -0.4 -0.1
* On all Athlon results, but not on those of the Pentiums,
nofib-analyse gave lots of "spurious result" warnings.
On the Athlon32, nofib even failed with "output does not
match" errors, which I could not verify when running diff on
the output manually. I "fixed" the first problem by ignoring
it and the second one by changing runstdtest to pretend the
actual output always matches the expected output.
** CPU may actually be 64bits, but all software (including kernel) is 32bits
*** cacheprof here has a whopping 21.0% more cache misses.
The second biggest increase is 1.0%
The attached patch is against 6.8.2, but currently applies without problem against the HEAD.
The full nofib results can be found at
http://www.students.cs.uu.nl/~rturk/fast2case-nofib-results/
The normal runs were done with NoFibRuns = 5, and for the
cachegrind runs it was set to 2. On most machines I ran nofib
multiple times and, both for normal and for "fast2case", took the
fastest run.
Thanks to Jeroen Fokker for giving the seminar that made me think
of this in the first place, to Mark Stobbe and Eelco Lempsink for
being great guys to work with during the seminar, and to Alexey
Rodriguez for helping me keep my sanity in the midst of weird
cache- and pipeline effects. And finally to the University
Utrecht and Chris Regenboog for enabling me to run nofib on more
computers than only my own poor old 950Mhz machine.
Groeten, Remi
[1] Technically, it will work for 4 constructor datatypes on 64
bit architectures too, but who cares.
-------------- next part --------------
--- ghc-6.8.2-normal/compiler/codeGen/CgUtils.hs 2007-12-10 19:11:32.000000000 +0100
+++ ghc-6.8.2-fast2case/compiler/codeGen/CgUtils.hs 2008-02-22 14:06:14.000000000 +0100
@@ -696,6 +696,15 @@
-- the branches is the tag 0, because comparing '== 0' is likely to be
-- more efficient than other kinds of comparison.
+-- optimize 2 constructor datatype case
+mk_switch (CmmMachOp (MO_And rep) [arg, CmmLit (CmmInt tAG_MASK' rep')]) [(1, stmts1), (2, stmts2)] Nothing _ _ via_C
+ | tAG_MASK' == fromIntegral tAG_MASK = do
+ id <- forkCgStmts stmts2
+ return $ CmmComment (mkFastString "Fast2Case") `consCgStmt` (CmmCondBranch cond id `consCgStmt` stmts1)
+ where
+ tag_expr' = CmmMachOp (MO_And rep) [arg, CmmLit (CmmInt 2 rep')]
+ cond = cmmNeWord tag_expr' (CmmLit (mkIntCLit 0))
+
-- DENSE TAG RANGE: use a switch statment.
--
-- We also use a switch uncoditionally when compiling via C, because
More information about the Glasgow-haskell-users
mailing list