[GHC] #14980: Runtime performance regression with binary operations on vectors
GHC
ghc-devs at haskell.org
Tue Mar 27 08:18:32 UTC 2018
#14980: Runtime performance regression with binary operations on vectors
-------------------------------------+-------------------------------------
Reporter: ttylec | Owner: (none)
Type: bug | Status: new
Priority: normal | Milestone:
Component: Compiler | Version: 8.2.2
Keywords: vector | Operating System: Unknown/Multiple
bitwise operations |
Architecture: | Type of failure: Runtime
Unknown/Multiple | performance bug
Test Case: | Blocked By:
Blocking: | Related Tickets:
Differential Rev(s): | Wiki Page:
-------------------------------------+-------------------------------------
Let me briefly explain our use case: we have a machine learning tools
created in Haskell. Basically it builds association rules for a database.
In plain English: predicates on data rows. We need to score them, so we
need to check how many rows are "matched" by the predicate.
In order to optimize performance, our code uses two main representations
for data: one is a "human readable", where a values are real values. The
second one is binarized representation for categorical data. The latter
has is actually a family of representation, since we pack bits into tuples
of Word64 and use bitwise operation to implement logic.
Simplified but representative code is attached.
In GHC 8.0.2 binary representation is faster by one order of magnitude
than the "human readable" one:
{{{
➜ ghc-bug stack exec performance-bug-pair-1
"Generated"
benchmarking 256 columns/raw unbox vectors
time 342.6 μs (338.3 μs .. 348.0 μs)
0.999 R² (0.998 R² .. 1.000 R²)
mean 339.3 μs (337.5 μs .. 342.0 μs)
std dev 7.203 μs (5.596 μs .. 10.07 μs)
variance introduced by outliers: 13% (moderately inflated)
benchmarking 256 columns/binary packed
time 32.07 μs (29.69 μs .. 34.14 μs)
0.982 R² (0.976 R² .. 0.997 R²)
mean 29.97 μs (29.41 μs .. 31.03 μs)
std dev 2.428 μs (1.526 μs .. 3.750 μs)
variance introduced by outliers: 78% (severely inflated)
}}}
In GHC 8.2.2 (and later) binary representation performance is similar to
"human readable", so no performance gain:
{{{
➜ ghc-bug stack exec performance-bug-pair-1
"Generated"
benchmarking 256 columns/raw unbox vectors
time 442.4 μs (406.7 μs .. 474.5 μs)
0.978 R² (0.969 R² .. 0.993 R²)
mean 399.3 μs (391.3 μs .. 415.1 μs)
std dev 34.73 μs (20.36 μs .. 53.29 μs)
variance introduced by outliers: 71% (severely inflated)
benchmarking 256 columns/binary packed
time 378.6 μs (295.8 μs .. 488.0 μs)
0.637 R² (0.492 R² .. 0.780 R²)
mean 568.1 μs (437.1 μs .. 747.6 μs)
std dev 308.7 μs (233.6 μs .. 386.1 μs)
variance introduced by outliers: 98% (severely inflated)
}}}
However, for certain compilation paths, we still can get similar speedup
as with GHC 8.0.2. In the above examples we used 4-tuple of Word64 as
binary representation. In the following code we run two tests: one on just
Word64 and one of 4-tuple of Word64. The difference is that we just add
the Word64 case:
{{{
➜ ghc-bug stack exec performance-bug-pair-2
"Generated"
benchmarking 64 columns/raw unbox vectors
time 337.6 μs (336.1 μs .. 339.3 μs)
0.999 R² (0.998 R² .. 1.000 R²)
mean 349.6 μs (344.4 μs .. 359.7 μs)
std dev 23.22 μs (15.39 μs .. 38.22 μs)
variance introduced by outliers: 60% (severely inflated)
benchmarking 64 columns/binary packed
time 21.66 μs (21.53 μs .. 21.79 μs)
1.000 R² (0.999 R² .. 1.000 R²)
mean 21.68 μs (21.53 μs .. 21.89 μs)
std dev 613.2 ns (466.0 ns .. 806.0 ns)
variance introduced by outliers: 30% (moderately inflated)
benchmarking 256 columns/raw unbox vectors
time 344.5 μs (341.6 μs .. 348.2 μs)
0.999 R² (0.999 R² .. 1.000 R²)
mean 345.1 μs (342.5 μs .. 349.3 μs)
std dev 10.66 μs (7.997 μs .. 16.34 μs)
variance introduced by outliers: 25% (moderately inflated)
benchmarking 256 columns/binary packed
time 28.04 μs (27.70 μs .. 28.46 μs)
0.999 R² (0.999 R² .. 1.000 R²)
mean 28.05 μs (27.85 μs .. 28.30 μs)
std dev 758.2 ns (628.2 ns .. 907.6 ns)
variance introduced by outliers: 27% (moderately inflated)
}}}
I made a variant of code with simpler accumulating function (in our code
we accumulate pair of integers, simplified accumulator work on one
integer). GHC 8.2.2 in that case losses speed-up with 4-tuples:
{{{
➜ ghc-bug stack exec performance-bug-2
"Generated"
benchmarking 64 columns/raw unbox vectors
time 333.8 μs (333.0 μs .. 335.1 μs)
1.000 R² (0.999 R² .. 1.000 R²)
mean 333.6 μs (332.4 μs .. 335.8 μs)
std dev 5.651 μs (3.233 μs .. 9.582 μs)
benchmarking 64 columns/binary packed
time 39.06 μs (38.98 μs .. 39.14 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 38.94 μs (38.83 μs .. 39.14 μs)
std dev 495.0 ns (310.2 ns .. 782.1 ns)
benchmarking 256 columns/raw unbox vectors
time 336.9 μs (336.2 μs .. 337.9 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 337.5 μs (336.2 μs .. 339.8 μs)
std dev 5.757 μs (2.946 μs .. 8.979 μs)
benchmarking 256 columns/binary packed
time 239.8 μs (237.6 μs .. 243.0 μs)
0.998 R² (0.996 R² .. 0.999 R²)
mean 251.4 μs (247.9 μs .. 259.8 μs)
std dev 11.50 μs (5.662 μs .. 19.79 μs)
variance introduced by outliers: 37% (moderately inflated)
}}}
In GHC 8.0.2 we have speed-up in both cases.
What may be related: using random-fu to generate random numbers seems to
produce broken code on GHC 8.2.2 (the `performance-bug-rfu.hs` source).
Short description of attached sources:
- performance-bug-pair-2.hs: using two binary representations
- performance-bug-pair-1.hs: using one binary representation (one case
commented)
- performance-bug-1.hs: using one binary representation with simplified
accumulator
- performance-bug-2.hs: using two binary representation with simplified
accumulator
- performance-bug-rfu.hs: using random-fu to generate data (optional)
- stack-8.0.yaml: stack file for GHC-8.0.2
- stack-8.2.yaml: stack file for GHC-8.2.2
- stack-nightly.yaml: stack file for GHC-8.4
- performance-bug.cabal: to be able to stack build everything.
--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/14980>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler
More information about the ghc-tickets
mailing list