[GHC] #14980: Runtime performance regression with binary operations on vectors

Tue Mar 27 08:18:32 UTC 2018

#14980: Runtime performance regression with binary operations on vectors
-------------------------------------+-------------------------------------
           Reporter:  ttylec         |             Owner:  (none)
               Type:  bug            |            Status:  new
           Priority:  normal         |         Milestone:
          Component:  Compiler       |           Version:  8.2.2
           Keywords:  vector         |  Operating System:  Unknown/Multiple
  bitwise operations                 |
       Architecture:                 |   Type of failure:  Runtime
  Unknown/Multiple                   |  performance bug
          Test Case:                 |        Blocked By:
           Blocking:                 |   Related Tickets:
Differential Rev(s):                 |         Wiki Page:
-------------------------------------+-------------------------------------
 Let me briefly explain our use case: we have a machine learning tools
 created in Haskell. Basically it builds association rules for a database.
 In plain English: predicates on data rows. We need to score them, so we
 need to check how many rows are "matched" by the predicate.

 In order to optimize performance, our code uses two main representations
 for data: one is a "human readable", where a values are real values. The
 second one is binarized representation for categorical data. The latter
 has is actually a family of representation, since we pack bits into tuples
 of Word64 and use bitwise operation to implement logic.

 Simplified but representative code is attached.

 In GHC 8.0.2 binary representation is faster by one order of magnitude
 than the "human readable" one:

 {{{
 ➜  ghc-bug stack exec performance-bug-pair-1
 "Generated"
 benchmarking 256 columns/raw unbox vectors
 time                 342.6 μs   (338.3 μs .. 348.0 μs)
                      0.999 R²   (0.998 R² .. 1.000 R²)
 mean                 339.3 μs   (337.5 μs .. 342.0 μs)
 std dev              7.203 μs   (5.596 μs .. 10.07 μs)
 variance introduced by outliers: 13% (moderately inflated)

 benchmarking 256 columns/binary packed
 time                 32.07 μs   (29.69 μs .. 34.14 μs)
                      0.982 R²   (0.976 R² .. 0.997 R²)
 mean                 29.97 μs   (29.41 μs .. 31.03 μs)
 std dev              2.428 μs   (1.526 μs .. 3.750 μs)
 variance introduced by outliers: 78% (severely inflated)
 }}}

 In GHC 8.2.2 (and later) binary representation performance is similar to
 "human readable", so no performance gain:

 {{{
 ➜  ghc-bug stack exec performance-bug-pair-1
 "Generated"
 benchmarking 256 columns/raw unbox vectors
 time                 442.4 μs   (406.7 μs .. 474.5 μs)
                      0.978 R²   (0.969 R² .. 0.993 R²)
 mean                 399.3 μs   (391.3 μs .. 415.1 μs)
 std dev              34.73 μs   (20.36 μs .. 53.29 μs)
 variance introduced by outliers: 71% (severely inflated)

 benchmarking 256 columns/binary packed
 time                 378.6 μs   (295.8 μs .. 488.0 μs)
                      0.637 R²   (0.492 R² .. 0.780 R²)
 mean                 568.1 μs   (437.1 μs .. 747.6 μs)
 std dev              308.7 μs   (233.6 μs .. 386.1 μs)
 variance introduced by outliers: 98% (severely inflated)
 }}}

 However, for certain compilation paths, we still can get similar speedup
 as with GHC 8.0.2. In the above examples we used 4-tuple of Word64 as
 binary representation. In the following code we run two tests: one on just
 Word64 and one of 4-tuple of Word64. The difference is that we just add
 the Word64 case:

 {{{
 ➜  ghc-bug stack exec performance-bug-pair-2
 "Generated"
 benchmarking 64 columns/raw unbox vectors
 time                 337.6 μs   (336.1 μs .. 339.3 μs)
                      0.999 R²   (0.998 R² .. 1.000 R²)
 mean                 349.6 μs   (344.4 μs .. 359.7 μs)
 std dev              23.22 μs   (15.39 μs .. 38.22 μs)
 variance introduced by outliers: 60% (severely inflated)

 benchmarking 64 columns/binary packed
 time                 21.66 μs   (21.53 μs .. 21.79 μs)
                      1.000 R²   (0.999 R² .. 1.000 R²)
 mean                 21.68 μs   (21.53 μs .. 21.89 μs)
 std dev              613.2 ns   (466.0 ns .. 806.0 ns)
 variance introduced by outliers: 30% (moderately inflated)

 benchmarking 256 columns/raw unbox vectors
 time                 344.5 μs   (341.6 μs .. 348.2 μs)
                      0.999 R²   (0.999 R² .. 1.000 R²)
 mean                 345.1 μs   (342.5 μs .. 349.3 μs)
 std dev              10.66 μs   (7.997 μs .. 16.34 μs)
 variance introduced by outliers: 25% (moderately inflated)

 benchmarking 256 columns/binary packed
 time                 28.04 μs   (27.70 μs .. 28.46 μs)
                      0.999 R²   (0.999 R² .. 1.000 R²)
 mean                 28.05 μs   (27.85 μs .. 28.30 μs)
 std dev              758.2 ns   (628.2 ns .. 907.6 ns)
 variance introduced by outliers: 27% (moderately inflated)
 }}}

 I made a variant of code with simpler accumulating function (in our code
 we accumulate pair of integers, simplified accumulator work on one
 integer). GHC 8.2.2 in that case losses speed-up with 4-tuples:

 {{{
 ➜  ghc-bug stack exec performance-bug-2
 "Generated"
 benchmarking 64 columns/raw unbox vectors
 time                 333.8 μs   (333.0 μs .. 335.1 μs)
                      1.000 R²   (0.999 R² .. 1.000 R²)
 mean                 333.6 μs   (332.4 μs .. 335.8 μs)
 std dev              5.651 μs   (3.233 μs .. 9.582 μs)

 benchmarking 64 columns/binary packed
 time                 39.06 μs   (38.98 μs .. 39.14 μs)
                      1.000 R²   (1.000 R² .. 1.000 R²)
 mean                 38.94 μs   (38.83 μs .. 39.14 μs)
 std dev              495.0 ns   (310.2 ns .. 782.1 ns)

 benchmarking 256 columns/raw unbox vectors
 time                 336.9 μs   (336.2 μs .. 337.9 μs)
                      1.000 R²   (1.000 R² .. 1.000 R²)
 mean                 337.5 μs   (336.2 μs .. 339.8 μs)
 std dev              5.757 μs   (2.946 μs .. 8.979 μs)

 benchmarking 256 columns/binary packed
 time                 239.8 μs   (237.6 μs .. 243.0 μs)
                      0.998 R²   (0.996 R² .. 0.999 R²)
 mean                 251.4 μs   (247.9 μs .. 259.8 μs)
 std dev              11.50 μs   (5.662 μs .. 19.79 μs)
 variance introduced by outliers: 37% (moderately inflated)
 }}}

 In GHC 8.0.2 we have speed-up in both cases.

 What may be related: using random-fu to generate random numbers seems to
 produce broken code on GHC 8.2.2 (the `performance-bug-rfu.hs` source).

 Short description of attached sources:
 - performance-bug-pair-2.hs: using two binary representations
 - performance-bug-pair-1.hs: using one binary representation (one case
 commented)
 - performance-bug-1.hs: using one binary representation with simplified
 accumulator
 - performance-bug-2.hs: using two binary representation with simplified
 accumulator
 - performance-bug-rfu.hs: using random-fu to generate data (optional)
 - stack-8.0.yaml: stack file for GHC-8.0.2
 - stack-8.2.yaml: stack file for GHC-8.2.2
 - stack-nightly.yaml: stack file for GHC-8.4
 - performance-bug.cabal: to be able to stack build everything.

-- 
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/14980>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler