[GHC] #8647: Reduce allocations in `integer-gmp`

Fri Jan 3 12:55:33 UTC 2014

#8647: Reduce allocations in `integer-gmp`
------------------------------+--------------------------------------------
       Reporter:  hvr         |             Owner:
           Type:  task        |            Status:  new
       Priority:  normal      |         Milestone:  7.8.1
      Component:  libraries   |           Version:  7.6.3
  (other)                     |  Operating System:  Unknown/Multiple
       Keywords:              |   Type of failure:  Runtime performance bug
   Architecture:  x86_64      |         Test Case:
  (amd64)                     |          Blocking:
     Difficulty:  Unknown     |
     Blocked By:              |
Related Tickets:  #8638       |
------------------------------+--------------------------------------------
 I've added `printf(3)`s to `integer-gmp`s GMP allocation/reallocation
 functions, and noticed there to a significant amount of allocations going
 on.

 This is due to the fact that the CMM wrappers call `gmpz_init()` to
 initialize the result, and this already allocates a 1-limb `StgArray`. But
 the actual GMP operations perform a reallocate-call  (which by contract
 also needs `memcpy` the untouched limb over to the newly allocated
 structure) based on the type of operation and the involved operands,
 causing the optimistically allocated 1-limb structure to become garbage
 right away w/o even being written to.

 I've been able to get rid of most such superfluous allocations by
 replacing

 {{{#!c
 ccall __gmpz_init(mp_result1 "ptr");
 }}}

 by

 {{{#!c
 MP_INT__mp_alloc(mp_result1) = 0;
 MP_INT__mp_size(mp_result1)  = 0;
 MP_INT__mp_d(mp_result1)     = 0; // (1)
 }}}

 which is handled properly by all GMP operations we make use of currently.
 This elegantly lets the very first allocation be performed within the GMP
 operation (which has better knowledge with respect to how many limbs the
 result will require)

 I've got working proof-of-concept code, which reduces the allocations the
 big-num heavy `nofib` programs (I've omitted the benchmarks which had a 0%
 allocation change):

 {{{
 --------------------------------------------------------------------------------
         Program           Size    Allocs   Runtime   Elapsed  TotalMem
 --------------------------------------------------------------------------------
 ...
      bernouilli          -0.1%     -5.2%      0.12      0.12     +0.0%
           kahan          -0.1%    -10.5%      0.17      0.17     +0.0%
       primetest          -0.1%    -12.8%      0.07      0.07     +0.0%
             rsa          -0.1%    -13.8%      0.02      0.02     +0.0%
          symalg          -0.1%     -2.3%      0.01      0.01     +0.0%
 ...
 --------------------------------------------------------------------------------
             Min          -0.1%    -13.8%     -3.1%     -3.1%     -6.5%
             Max          -0.0%     +0.4%     +4.0%     +4.0%     +1.5%
  Geometric Mean          -0.1%     -0.5%     +0.5%     +0.4%     -0.1%
 }}}

 `(1)` This should actually point to a proper static closure I didn't need
 this for the proof-of-concept

 ----

 Another place which helps avoiding temporarily allocating 1-limb
 `StgArrays` which live only for the duration of GMP FFI calls are those
 caused by the `toBig`-casts, which I'm also experimenting by making use of
 GMP's big-int/small-int add/sub/mul primitives (the deltas shown above are
 actually measured on top of this optimization), here's what to expect from
 such an optimization by itself (i.e. w/o the realloc-optimization from
 above):

 {{{
      bernouilli          +0.1%     -4.2%      0.12      0.12     +0.0%
           kahan          +0.1%    -12.6%      0.17      0.17     +0.0%
           power          +0.1%     -2.7%     +2.2%     +2.2%     +0.0%
             rsa          +0.1%     -4.1%      0.02      0.02     +0.0%
             scs          +0.1%     -2.6%     -1.6%     -1.6%     +0.0%
 }}}

 Thus, the `kahan` benchmark benefits from this optimization by -12.6%, and
 on top of that another -10.5% by also avoiding the GMP-reallocations.

--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/8647>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler