Summary of containers patches

Thu Sep 23 20:22:51 EDT 2010

Hi Simon, others,

here comes the results.

I consider 4 variants of containers:
inline: the current fox.auryn.cz/darcs/containers with many INLINE pragmas
nothing: all INLINE pragmas deleted
inlinable: all INLINE pragmas replaced with INLINABLE
mix: all INLINE pragmas replaced with INLINABLE, but lookup*, member*
   and insert* functions are kept INLINE.

At first code bloat. Writing size of the binaries in bytes, all binaries
are stripped, I386.

* map-properties - binary running test on Data.Map, uses many functions
  from Data.Map
  - inline: 5266740
  - nothing: 2325940
  - inlinable: 2348116
  - mix: 2571156
  That is 2.9MB increase in size from nothing to inline.

* Map.hs - benchmark of Data.Map, using several functions only
  - inline: 24465452
  - nothing: 24302348
  - inlinable: 24326924
  - mix: 24417100
  160kB only from nothing to inline.

Another code bloat. This time I compiled
ghc-7.0: GHC itself with containers containing INLINE pragmas
  20723280B
ghc-7.0-noinlinemap: GHC itself with containers without INLINE pragmas
  19965464B
That is 740kB difference between ghc-7.0 binaries.
(There is 1082kB difference between libHSghc-7.0.0.a libraries.)

Another code bloat. This time Map.hs benchmark again. All packages
needed for the benchmark (most notably criterion) are compiled:
- against containers package contains INLINE pragmas
  24465452B
- against containers package without INLINE pragmas
  23753808B
That is 700kB difference _on one binary_.

So the code bloat is considerable.

Now the timing results. The results are the change in runtime in %.
[1] speedup between inline and noinline (negative numbers means slowdown)
[2] speedup between inline and inlinable
[3] speedup between inline and mix
(the results are += several %, I did not have time to do it better)

                            [1]        [2]        [3]
alter                       -50.33     -49.17     -47.51
delete                      -46.67     -50.93     -48.79
difference                  1.93       0.18       6.33
insert                      -44.11     -44.25     -3.09
insertLookupWithKey empty   -14.62     -15.70     1.22
insertLookupWithKey update  -21.54     -19.23     6.20
insertLookupWithKey' empty  -16.79     -23.67     -2.82
insertLookupWithKey' update -26.62     -26.99     0.47
insertWith empty            -44.42     -48.29     -5.88
insertWith update           -43.28     -45.03     2.50
insertWith' empty           -45.30     -46.66     -3.45
insertWith' update          -38.15     -38.27     3.94
insertWithKey empty         -42.60     -45.58     -6.99
insertWithKey update        -40.13     -36.62     -1.71
insertWithKey' empty        -46.54     -46.53     -4.22
insertWithKey' update       -43.10     -42.16     0.98
intersection                1.42       -1.75      0.37
lookup                      -94.03     -94.62     6.14
lookupIndex                 -86.25     -86.92     4.91
union                       -0.04      -0.87      2.12
update                      -50.64     -55.27     -54.17
updateLookupWithKey         -43.90     -43.31     -49.22

So, the INLINABLE does not help at all.

I in fact expected this outcome -- it is not inlining itself that is
beneficial, it is the _specialization_ (which happens during inlining if
possible).

Please bear in mind that the speedup caused by specialization is
significant only in the case when the Ord.compare function is trivial
(if it is an instruction like <=, the result between instruction call
and a dynamic dictionary call is very big). That means things like Map
Int, Map Word, Map Int8 and so on. But with these types one usually uses
IntMap. What I am saying is that the usage of Map which would benefit
from specialization may not be very common.

At this point I suggest the following:
- as immediate solution, remove all INLINE from the containers package,
  just leave INLINE in lookup*, member* and insert* functions. This
  causes small code bloat and quite big wins on these functions, which
  I consider to be very frequent.

  I could imagine to remove also INLINE for insert* functions, or to add
  INLINE for delete*, update* and alter*.

- in the long term, find a way to get back the speedup without the code
  bloat. The problem is that _every_ call to a lookup has a different
  code in the binary. We need to share the code of specialized functions
  (like the c++ templates do). I mean the following:
  - before linking, find out which dictionaries are passed to
    Data.Map.lookup method. Specialize Data.Map.lookup for these types
    (only once for each type) and change the calls to lookup with
    specific dictionaries.
  The effect should be as if user said SPECIALIZE, for each method
  only for types it is non-generically being used for.

- I would not bother with INLINABLE currently.

Simon, others, do you agree with my short-term solution? If you do,
I will push the patches (I need to do it before I fly to ICFP, which
will be in 30 hours).

Cheers,
Milan

>> as you probably know, there are several containers patches dealing with
>> performance, separate testsuite and benchmarking.
>>
>> Specifically, I mean tickets
>> http://hackage.haskell.org/trac/ghc/ticket/4277
>> http://hackage.haskell.org/trac/ghc/ticket/4279
>> http://hackage.haskell.org/trac/ghc/ticket/4280
>> http://hackage.haskell.org/trac/ghc/ticket/4311
>> http://hackage.haskell.org/trac/ghc/ticket/4312
>> http://hackage.haskell.org/trac/ghc/ticket/4333
>>
>> I have all of them in the following repository
>> http://fox.auryn.cz/darcs/containers/
>>
>> There are no user-visible changes, just performance and the split
>> testsuite.
>>
>> There were no comments in some time, so I would like to push them to
>> containers.
>>
>> There is just the code-bloat issue to solve. The excessive use of INLINE
>> pragma causes a large code bloat. I am aware of it, it happened even
>> when I was benchmarking the containers.
>>
>> Personally I vote for:
>> - changing INLINE to INLINABLE
>> - leave INLINE just in
>>    - member/lookup functions
>>    - insert* functions
>>    as these benefit from specialization a lot and do not cause such
>>    a bloat (measured during the internship, do not recall the numbers
>>    now).
>>
>> I suspect the performance to drop a bit, but I think it is better to
>> have a bit slower containers without such a code bloat.
>>
>> Simon, I will be flying to IFCP soon, but if you want, I can prepare the
>> INLINE->INLINABLE patches and put them in the repo.
>
> That would help a lot, thanks.  I started on this today but I'm a bit  
> overloaded, and getting criterion installed wasn't very smooth (a couple  
> of package breakages, but nothing serious and patches are upstream).
>
> Cheers,
> 	Simon
>