[GHC] #14610: newtype wrapping of a monadic stack kills performance

GHC ghc-devs at haskell.org
Sun Dec 24 13:43:21 UTC 2017


#14610: newtype wrapping of a monadic stack kills performance
-------------------------------------+-------------------------------------
        Reporter:  mrkkrp            |                Owner:  (none)
            Type:  bug               |               Status:  new
        Priority:  normal            |            Milestone:
       Component:  Compiler          |              Version:  8.2.2
      Resolution:                    |             Keywords:
Operating System:  Unknown/Multiple  |         Architecture:
                                     |  Unknown/Multiple
 Type of failure:  None/Unknown      |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:                    |  Differential Rev(s):
       Wiki Page:                    |
-------------------------------------+-------------------------------------
Description changed by mrkkrp:

Old description:

> I have a project where, in one module (A), I decided to build something
> like a minimal framework for another, bigger module (B) to use. One of
> the main points of the framework is that the monadic stacks are hidden
> behind `newtype`s, and in those monads you can only use the functions
> that module A provides.
>
> The module A can be found here:
>
> https://github.com/mrkkrp/mmark/blob/master/Text/MMark/Parser/Internal.hs
>
> There are two monadic stacks wrapped in newtypes: `BParser` and
> `IParser`.
>
> The module B is this:
>
> https://github.com/mrkkrp/mmark/blob/master/Text/MMark/Parser.hs
>
> But it's not really relevant.
>
> Now if I change `newtype`s to `type` synonyms like so:
>
> {{{
> type BParser a = ParsecT MMarkErr Text (State BlockState) a
> type IParser a = StateT InlineState (Parsec MMarkErr Text) a
> }}}
>
> and make the corresponding minor corrections, I get roughly half the
> allocations and almost 2x faster code (before):
>
> {{{
> Case                                           Allocated  GCs     Max
> with file: data/bench-yaml-block.md              119,080    0  11,088
> with file: data/bench-thematic-break.md           74,368    0  10,224
> with file: data/bench-heading.md                 901,928    0   9,432
> with file: data/bench-fenced-code-block.md       145,744    0   9,368
> with file: data/bench-indented-code-block.md     124,312    0   9,368
> with file: data/bench-unordered-list.md        2,010,496    1  10,784
> with file: data/bench-ordered-list.md          2,025,016    1  10,728
> with file: data/bench-blockquote.md            1,961,672    1  42,648
> with file: data/bench-paragraph.md             2,084,104    1  42,648
> with file: data/comprehensive.md              25,899,496   24  44,200
>
> benchmarking with file: data/comprehensive.md
> time                 3.590 ms   (3.578 ms .. 3.601 ms)
>                      1.000 R²   (1.000 R² .. 1.000 R²)
> mean                 3.555 ms   (3.546 ms .. 3.565 ms)
> std dev              31.07 μs   (24.63 μs .. 39.90 μs)
> }}}
>
> After:
>
> {{{
> Case                                           Allocated  GCs     Max
> with file: data/bench-yaml-block.md              116,864    0  11,088
> with file: data/bench-thematic-break.md           64,776    0  10,392
> with file: data/bench-heading.md                 615,672    0   9,432
> with file: data/bench-fenced-code-block.md       144,736    0   9,368
> with file: data/bench-indented-code-block.md     123,352    0   9,368
> with file: data/bench-unordered-list.md          795,072    0  41,568
> with file: data/bench-ordered-list.md            808,808    0  41,512
> with file: data/bench-blockquote.md              826,392    0  41,568
> with file: data/bench-paragraph.md               881,432    0  41,568
> with file: data/comprehensive.md              10,945,104   10  44,440
>
> benchmarking with file: data/comprehensive.md
> time                 2.451 ms   (2.448 ms .. 2.456 ms)
>                      1.000 R²   (1.000 R² .. 1.000 R²)
> mean                 2.432 ms   (2.427 ms .. 2.437 ms)
> std dev              15.86 μs   (13.16 μs .. 19.01 μs)
> }}}
>
> I'm not great at reading non-trivial Core, but I gave it a shot and
> dumped some. One thing I noticed is that the Core for module A is the
> same in both cases, so the problem must be an inter-module one, probably
> similar to what happens when you don't expose definitions in interface
> files with `INLINEABLE` and end up without specialization. Here it
> perhaps has to do with the fact that B has no information about the
> internals of the monadic stacks from A, but I'm not sure.
>
> Here are the beginnings of the Core dumps (before):
>
> {{{
> ==================== Tidy Core ====================
> 2017-12-23 04:55:17.967944505 UTC
>
> Result size of Tidy Core
>   = {terms: 210,169,
>      types: 209,675,
>      coercions: 82,492,
>      joins: 973/3,753}
> }}}
>
> After:
>
> {{{
> ==================== Tidy Core ====================
> 2017-12-23 04:58:46.386265108 UTC
>
> Result size of Tidy Core
>   = {terms: 301,256,
>      types: 263,104,
>      coercions: 28,560,
>      joins: 1,726/5,393}
> }}}
>
> So it looks like `newtype` wrapping reduces GHC's ability to create the
> join points SPJ talked about recently.
>
> I do understand that you expect a minimal reproducing example, but I
> could not produce one even though I spent several hours trying. I
> started by creating a small project, defining a similar stack with a
> `newtype` wrapper, and using it in a simple parser in another module.
> Then I benchmarked the parser. There was no difference between the
> `newtype`d code and the code that uses plain type synonyms. I tried
> different variations; no difference.
>
> The Core output is too big for me to analyze (about 25 MB and 33 MB),
> and I have no idea what's going on in there.
>
> You're welcome to experiment with the repo; it has benchmarks for
> memory usage and criterion benchmarks for execution speed:
>
> https://github.com/mrkkrp/mmark
>
> If you like, I can create a separate branch with the newtypes replaced
> by type synonyms, so you can compare the two approaches more easily.
>
> I'm submitting this because my friend convinced me that it's better to
> let you know (even though I could not create a minimal reproducing
> example on my own) than to let it go completely unnoticed.

New description:

 I have a project where, in one module (A), I decided to build something
 like a minimal framework for another, bigger module (B) to use. One of
 the main points of the framework is that the monadic stacks are hidden
 behind `newtype`s, and in those monads you can only use the functions
 that module A provides.

 The module A can be found here:

 https://github.com/mrkkrp/mmark/blob/ghc-bug-newtypes/Text/MMark/Parser/Internal.hs

 There are two monadic stacks wrapped in newtypes: `BParser` and
 `IParser`.
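
 For reference, the wrappers look roughly like this (a sketch reconstructed
 from the type synonyms shown further down; the field names are mine and the
 instances are assumed to come from GeneralizedNewtypeDeriving):

 {{{
 -- Sketch only, not a verbatim copy of Parser/Internal.hs.
 newtype BParser a = BParser
   { unBParser :: ParsecT MMarkErr Text (State BlockState) a }
   deriving (Functor, Applicative, Monad)

 newtype IParser a = IParser
   { unIParser :: StateT InlineState (Parsec MMarkErr Text) a }
   deriving (Functor, Applicative, Monad)
 }}}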

 The module B is this:

 https://github.com/mrkkrp/mmark/blob/ghc-bug-newtypes/Text/MMark/Parser.hs

 But it's not really relevant.

 Now if I change `newtype`s to `type` synonyms like so:

 {{{
 type BParser a = ParsecT MMarkErr Text (State BlockState) a
 type IParser a = StateT InlineState (Parsec MMarkErr Text) a
 }}}

 and make the corresponding minor corrections, I get roughly half the
 allocations and almost 2x faster code (before):

 {{{
 Case                                           Allocated  GCs     Max
 with file: data/bench-yaml-block.md              119,080    0  11,088
 with file: data/bench-thematic-break.md           74,368    0  10,224
 with file: data/bench-heading.md                 901,928    0   9,432
 with file: data/bench-fenced-code-block.md       145,744    0   9,368
 with file: data/bench-indented-code-block.md     124,312    0   9,368
 with file: data/bench-unordered-list.md        2,010,496    1  10,784
 with file: data/bench-ordered-list.md          2,025,016    1  10,728
 with file: data/bench-blockquote.md            1,961,672    1  42,648
 with file: data/bench-paragraph.md             2,084,104    1  42,648
 with file: data/comprehensive.md              25,899,496   24  44,200

 benchmarking with file: data/comprehensive.md
 time                 3.590 ms   (3.578 ms .. 3.601 ms)
                      1.000 R²   (1.000 R² .. 1.000 R²)
 mean                 3.555 ms   (3.546 ms .. 3.565 ms)
 std dev              31.07 μs   (24.63 μs .. 39.90 μs)
 }}}

 After:

 {{{
 Case                                           Allocated  GCs     Max
 with file: data/bench-yaml-block.md              116,864    0  11,088
 with file: data/bench-thematic-break.md           64,776    0  10,392
 with file: data/bench-heading.md                 615,672    0   9,432
 with file: data/bench-fenced-code-block.md       144,736    0   9,368
 with file: data/bench-indented-code-block.md     123,352    0   9,368
 with file: data/bench-unordered-list.md          795,072    0  41,568
 with file: data/bench-ordered-list.md            808,808    0  41,512
 with file: data/bench-blockquote.md              826,392    0  41,568
 with file: data/bench-paragraph.md               881,432    0  41,568
 with file: data/comprehensive.md              10,945,104   10  44,440

 benchmarking with file: data/comprehensive.md
 time                 2.451 ms   (2.448 ms .. 2.456 ms)
                      1.000 R²   (1.000 R² .. 1.000 R²)
 mean                 2.432 ms   (2.427 ms .. 2.437 ms)
 std dev              15.86 μs   (13.16 μs .. 19.01 μs)
 }}}

 I'm not great at reading non-trivial Core, but I gave it a shot and
 dumped some. One thing I noticed is that the Core for module A is the
 same in both cases, so the problem must be an inter-module one, probably
 similar to what happens when you don't expose definitions in interface
 files with `INLINEABLE` and end up without specialization. Here it
 perhaps has to do with the fact that B has no information about the
 internals of the monadic stacks from A, but I'm not sure.
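
 To illustrate what I mean by the `INLINEABLE` analogy (a generic sketch,
 not code from the repo): without the pragma, the unfolding of an
 overloaded function may not end up in the interface file, so call sites
 in other modules cannot be specialised to their concrete types:

 {{{
 -- Hypothetical module A: an overloaded helper using only the Prelude.
 sumLengths :: (Foldable t, Num n) => t [a] -> n
 sumLengths = foldr (\xs acc -> acc + fromIntegral (length xs)) 0
 -- The pragma puts the unfolding into the .hi file, so a hypothetical
 -- module B calling sumLengths at concrete types can get a specialised
 -- copy instead of the dictionary-passing version.
 {-# INLINEABLE sumLengths #-}
 }}}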

 Here are the beginnings of the Core dumps (before):

 {{{
 ==================== Tidy Core ====================
 2017-12-23 04:55:17.967944505 UTC

 Result size of Tidy Core
   = {terms: 210,169,
      types: 209,675,
      coercions: 82,492,
      joins: 973/3,753}
 }}}

 After:

 {{{
 ==================== Tidy Core ====================
 2017-12-23 04:58:46.386265108 UTC

 Result size of Tidy Core
   = {terms: 301,256,
      types: 263,104,
      coercions: 28,560,
      joins: 1,726/5,393}
 }}}

 So it looks like `newtype` wrapping reduces GHC's ability to create the
 join points SPJ talked about recently.
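
 For context, this is the sort of code I understand join points to be
 about (a toy example, nothing to do with mmark): a local function that
 is only ever tail-called can be compiled to a jump instead of a
 heap-allocated closure:

 {{{
 -- Toy example only.
 classify :: Int -> Int
 classify x =
   let done n = n * 2 + 1   -- only tail-called, so GHC can turn it into
   in if x > 0              -- a join point instead of a real closure
        then done x
        else done (negate x)
 }}}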

 I do understand that you expect a minimal reproducing example, but I
 could not produce one even though I spent several hours trying. I
 started by creating a small project, defining a similar stack with a
 `newtype` wrapper and using it in a simple parser in another module (see
 the rough shape below). Then I benchmarked the parser. There was no
 difference between the `newtype`d code and the code that uses plain type
 synonyms. I tried different variations; no difference.
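
 The reproducer I tried had roughly this shape (hypothetical names, heavily
 simplified, and of course not the real mmark code):

 {{{
 {-# LANGUAGE GeneralizedNewtypeDeriving #-}

 -- Hypothetical module A of the attempted reproducer.
 import Control.Monad.State.Strict (StateT)
 import Data.Text (Text)
 import Data.Void (Void)
 import Text.Megaparsec (Parsec)

 newtype P a = P { unP :: StateT Int (Parsec Void Text) a }
   deriving (Functor, Applicative, Monad)

 -- A hypothetical module B then builds a small parser in P and benchmarks
 -- it against the same parser written with
 --   type P a = StateT Int (Parsec Void Text) a
 }}}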

 The Core output is too big for me to analyze (about 25 MB and 33 MB),
 and I have no idea what's going on in there.

 You're welcome to experiment with the repo; it has benchmarks for
 memory usage and criterion benchmarks for execution speed:

 https://github.com/mrkkrp/mmark

 I have created two branches that I won't touch, one with newtypes and
 the other with type synonyms:

 * https://github.com/mrkkrp/mmark/tree/ghc-bug-newtypes
 * https://github.com/mrkkrp/mmark/tree/ghc-bug-type-synonyms

 Just check out one of these and run `stack bench`.

 This is the commit that changes newtypes to type synonyms:

 https://github.com/mrkkrp/mmark/commit/759d8d4aa52dd57a393299c63e8c9b70d0d43290

 I'm submitting this because my friend convinced me that it's better to let
 you know (even though I could not create a minimal reproducing example on
 my own) than to let it go completely unnoticed.

--

-- 
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/14610#comment:1>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler

