<div dir="ltr"><div>Ben, Eric, thanks for your help. A whole new world of low-level statistics opens up before me...</div><div><br></div><div>On 11 May 2017 at 18:57, Ben Gamari <<a href="mailto:ben@smart-cactus.org">ben@smart-cactus.org</a>> wrote:</div><div>> Eric Seidel <<a href="mailto:eric@seidel.io">eric@seidel.io</a>> writes:</div><div>> </div><div>> > On Thu, May 11, 2017, at 09:12, David Turner wrote:</div><div>> >> Could anyone help, e.g. by pointing me at the bit in the Core that is</div><div>> >> allocating within the main loop?</div><div>> ></div><div>> > GHC has a -ticky flag that tracks precisely where allocations are</div><div>> > happening. It's quite low level (you may have to stare at the STG in</div><div>> > addition to the Core to interpret the results) but I've found it</div><div>> > incredibly useful when trying to understand performance swings in GHC's</div><div>> > own benchmarks.</div><div>> ></div><div>> > <a href="https://ghc.haskell.org/trac/ghc/wiki/Debugging/TickyTicky">https://ghc.haskell.org/trac/ghc/wiki/Debugging/TickyTicky</a></div><div>> ></div><div>> > You can enable it per module, which is nice break from the</div><div>> > rebuild-the-world -prof, but the flip side IIRC is that if the</div><div>> > allocations are happening in a module that wasn't compiled with -ticky</div><div>> > (e.g. somewhere inside Data.Array) you might not see them.</div><div>> </div><div>> Indeed, -ticky is wonderful for this sort of thing since it doesn't</div><div>> affect Core-to-Core optimization, meaning that the instrumented program</div><div>> will behave very similarly to the uninstrumented version (just a bit</div><div>> slower due to counter overhead).</div><div>> </div><div>> I had a quick look at your example over lunch. I started like this</div><div>> (after commenting out the aeson benchmark in Main), </div><div>> </div><div>> $ ghc app/Main.hs -isrc -O2 -ddump-stg -fforce-recomp \</div><div>> -ddump-to-file -dsuppress-idinfo -ddump-simpl -rtsopts \</div><div>> -ticky -ticky-allocd</div><div>> $ app/Main +RTS -rticky.out</div><div><br></div><div><br></div><div>Ok, I was just about to ask about the Alloc'd column as mine was all zeroes, but the -ticky-allocd was what I needed, thanks.</div><div><br></div><div>> The table suggests a few nice starting points: $w$j1_slJT is allocated</div><div>> quite a bit, probably by $wgo7_slHA. Next we can turn to the STG output</div><div>> (src/Automaton.dump-stg) to try to understand why this is. Reading STG</div><div>> generally tends to be quite hard due to the rather large, deeply nested</div><div>> trees that most programs compile to.</div><div><br></div><div>You're not wrong.</div><div><br></div><div>I've trawled through the STG as best as I can for a newbie. The function with a large number in the 'Alloc' column looks pretty much to be the heart of the `step` function. There's a lot of nesting but it looks largely to be just checking the bounds on array indices, which I don't think allocates anything.</div><div><br></div><div>However I see this this little snippet at the very bottom of the nesting:</div><div><br></div><div>case</div><div> indexArray# [ds1_sl1b</div><div> y2_sl5H]</div><div>of</div><div>_ [Occ=Dead]</div><div>{ (##) ipv8_sl60 [Occ=Once!] 
I'll keep digging anyway, but it'd be good to hear of any progress at your end too. Hope it didn't spoil your lunch!

Cheers,

> While turning this over in my head I realized that this is the sort of
> program which may be helped significantly by GHC 8.2's improved join
> point handling. Indeed the timings seem to support this hypothesis:
>
> GHC 8.0.2
>
> benchmarking json-validator/Automaton/testEvent
> time                 22.84 μs   (22.76 μs .. 22.94 μs)
>                      1.000 R²   (1.000 R² .. 1.000 R²)
> mean                 22.84 μs   (22.76 μs .. 22.94 μs)
> std dev              297.4 ns   (221.8 ns .. 378.8 ns)
>
> Alloc rate    4,015,256,078,038 bytes per MUT second
>
> GHC 8.2
>
> benchmarking json-validator/Automaton/testEvent
> time                 9.221 μs   (9.141 μs .. 9.318 μs)
>                      0.998 R²   (0.996 R² .. 1.000 R²)
> mean                 9.163 μs   (9.084 μs .. 9.356 μs)
> std dev              399.8 ns   (193.0 ns .. 745.4 ns)
> variance introduced by outliers: 54% (severely inflated)
>
> Alloc rate    123,141,635 bytes per MUT second
>
>
> Wow! I suspect your allocations have now vanished.

Ooo, that's more like it.

Could you run again using the following to get Criterion's estimate of the allocations-per-call?

    json-validator-exe --regress allocated:iters +RTS -T

(I believe the `+RTS -T` is needed so Criterion can read the GC statistics behind the `allocated` measure.)
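One more thing, mostly to check I've understood what GHC 8.2's join points buy us — which I take it is Luke's "Compiling without Continuations" work. Here's a toy example of mine, nothing to do with the json-validator code: a local function that is only ever tail-called can be treated as a join point, in which case a call to it compiles to a jump and no closure is allocated for it. I'm guessing the `$w$j1_slJT` closures in the ticky output are exactly this kind of continuation, one that 8.0 ends up allocating on every iteration but 8.2 doesn't.

    -- Toy illustration (mine): 'j' is only ever tail-called, so it can stay a
    -- join point; a call to it is then a jump, not a closure allocation.
    -- If optimisation loses that property (as could happen before GHC 8.2),
    -- 'j' turns into an ordinary heap-allocated function closure.
    module JoinPointSketch (main) where

    classify :: Int -> Int -> Int
    classify x y =
      let j r = r * 2 + y          -- candidate join point
      in case compare x 0 of
           LT -> j (y - x)
           EQ -> 0
           GT -> j (x + y)

    main :: IO ()
    main = print (classify 3 4, classify (-3) 4, classify 0 4)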
Cheers,

David


On 11 May 2017 at 19:45, David Turner <dave.c.turner@gmail.com> wrote:
> On 11 May 2017 at 19:40, Ben Gamari <ben@smart-cactus.org> wrote:
>> Ccing Luke Maurer under the assumption that he will appreciate seeing
>> the fruits of his labor.
>>
>>
>> David Turner <dct25-561bs@mythic-beasts.com> writes:
>>
>>> Dear Café,
>>>
>> (snip)
>>>
>>> There's a copy of the relevant code for option 4 at
>>> https://github.com/DaveCTurner/json-validator. I've hacked around with it a
>>> bit since producing the numbers above, so it's now apparently a bit slower
>>> than Aeson but allocates less too (~65kB).
>>>
>>> Could anyone help, e.g. by pointing me at the bit in the Core that is
>>> allocating within the main loop?
>>>
>> While turning this over in my head I realized that this is the sort of
>> program which may be helped significantly by GHC 8.2's improved join
>> point handling. Indeed the timings seem to support this hypothesis:
>>
>> GHC 8.0.2
>>
>> benchmarking json-validator/Automaton/testEvent
>> time                 22.84 μs   (22.76 μs .. 22.94 μs)
>>                      1.000 R²   (1.000 R² .. 1.000 R²)
>> mean                 22.84 μs   (22.76 μs .. 22.94 μs)
>> std dev              297.4 ns   (221.8 ns .. 378.8 ns)
>>
>> Alloc rate    4,015,256,078,038 bytes per MUT second
>>
>> GHC 8.2
>>
>> benchmarking json-validator/Automaton/testEvent
>> time                 9.221 μs   (9.141 μs .. 9.318 μs)
>>                      0.998 R²   (0.996 R² .. 1.000 R²)
>> mean                 9.163 μs   (9.084 μs .. 9.356 μs)
>> std dev              399.8 ns   (193.0 ns .. 745.4 ns)
>> variance introduced by outliers: 54% (severely inflated)
>>
>> Alloc rate    123,141,635 bytes per MUT second
>>
>>
>> Wow! I suspect your allocations have now vanished.
>
> Ooo, that's more like it.
>
> Could you run again using the following to get Criterion's estimate of the
> allocations-per-call?
>
>     json-validator-exe --regress allocated:iters +RTS -T
>
> Cheers,
>
> David