[Haskell-cafe] Cache miss performance costs for Haskell programs?

Tue Aug 30 16:06:34 UTC 2016

Keep in mind that OS interrupts process about 1000 times per second and 
completely invalidates cache that many times.

Greets.

On 08/30/2016 05:57 PM, Rob Stewart wrote:
> Hi Joachim,
>
> Thanks for the reply, and really interesting work!
>
>> In the following measurements, I avoid this problem by not measuring program execution time, but simply by counting the number of instructions performed.
> Are these a count of all instructions performed at runtime? Did you
> isolate a count of just the memory access instructions that ended up
> fetching from main memory? Also, did you measure the clock cycle
> latency averages for memory access instructions for (B) (C) and (D),
> to get an indication of cache misses?
>
> I've just measured this program with perf:
>
> fib 0 = 0
> fib 1 = 1
> fib n = fib (n-1) + fib (n-2)
>
> main = print $ fib 39
>
> I.e.
>
> perf stat -e task-clock,cycles,instructions,cache-references,cache-misses
>   fib-exe
> 63245986
>
>   Performance counter stats for 'fib-exe':
>
>         5989.527848      task-clock (msec)         #    1.268 CPUs
> utilized
>      18,825,545,684      cycles                    #    3.143 GHz
>      42,833,528,405      instructions              #    2.28  insns per
> cycle
>          70,496,109      cache-references          #   11.770 M/sec
>             161,186      cache-misses              #    0.229 % of all
> cache refs
>
>         4.725101260 seconds time elapsed
>
>
> For three runs, the number of cache misses were: 161186 (4.7 seconds),
> 80802 (4.4 seconds), and 102681 (4.5 seconds).
>
> I'm interested in hearing about Haskell developers who've used
> Valgrind, or perf, to start caring about ensuring minimising
> executable size, by injecting NOINLINE pragmas or removing INLINE
> pragmas.
>
> --
> Rob
>
>
>
> On 30 August 2016 at 16:28, Joachim Breitner <mail at joachim-breitner.de> wrote:
>> Hi Rob,
>>
>> Am Dienstag, den 30.08.2016, 16:23 +0100 schrieb Rob Stewart:
>>> I'd really like to see how badly Haskell runtimes are affected as
>>> cache hit rates decrease. Is anyone aware of any empirical studies, or
>>> papers, or blog posts, that show examples where:
>>>
>>> For the same Haskell program, increasing inlining causes lower cache
>>> hit rates, which slows down runtime due to costly cycles to fetch from
>>> main memory more often.
>> note quite inlinine, but I have had an experience like this when
>> evaluating the benefit of the Call Arity program analysis. Here is the
>> relevant excerpt from my thesis (§3.5.3, you can find the thesis at
>> http://www.joachim-breitner.de/thesis/):
>>
>>      Initially, I attempted to use the actual run time measurements, but it
>>      turned out to be a mostly pointless endeavour. For example the knights
>>      benchmark would become 9% slower when enabling Call Arity (i.e. when
>>      comparing (A) to (B)), a completely unexpected result, given that the
>>      changes to the GHC Core code were reasonable. Further investigation
>>      using performance data obtained from the CPU indicated that with the
>>      changed code, the CPU’s instruction decoder was idling for more cycles,
>>      hinting at cache effects and/or bad program layout.
>>
>>      Indeed: When I compiled the code with the compiler flag -g, which
>>      includes debugging information in the resulting binary, but should oth-
>>      erwise not affect the relative performance characteristics much, the un-
>>      expected difference vanished. I conclude that non-local changes to the
>>      Haskell or Core code will change the layout of the generated program
>>      code in unpredictable ways and render such run time measurements
>>      mostly meaningless.
>>
>>      This conclusion has been drawn before [MDHS09], and recently, tools
>>      to mitigate this effect, e.g. by randomising the code layout [CB13], were
>>      created. Unfortunately, these currently target specific C compilers, so I
>>      could not use them here.
>>
>>      In the following measurements, I avoid this problem by not measuring
>>      program execution time, but simply by counting the number of instruc-
>>      tions performed. This way, the variability in execution time due to code
>>      layout does not affect the results. To obtain the instruction counts I em-
>>      ploy valgrind [NS07], which runs the benchmarks on a virtual CPU and
>>      thus produces more reliable and reproducible measurements.
>>
>> Greetings,
>> Joachim
>> --
>> Joachim “nomeata” Breitner
>>    mail at joachim-breitner.de • https://www.joachim-breitner.de/
>>    XMPP: nomeata at joachim-breitner.de • OpenPGP-Key: 0xF0FBF51F
>>    Debian Developer: nomeata at debian.org
>> _______________________________________________
>> Haskell-Cafe mailing list
>> To (un)subscribe, modify options or view archives go to:
>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
>> Only members subscribed via the mailman list are allowed to post.
> _______________________________________________
> Haskell-Cafe mailing list
> To (un)subscribe, modify options or view archives go to:
> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
> Only members subscribed via the mailman list are allowed to post.