[Haskell-cafe] Cache miss performance costs for Haskell programs?
Branimir Maksimovic
branimir.maksimovic at gmail.com
Tue Aug 30 16:06:34 UTC 2016
Keep in mind that OS interrupts process about 1000 times per second and
completely invalidates cache that many times.
Greets.
On 08/30/2016 05:57 PM, Rob Stewart wrote:
> Hi Joachim,
>
> Thanks for the reply, and really interesting work!
>
>> In the following measurements, I avoid this problem by not measuring program execution time, but simply by counting the number of instructions performed.
> Are these a count of all instructions performed at runtime? Did you
> isolate a count of just the memory access instructions that ended up
> fetching from main memory? Also, did you measure the clock cycle
> latency averages for memory access instructions for (B) (C) and (D),
> to get an indication of cache misses?
>
> I've just measured this program with perf:
>
> fib 0 = 0
> fib 1 = 1
> fib n = fib (n-1) + fib (n-2)
>
> main = print $ fib 39
>
> I.e.
>
> perf stat -e task-clock,cycles,instructions,cache-references,cache-misses
> fib-exe
> 63245986
>
> Performance counter stats for 'fib-exe':
>
> 5989.527848 task-clock (msec) # 1.268 CPUs
> utilized
> 18,825,545,684 cycles # 3.143 GHz
> 42,833,528,405 instructions # 2.28 insns per
> cycle
> 70,496,109 cache-references # 11.770 M/sec
> 161,186 cache-misses # 0.229 % of all
> cache refs
>
> 4.725101260 seconds time elapsed
>
>
> For three runs, the number of cache misses were: 161186 (4.7 seconds),
> 80802 (4.4 seconds), and 102681 (4.5 seconds).
>
> I'm interested in hearing about Haskell developers who've used
> Valgrind, or perf, to start caring about ensuring minimising
> executable size, by injecting NOINLINE pragmas or removing INLINE
> pragmas.
>
> --
> Rob
>
>
>
> On 30 August 2016 at 16:28, Joachim Breitner <mail at joachim-breitner.de> wrote:
>> Hi Rob,
>>
>> Am Dienstag, den 30.08.2016, 16:23 +0100 schrieb Rob Stewart:
>>> I'd really like to see how badly Haskell runtimes are affected as
>>> cache hit rates decrease. Is anyone aware of any empirical studies, or
>>> papers, or blog posts, that show examples where:
>>>
>>> For the same Haskell program, increasing inlining causes lower cache
>>> hit rates, which slows down runtime due to costly cycles to fetch from
>>> main memory more often.
>> note quite inlinine, but I have had an experience like this when
>> evaluating the benefit of the Call Arity program analysis. Here is the
>> relevant excerpt from my thesis (§3.5.3, you can find the thesis at
>> http://www.joachim-breitner.de/thesis/):
>>
>> Initially, I attempted to use the actual run time measurements, but it
>> turned out to be a mostly pointless endeavour. For example the knights
>> benchmark would become 9% slower when enabling Call Arity (i.e. when
>> comparing (A) to (B)), a completely unexpected result, given that the
>> changes to the GHC Core code were reasonable. Further investigation
>> using performance data obtained from the CPU indicated that with the
>> changed code, the CPU’s instruction decoder was idling for more cycles,
>> hinting at cache effects and/or bad program layout.
>>
>> Indeed: When I compiled the code with the compiler flag -g, which
>> includes debugging information in the resulting binary, but should oth-
>> erwise not affect the relative performance characteristics much, the un-
>> expected difference vanished. I conclude that non-local changes to the
>> Haskell or Core code will change the layout of the generated program
>> code in unpredictable ways and render such run time measurements
>> mostly meaningless.
>>
>> This conclusion has been drawn before [MDHS09], and recently, tools
>> to mitigate this effect, e.g. by randomising the code layout [CB13], were
>> created. Unfortunately, these currently target specific C compilers, so I
>> could not use them here.
>>
>> In the following measurements, I avoid this problem by not measuring
>> program execution time, but simply by counting the number of instruc-
>> tions performed. This way, the variability in execution time due to code
>> layout does not affect the results. To obtain the instruction counts I em-
>> ploy valgrind [NS07], which runs the benchmarks on a virtual CPU and
>> thus produces more reliable and reproducible measurements.
>>
>> Greetings,
>> Joachim
>> --
>> Joachim “nomeata” Breitner
>> mail at joachim-breitner.de • https://www.joachim-breitner.de/
>> XMPP: nomeata at joachim-breitner.de • OpenPGP-Key: 0xF0FBF51F
>> Debian Developer: nomeata at debian.org
>> _______________________________________________
>> Haskell-Cafe mailing list
>> To (un)subscribe, modify options or view archives go to:
>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
>> Only members subscribed via the mailman list are allowed to post.
> _______________________________________________
> Haskell-Cafe mailing list
> To (un)subscribe, modify options or view archives go to:
> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
> Only members subscribed via the mailman list are allowed to post.
More information about the Haskell-Cafe
mailing list