[Haskell-cafe] Cache miss performance costs for Haskell programs?

Tue Aug 30 15:57:08 UTC 2016

Hi Joachim,

Thanks for the reply, and really interesting work!

> In the following measurements, I avoid this problem by not measuring program execution time, but simply by counting the number of instructions performed.

Are these a count of all instructions performed at runtime? Did you
isolate a count of just the memory access instructions that ended up
fetching from main memory? Also, did you measure the clock cycle
latency averages for memory access instructions for (B) (C) and (D),
to get an indication of cache misses?

I've just measured this program with perf:

fib 0 = 0
fib 1 = 1
fib n = fib (n-1) + fib (n-2)

main = print $ fib 39

I.e.

perf stat -e task-clock,cycles,instructions,cache-references,cache-misses
 fib-exe
63245986

 Performance counter stats for 'fib-exe':

       5989.527848      task-clock (msec)         #    1.268 CPUs
utilized
    18,825,545,684      cycles                    #    3.143 GHz
    42,833,528,405      instructions              #    2.28  insns per
cycle
        70,496,109      cache-references          #   11.770 M/sec
           161,186      cache-misses              #    0.229 % of all
cache refs

       4.725101260 seconds time elapsed

For three runs, the number of cache misses were: 161186 (4.7 seconds),
80802 (4.4 seconds), and 102681 (4.5 seconds).

I'm interested in hearing about Haskell developers who've used
Valgrind, or perf, to start caring about ensuring minimising
executable size, by injecting NOINLINE pragmas or removing INLINE
pragmas.

--
Rob

On 30 August 2016 at 16:28, Joachim Breitner <mail at joachim-breitner.de> wrote:
> Hi Rob,
>
> Am Dienstag, den 30.08.2016, 16:23 +0100 schrieb Rob Stewart:
>> I'd really like to see how badly Haskell runtimes are affected as
>> cache hit rates decrease. Is anyone aware of any empirical studies, or
>> papers, or blog posts, that show examples where:
>>
>> For the same Haskell program, increasing inlining causes lower cache
>> hit rates, which slows down runtime due to costly cycles to fetch from
>> main memory more often.
>
> note quite inlinine, but I have had an experience like this when
> evaluating the benefit of the Call Arity program analysis. Here is the
> relevant excerpt from my thesis (§3.5.3, you can find the thesis at
> http://www.joachim-breitner.de/thesis/):
>
>     Initially, I attempted to use the actual run time measurements, but it
>     turned out to be a mostly pointless endeavour. For example the knights
>     benchmark would become 9% slower when enabling Call Arity (i.e. when
>     comparing (A) to (B)), a completely unexpected result, given that the
>     changes to the GHC Core code were reasonable. Further investigation
>     using performance data obtained from the CPU indicated that with the
>     changed code, the CPU’s instruction decoder was idling for more cycles,
>     hinting at cache effects and/or bad program layout.
>
>     Indeed: When I compiled the code with the compiler flag -g, which
>     includes debugging information in the resulting binary, but should oth-
>     erwise not affect the relative performance characteristics much, the un-
>     expected difference vanished. I conclude that non-local changes to the
>     Haskell or Core code will change the layout of the generated program
>     code in unpredictable ways and render such run time measurements
>     mostly meaningless.
>
>     This conclusion has been drawn before [MDHS09], and recently, tools
>     to mitigate this effect, e.g. by randomising the code layout [CB13], were
>     created. Unfortunately, these currently target specific C compilers, so I
>     could not use them here.
>
>     In the following measurements, I avoid this problem by not measuring
>     program execution time, but simply by counting the number of instruc-
>     tions performed. This way, the variability in execution time due to code
>     layout does not affect the results. To obtain the instruction counts I em-
>     ploy valgrind [NS07], which runs the benchmarks on a virtual CPU and
>     thus produces more reliable and reproducible measurements.
>
> Greetings,
> Joachim
> --
> Joachim “nomeata” Breitner
>   mail at joachim-breitner.de • https://www.joachim-breitner.de/
>   XMPP: nomeata at joachim-breitner.de • OpenPGP-Key: 0xF0FBF51F
>   Debian Developer: nomeata at debian.org
> _______________________________________________
> Haskell-Cafe mailing list
> To (un)subscribe, modify options or view archives go to:
> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
> Only members subscribed via the mailman list are allowed to post.