FW: [Haskell-cafe] The RTSOPTS "-qm" flag's impact on runtime

Tue Oct 1 09:46:14 UTC 2013

Simon: did you see this?   A factor of 50 in runtime seems pretty significant!

Simon

-----Original Message-----
From: Haskell-Cafe [mailto:haskell-cafe-bounces at haskell.org] On Behalf Of Iustin Pop
Sent: 30 September 2013 23:14
To: Haskell Cafe
Subject: [Haskell-cafe] The RTSOPTS "-qm" flag's impact on runtime

Hi all,

I found an interesting case where the RTS option -qm makes a
significant difference in runtime (~50x). This is using GHC 7.6.3 and
LLVM 3.4, with the program compiled with "-threaded -O2 -fllvm" and a
couple of language extensions. The source is at
http://benchmarksgame.alioth.debian.org/u64q/benchmark.php?test=chameneosredux&lang=ghc&id=4&data=u64q,
on the language shootout benchmarks.

Running the code without -N results (on my computer) in around 4 seconds
of runtime:
$ time ./orig 6000000
…
real    0m3.919s
user    0m3.903s
sys     0m0.010s

This is reasonably consistent. Running with -N4 (this is an 8-core
machine) gives a surprising result:

$ time ./orig 6000000 +RTS -N4
…
real    1m15.154s
user    1m38.790s
sys     2m7.947s

The cores are all used very erratically (continuously changing between
5%, 20% and 40%) and the overall CPU usage is ~27-28%. Note the
surprising 2m7s of sys time, which suggests that the kernel is involved
a lot.

Note that removing the explicit forkOn and running with -N4 results in
somewhat worse performance:

real    2m6.548s
user    2m13.470s
sys     2m3.043s
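
(A side note for anyone unfamiliar with the difference: forkOn pins a
thread to a given capability, while a thread created with forkIO is free
to be moved between capabilities by the runtime. The following is a
minimal sketch of that difference, not the benchmark's code; spawnWith
is a made-up helper name, and the thread count of 4 is only chosen to
match -N4.)

```haskell
import Control.Concurrent
  (forkIO, forkOn, getNumCapabilities, myThreadId, threadCapability)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forM)

-- Spawn n workers with the given fork function. Each worker reports
-- (capability it is running on, whether it is pinned there), as
-- returned by threadCapability.
-- NOTE: spawnWith is a hypothetical helper, not part of any library.
spawnWith :: (Int -> IO () -> IO a) -> Int -> IO [(Int, Bool)]
spawnWith fork n = do
  boxes <- forM [0 .. n - 1] $ \i -> do
    box <- newEmptyMVar
    _ <- fork i (myThreadId >>= threadCapability >>= putMVar box)
    return box
  mapM takeMVar boxes

main :: IO ()
main = do
  caps <- getNumCapabilities
  putStrLn ("capabilities: " ++ show caps)
  -- forkOn i: worker i is locked to capability (i modulo caps).
  pinned <- spawnWith forkOn 4
  putStrLn ("forkOn: " ++ show pinned)
  -- forkIO: the index is ignored and the runtime may migrate the thread.
  floating <- spawnWith (\_ act -> forkIO act) 4
  putStrLn ("forkIO: " ++ show floating)
```

Built with -threaded and run with +RTS -N4, the forkOn workers should
report themselves as pinned and the forkIO workers as not pinned.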

So in that sense the forkOn itself is not at fault. What I have found is
that -qm is a life saver here:

$ time ./orig 6000000 +RTS -N4 -qm
real    0m2.773s
user    0m5.610s
sys     0m0.123s

Adding -qa on top of -qm doesn't make a big difference. To summarise
more runs (in terms of CPU time used, user+sys):

with forkOn:
  - -N4:         228s
  - -N4 -qa:     110s
  - -N4 -qm:       6s
  - -N4 -qm -qa:   6s

without forkOn:
  - -N4:         253s
  - -N4 -qa:     252s
  - -N4 -qm:       5s
  - -N4 -qm -qa:   5s

(Note that "without forkOn" is a bit slower in terms of wall-clock time,
as the "with forkOn" version distributes the work a bit better, even if
it uses a tiny bit more CPU overall.)
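
For reference, the CPU figures are just user plus sys from the time
output. Here is a small sketch of that arithmetic for the two forkOn
runs shown in full above (parseDuration and cpuSeconds are made-up
helper names). For those two runs it comes to roughly 226.7s and 5.7s,
in line with the 228s and 6s in the summary of more runs.

```haskell
import Text.Printf (printf)

-- Parse a duration in the "XmY.YYYs" form printed by the shell's time.
-- NOTE: parseDuration is a hypothetical helper for this sketch only.
parseDuration :: String -> Maybe Double
parseDuration str =
  case break (== 'm') str of
    (minsStr, 'm' : secsStr) ->
      case (reads minsStr, reads secsStr) of
        ([(mins, "")], [(secs, "s")]) ->
          Just (fromIntegral (mins :: Int) * 60 + secs)
        _ -> Nothing
    _ -> Nothing

-- Total CPU time in seconds: user + sys.
cpuSeconds :: String -> String -> Maybe Double
cpuSeconds user sys = (+) <$> parseDuration user <*> parseDuration sys

main :: IO ()
main = do
  let plain = cpuSeconds "1m38.790s" "2m7.947s" -- -N4, with forkOn
      qm    = cpuSeconds "0m5.610s" "0m0.123s"  -- -N4 -qm, with forkOn
  case (plain, qm) of
    (Just p, Just q) ->
      printf "-N4: %.1fs, -N4 -qm: %.1fs, ratio: %.0fx\n" p q (p / q)
    _ -> putStrLn "could not parse the timings"
```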

So the question is, what does -qm actually do that it affects this
benchmark so much (~50x)? (The docs are not very clear on it)

And furthermore, could there be a heuristic inside the runtime such
that automatic thread migration is suspended if threads are
"over-migrated" (which is what I suppose happens here)?

thanks for any explanations,
iustin