[GHC] #8224: Excessive system time -- new IO manager problem?
GHC
ghc-devs at haskell.org
Wed Nov 13 15:03:02 UTC 2013
#8224: Excessive system time -- new IO manager problem?
-------------------------------------+-------------------------------------
         Reporter:  rrnewton         |             Owner:
             Type:  bug              |            Status:  new
         Priority:  high             |         Milestone:  7.8.1
        Component:  Runtime System   |           Version:  7.7
       Resolution:                   |          Keywords:  IO Manager,
                                     |                     System Time
 Operating System:  Linux            |      Architecture:  x86_64 (amd64)
  Type of failure:  Runtime          |        Difficulty:  Unknown
  performance bug                    |
        Test Case:                   |        Blocked By:
         Blocking:                   |   Related Tickets:
-------------------------------------+-------------------------------------
Comment (by AndreasVoellmy):
One interesting thing I just noticed is that `taskset` makes a big
difference. `excessive_system_2_HEAD` is the empty program compiled with
HEAD, and `excessive_system_2_ONEACTIVELOOP` is the empty program compiled
so that only one IO manager loop is started. For `excessive_system_2_HEAD`,
using `taskset` to restrict the threads to CPUs 0-9 (all on one NUMA node)
gives a big improvement:
{{{
$ time ./excessive_system_2_HEAD +RTS -N32
real 0m0.046s
user 0m0.020s
sys 0m0.188s
}}}
{{{
$ time taskset -c 0-9 ./excessive_system_2_HEAD +RTS -N32
real 0m0.025s
user 0m0.000s
sys 0m0.056s
}}}
`excessive_system_2_ONEACTIVELOOP` also performs better with `taskset`, but
the difference is much smaller: without `taskset` I see system time of
30-60ms, and with `taskset` it is about 10-20ms.
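For reference, the "empty program" in these measurements is presumably
nothing more than a `main` that returns immediately, built with the
threaded RTS. A minimal sketch (the actual test source is not attached
here, so the file and module names are illustrative):
{{{
-- ExcessiveSystem.hs -- sketch of the "empty program" used above.
-- Build with the threaded runtime so that +RTS -N32 starts 32 capabilities:
--   ghc -O2 -threaded -rtsopts ExcessiveSystem.hs -o excessive_system_2_HEAD
-- Then time it as in the measurements above:
--   time ./excessive_system_2_HEAD +RTS -N32
--   time taskset -c 0-9 ./excessive_system_2_HEAD +RTS -N32
module Main (main) where

main :: IO ()
main = return ()   -- no user work; all observed time is RTS startup/shutdown
}}}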
So maybe the explanation is this: with 31 extra safe foreign calls (the
blocking calls to `epoll_wait`) we get about 31 extra OS threads, and if we
are unlucky the OS schedules these on different NUMA nodes, so cache
traffic slows things down and a spin lock spins for longer, which shows up
as extra system time?
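For background on where the 31 extra OS threads come from: with `+RTS -N32`
the parallel IO manager starts one event loop per capability, and each loop
blocks in `epoll_wait` through a safe foreign call. A safe call releases
the capability before blocking, so every blocked loop ends up parked on its
own OS thread. A rough sketch of such a declaration, modeled loosely on
`GHC.Event.EPoll` (the names and types here are illustrative, not the exact
ones in base):
{{{
{-# LANGUAGE ForeignFunctionInterface #-}
-- Illustrative sketch only; see GHC.Event.EPoll for the real bindings.
module EPollWaitSketch where

import Foreign.C.Types (CInt(..))
import Foreign.Ptr (Ptr)

-- Opaque placeholder for 'struct epoll_event'.
data EPollEvent

-- A 'safe' import hands the capability back to the RTS before the C call,
-- so Haskell code keeps running elsewhere while this OS thread sits
-- blocked inside epoll_wait. One such blocking loop per capability means
-- -N32 leaves roughly 31 extra OS threads parked here.
foreign import ccall safe "sys/epoll.h epoll_wait"
    c_epoll_wait :: CInt -> Ptr EPollEvent -> CInt -> CInt -> IO CInt
}}}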
Here are `perf stat` runs of `excessive_system_2_HEAD`, first with
`taskset` restricting it to the first NUMA node and then with no
restriction:
{{{
$ perf_3.2.0-25 stat taskset -c 0-9 ./excessive_system_2_HEAD +RTS -N32

 Performance counter stats for 'taskset -c 0-9 ./excessive_system_2_HEAD +RTS -N32':

         25.019318 task-clock                #    0.900 CPUs utilized
               584 context-switches          #    0.023 M/sec
               218 CPU-migrations             #    0.009 M/sec
             1,248 page-faults               #    0.050 M/sec
        36,387,323 cycles                    #    1.454 GHz                    [62.17%]
        26,057,951 stalled-cycles-frontend   #   71.61% frontend cycles idle   [98.77%]
        17,556,988 stalled-cycles-backend    #   48.25% backend cycles idle
        22,358,612 instructions              #    0.61  insns per cycle
                                             #    1.17  stalled cycles per insn
         4,300,427 branches                  #  171.884 M/sec
           169,950 branch-misses             #    3.95% of all branches        [42.87%]

       0.027811151 seconds time elapsed
}}}
{{{
$ perf_3.2.0-25 stat taskset -c 0-79 ./excessive_system_2_HEAD +RTS -N32

 Performance counter stats for 'taskset -c 0-79 ./excessive_system_2_HEAD +RTS -N32':

        157.684408 task-clock                #    3.055 CPUs utilized
               588 context-switches          #    0.004 M/sec
               147 CPU-migrations             #    0.001 M/sec
             1,245 page-faults               #    0.008 M/sec
       267,826,389 cycles                    #    1.698 GHz                    [54.94%]
       227,802,954 stalled-cycles-frontend   #   85.06% frontend cycles idle   [66.06%]
       184,677,238 stalled-cycles-backend    #   68.95% backend cycles idle    [86.81%]
        70,842,584 instructions              #    0.26  insns per cycle
                                             #    3.22  stalled cycles per insn [98.87%]
        16,579,295 branches                  #  105.142 M/sec                  [96.70%]
           233,921 branch-misses             #    1.41% of all branches        [97.31%]

       0.051618204 seconds time elapsed
}}}
It is interesting that without the restriction we execute many more
instructions (roughly 71M versus 22M) and also get fewer instructions per
cycle (0.26 versus 0.61).
Also interesting: here are the top few entries from a `perf record` profile
with `taskset -c 0-9`:
{{{
  5.64%  excessive_syste  [kernel.kallsyms]        [k] update_sd_lb_stats
  5.47%  excessive_syste  [kernel.kallsyms]        [k] resched_task
  4.97%  excessive_syste  excessive_system_2_HEAD  [.] evacuate1
  4.84%  excessive_syste  [kernel.kallsyms]        [k] __ticket_spin_lock
  4.18%  excessive_syste  libc-2.15.so             [.] _int_malloc
  3.85%  excessive_syste  [kernel.kallsyms]        [k] __init_waitqueue_head
  3.81%  excessive_syste  [kernel.kallsyms]        [k] dequeue_entity
  3.48%  excessive_syste  [kernel.kallsyms]        [k] find_vma
  3.14%  excessive_syste  [kernel.kallsyms]        [k] native_write_msr_safe
  3.12%  excessive_syste  [kernel.kallsyms]        [k] source_load
  3.12%  excessive_syste  [kernel.kallsyms]        [k] idle_cpu
  2.62%  excessive_syste  [kernel.kallsyms]        [k] update_cfs_rq_blocked_load
  2.33%  excessive_syste  [kernel.kallsyms]        [k] perf_event_mmap_ctx
  2.13%  excessive_syste  [kernel.kallsyms]        [k] kmem_cache_alloc
  2.07%  excessive_syste  excessive_system_2_HEAD  [.] schedule
}}}
and with no restriction:
{{{
 43.82%  excessive_syste  [kernel.kallsyms]        [k] __ticket_spin_lock
  6.44%  excessive_syste  [kernel.kallsyms]        [k] default_send_IPI_mask_sequence_phys
  4.68%  excessive_syste  [kernel.kallsyms]        [k] native_write_msr_safe
  3.45%  excessive_syste  [kernel.kallsyms]        [k] update_sd_lb_stats
  3.31%  excessive_syste  libc-2.15.so             [.] new_heap
  2.86%  excessive_syste  [kernel.kallsyms]        [k] idle_cpu
  2.40%  excessive_syste  [kernel.kallsyms]        [k] perf_event_mmap
  1.61%  excessive_syste  [kernel.kallsyms]        [k] __schedule
  1.60%  excessive_syste  excessive_system_2_HEAD  [.] s8Bc_info
  1.56%  excessive_syste  [kernel.kallsyms]        [k] inherit_event.isra.73
  1.50%  excessive_syste  [kernel.kallsyms]        [k] __init_waitqueue_head
  1.49%  excessive_syste  libc-2.15.so             [.] __clone
  1.45%  excessive_syste  [kernel.kallsyms]        [k] clear_page_c
  1.44%  excessive_syste  [kernel.kallsyms]        [k] try_to_wake_up
}}}
We are pretty clearly spending much more time in a spin lock
(`__ticket_spin_lock`) when the program is allowed to run across all NUMA
nodes.
--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/8224#comment:17>