[GHC] #7602: Threaded RTS performing badly on recent OS X (10.8?)
GHC
cvs-ghc at haskell.org
Sun Feb 10 04:07:34 CET 2013
#7602: Threaded RTS performing badly on recent OS X (10.8?)
---------------------------------+------------------------------------------
Reporter: simonmar | Owner:
Type: bug | Status: new
Priority: normal | Milestone: _|_
Component: Runtime System | Version: 7.6.1
Keywords: | Os: Unknown/Multiple
Architecture: Unknown/Multiple | Failure: None/Unknown
Difficulty: Unknown | Testcase:
Blockedby: | Blocking:
Related: |
---------------------------------+------------------------------------------
Comment(by thoughtpolice):
Alright, I think my patch is almost working, but in the meantime I've
verified with a small snippet the behavior I think we want. Simon, can you
please tell me if this approach would be OK?
Essentially, the OS X C library reserves a small set of predefined TLS
keys for various Apple-internal uses. There are about 100 of these special
keys. With them, it's possible to use special inline variants of
```pthread_getspecific``` and ```pthread_setspecific``` that read and write
directly at a fixed offset from the ```%gs``` segment base.
Performance-wise, this should be very close to Linux's implementation.
One of these internal users on modern OS X is WebKit: pthread has a
specific range of keys (5, to be exact) dedicated to it. These are meant for
JavaScriptCore's FastMalloc allocator in performance-critical sections
- likely for their GC! But WebKit only uses a single one of these keys, and
I can find no other references to them anywhere on the internet.
You can see this here:
http://www.opensource.apple.com/source/Libc/Libc-825.25/pthreads/pthread_machdep.h
This defines the inline get/set routines for special TLS keys. If you
scroll down a little you can see the ```JavaScriptCore``` keys (keys 90-94
to be exact.)
Now, look here:
http://code.google.com/codesearch#mcaWan7Aaio/trunk/WebKit-r115846/Source/WTF/wtf/FastMalloc.cpp&q=__PTK_FRAMEWORK_JAVASCRIPTCORE_KEY0&type=cs&l=453
And you can see there are special stubbed-out ```pthread_getspecific```
and ```pthread_setspecific``` routines for this exact purpose.
Therefore, I propose we steal one of the high TLS keys that are dedicated
to WebKit's JS engine for the GC. Unfortunately, ```pthread_machdep.h``` is
not installed by default in modern versions of Xcode, so we must inline
the definitions ourselves for the necessary architectures.
The following example demonstrates the use of these special keys:
{{{
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
/** Snipped from pthread_machdep.h */
#define __PTK_FRAMEWORK_JAVASCRIPTCORE_KEY4 94

/* Read the TLS slot directly, relative to the %gs segment base */
__inline__ static void *
_pthread_getspecific_direct(unsigned long slot) {
    void* ret;
#if defined(__i386__) || defined(__x86_64__)
    __asm__("mov %%gs:%1, %0" : "=r" (ret) : "m" (*(void **)(slot * sizeof(void *))));
#else
#error "No definition of pthread_getspecific_direct!"
#endif
    return ret;
}

/* To be used with static constant keys only */
__inline__ static int
_pthread_setspecific_direct(unsigned long slot, void * val)
{
#if defined(__x86_64__)
    /* PIC is free and cannot be disabled, even with: gcc -mdynamic-no-pic ... */
    __asm__("movq %1,%%gs:%0" : "=m" (*(void **)(slot * sizeof(void *))) : "rn" (val));
#else
#error "No definition of pthread_setspecific_direct!"
#endif
    return 0;
}
/** End snippets */

static const pthread_key_t fooKey = __PTK_FRAMEWORK_JAVASCRIPTCORE_KEY4;

/* Go through long so the pointer/int conversions are explicit on 64-bit */
#define GET_FOO() ((int)(long)(_pthread_getspecific_direct(fooKey)))
#define SET_FOO(to) (_pthread_setspecific_direct(fooKey, to))

int main(int ac, char* av[]) {
    if (ac < 2) SET_FOO((void*)10);
    else SET_FOO((void*)(long)atoi(av[1]));
    printf("foo = %d\n", GET_FOO());
    return 0;
}
}}}
This is pretty close to what the GC does now. And compiling:
{{{
$ clang -O3 tls2.c
$ lldb ./a.out
Current executable set to './a.out' (x86_64).
(lldb) disassemble -m -n main
a.out`main
a.out[0x100000ef0]: pushq %rbp
a.out[0x100000ef1]: movq %rsp, %rbp
a.out[0x100000ef4]: cmpl $1, %edi
a.out[0x100000ef7]: jg 0x100000f08 ; main + 24
a.out[0x100000ef9]: movq $10, %gs:752
a.out[0x100000f06]: jmp 0x100000f1d ; main + 45
a.out[0x100000f08]: movq 8(%rsi), %rdi
a.out[0x100000f0c]: callq 0x100000f38 ; symbol stub for: atoi
a.out[0x100000f11]: movslq %eax, %rax
a.out[0x100000f14]: movq %rax, %gs:752
a.out[0x100000f1d]: movq %gs:752, %rsi
a.out[0x100000f26]: leaq 59(%rip), %rdi ; "foo = %d\n"
a.out[0x100000f2d]: xorb %al, %al
a.out[0x100000f2f]: callq 0x100000f3e ; symbol stub for: printf
a.out[0x100000f34]: xorl %eax, %eax
a.out[0x100000f36]: popq %rbp
a.out[0x100000f37]: ret
(lldb) r
Process 67488 launched: './a.out' (x86_64)
foo = 10
Process 67488 exited with status = 0 (0x00000000)
(lldb) ^D
$
}}}
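The ```%gs:752``` operand in the listing above follows from the slot
arithmetic in the inline routines: the slot number is multiplied by the
pointer size to get a byte offset from the ```%gs``` base. A minimal
sketch of that calculation, where the ```JSC_KEY0```/```JSC_KEY4``` macros
and ```tls_slot_offset``` are illustrative names of mine and not anything
from ```pthread_machdep.h```:

```c
#include <stddef.h>

/* Slot numbers as quoted in this comment: JavaScriptCore keys 90-94.
 * The macro names are illustrative, not taken from pthread_machdep.h. */
#define JSC_KEY0 90UL
#define JSC_KEY4 94UL

/* Byte offset of a TLS slot from the %gs base: each slot holds one
 * pointer, so the offset is slot * sizeof(void *). */
static size_t tls_slot_offset(unsigned long slot)
{
    return (size_t)slot * sizeof(void *);
}
```

With 8-byte pointers on x86_64, ```tls_slot_offset(JSC_KEY4)``` is
94 * 8 = 752, which is the displacement clang emitted.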
This will probably only work on modern versions of Xcode and OS X (10.8
etc.). Older libcs have very different implementations of
```_pthread_setspecific_direct```, which means this could be very wrong on
older machines. I'm not sure how much older, so if we had any 10.7 users
who could try this, that would be awesome. The build system will need
modifications to check for that and fall back to the much slower routines
otherwise, I suppose.
Simon, does this approach sound OK? I think it will recover the
performance loss here and we can just go ahead and use Clang, which is the
easiest for everybody I think.
--
Ticket URL: <http://hackage.haskell.org/trac/ghc/ticket/7602#comment:13>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler