FFI, safe vs unsafe
wolfgang.thaller at gmx.net
Mon Apr 3 14:00:33 EDT 2006
Sorry for the length of this. There are three sections: the first is
about why I don't want "nonconcurrent" to be the default, the
second is about bound threads, and the third is about implementing
concurrent reentrant calls on top of state threads.
> no, state-threads, a la NSPR, state-threads.sf.net, or any other of a
> bunch of implementations.
Ah. I was thinking of old-style GHC or hugs only, where there is one
C stack and only the Haskell state is per-haskell-thread. My bad.
So now that I know of an implementation method where they don't cause
the same problems they used to cause in GHC, I am no longer opposed
to the existence of nonconcurrent reentrant imports.
To me, "nonconcurrent" is still nothing but a hint to the
implementation for improving performance; if an implementation
doesn't support concurrent reentrancy at all, that is a limitation of
that implementation.
I think that this is a real problem for libraries; library writers
will have to choose whether they preclude their library from being
used in multithreaded programs or whether they want to sacrifice
portability (unless they spend the time messing around with cpp or
something like it).
Some foreign calls are known never to take much time; those can be
annotated as nonconcurrent. For calls that might take nontrivial
amounts of time, the question whether they should be concurrent or
not *cannot be decided locally*; it depends on what other code is
running in the same program.
Maybe the default should be "as concurrent as the implementation
supports", with an optional "nonconcurrent" annotation for
performance, and an optional "concurrent" annotation to ensure an
error/warning when the implementation does not support it. Of course,
implementations would be free to provide a flag *as a non-standard
extension* that changes the behaviour of unannotated calls.
==== Bound Threads ====
In GHC, there is a small additional cost for each switch to and from
a bound thread, but no additional cost for actual foreign call-outs.
For jhc, I think you could implement a similar system where there are
multiple OS threads, one of which runs multiple state threads; this
would have you end up in pretty much the same situation as GHC, with
the added bonus of being able to implement foreign import
nonconcurrent reentrant for greater performance.
If you don't want to spend the time to implement that, then you could
go with a possibly simpler implementation involving inter-thread
messages for every foreign call from a bound thread, which would of
course be slow (that's the method I'd have recommended to hugs).
If the per-call cost is an issue, we could have an annotation that
can be used whenever the programmer knows that a foreign function
does not access thread-local storage. This annotation, the act of
calling a foreign import from a forkIO'ed (=non-bound) thread, and
the act of calling a foreign import from a Haskell implementation
that does not support bound threads, all place this proof obligation
on the programmer. Therefore I'd want it to be an explicit
annotation, not the default.
> "if an implementation supports haskell code running on multiple OS
> threads, it must support the bound threads proposal. if it does not,
> then all 'nonconcurrent' foreign calls must be made on the one true OS
*) "Haskell code running on multiple OS threads" is irrelevant. Only
the FFI allows you to observe which OS thread you are running in.
This should be worded in terms of what kind of concurrent FFI calls
are supported, or whether call-in from arbitrary OS threads is
*) Note though that this makes it *impossible* to make a concurrent
call to one of Apple's GUI libraries (both Carbon and Cocoa insist on
being called from the OS thread that runs the C main function). So
good-bye to calculating things in the background while a GUI is
waiting for user input.
We could also say that a modified form of the bound threads proposal
is actually mandatory; the implementation you have in mind would
support it with the following exceptions:
a) Foreign calls from forkIO'ed threads can read and write (a.k.a.
interfere with) the thread local state of the "main" OS thread;
people are not supposed to call functions that use thread local state
from forkIO'ed threads anyway.
b) Concurrent foreign imports might not see the appropriate thread
local state.
c) Call-ins from OS threads other than the main thread are not
allowed, therefore there is no forkOS and no runInBoundThread. (Or,
alternatively, call-ins from other OS threads create unbound threads
==== On the implementability of "concurrent reentrant" ====
>> It might not be absolutely easy to implement "concurrent reentrant",
>> but it's no harder than concurrent non-reentrant calls.
> it is much much harder. you have to deal with your haskell run-time
> being called into from an _alternate OS thread_ meaning you have to
> with the os threading primitives and locking and mutexi and in general
> pay a lot of the cost you would for a fully OS threaded
I don't follow your claim. The generated code for a foreign export
will have to
a) check a thread-local flag/the current thread id to see whether we
are being called from a non-concurrent reentrant import or from
"elsewhere". Checking a piece of thread-local state is FAST.
b) If we are "elsewhere", send an interthread message to the runtime
thread. The runtime thread will need to periodically check whether an
interthread message has arrived, and if there is no work, block
waiting for it. The fast path of checking whether something has been
posted to the message queue is fast indeed - you just have to check a
global flag. So no locking and mutexes -- sorry, I don't buy
"mutexi" ;-) -- in your regular code.
What is so hard or so inefficient about this?
Remember, for concurrent non-reentrant, you will have to deal with
inter-OS-thread messaging, too.
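Steps a) and b) can be sketched in C roughly as follows. This is only an illustration under assumptions, not the code any real Haskell implementation generates: every name here (export_stub, runtime_poll, on_runtime_thread, the one-slot mailbox) is made up, the mailbox stands in for a real message queue, and exported_body stands in for the exported Haskell function. The demo's runtime loop spins where a real runtime would block when it has no work.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical flag: set to 1 only on the runtime's own OS thread. */
static _Thread_local int on_runtime_thread = 0;

/* One-slot mailbox standing in for a real inter-thread message queue. */
typedef struct { int (*fn)(int); int arg; int result; int done; } Request;

static pthread_mutex_t box_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  box_cond = PTHREAD_COND_INITIALIZER;
static Request *pending = NULL;
static atomic_int message_posted = 0; /* global flag polled by the runtime */
static int slow_path_calls = 0;       /* touched by the runtime thread only */

/* Stand-in for the body of the exported Haskell function. */
static int exported_body(int x) { return x + 1; }

/* Runtime side: the common case is a single load of the global flag. */
static void runtime_poll(void) {
    if (!atomic_load(&message_posted)) return;
    pthread_mutex_lock(&box_lock);
    Request *r = pending;
    pending = NULL;
    atomic_store(&message_posted, 0);
    pthread_mutex_unlock(&box_lock);
    if (r == NULL) return;
    int result = r->fn(r->arg);       /* run the call on the runtime thread */
    pthread_mutex_lock(&box_lock);
    r->result = result;
    r->done = 1;
    pthread_cond_broadcast(&box_cond);
    pthread_mutex_unlock(&box_lock);
    slow_path_calls++;
}

/* What the stub for a foreign export might look like. */
int export_stub(int x) {
    if (on_runtime_thread)            /* a) called back from a foreign import */
        return exported_body(x);
    Request r = { exported_body, x, 0, 0 };
    pthread_mutex_lock(&box_lock);    /* b) called from "elsewhere" */
    while (pending != NULL)           /* wait until the slot is free */
        pthread_cond_wait(&box_cond, &box_lock);
    pending = &r;
    atomic_store(&message_posted, 1);
    while (!r.done)                   /* block until the runtime has answered */
        pthread_cond_wait(&box_cond, &box_lock);
    pthread_mutex_unlock(&box_lock);
    return r.result;
}

static void *foreign_thread(void *out) {
    *(int *)out = export_stub(41);    /* call-in from another OS thread */
    return NULL;
}

/* Demo: the calling thread plays the role of the runtime thread. */
int demo_call_in(void) {
    on_runtime_thread = 1;
    int fast = export_stub(1);        /* fast path, no messaging */
    int slow = 0;
    pthread_t t;
    int rc = pthread_create(&t, NULL, foreign_thread, &slow);
    assert(rc == 0);
    (void)rc;
    while (slow_path_calls == 0) {    /* a real runtime would block here */
        runtime_poll();
        sched_yield();
    }
    pthread_join(t, NULL);
    return fast + slow;
}
```

The mutex and condition variable are touched only on the slow path. The regular code on the runtime thread pays for one atomic load per poll, and a call-in from the runtime thread itself pays for one thread-local read.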
About how fast thread-local state really is:
__thread attribute on Linux: ~ 2 memory load instructions.
__declspec(thread) in MSVC++ on Windows: about the same.
pthread_getspecific on Mac OS X/x86 and Mac OS X/G5: ~10 instructions
pthread_getspecific on Linux and TlsGetValue on Windows: ~10-20 instructions.
pthread_getspecific on Mac OS X/G4: a system call :-(.
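For reference, the two flavours of thread-local access in that list look roughly like this in C. It is a minimal sketch: the function names are made up for illustration, and C11 _Thread_local is used as the standard spelling of what __thread provides as a compiler extension. The code makes no claim about instruction counts; it only shows that both mechanisms give each OS thread its own copy of the flag.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Compiler-level TLS: C11 _Thread_local, comparable to __thread. */
static _Thread_local int tls_flag = 0;

/* Library-level TLS: a pthread key, read with pthread_getspecific. */
static pthread_key_t flag_key;
static pthread_once_t key_once = PTHREAD_ONCE_INIT;
static int marker;                    /* its address is the "set" value */

static void make_key(void) {
    int rc = pthread_key_create(&flag_key, NULL);
    assert(rc == 0);
    (void)rc;
}

/* Hypothetical helper: mark the calling thread as the runtime thread. */
void mark_runtime_thread(void) {
    pthread_once(&key_once, make_key);
    tls_flag = 1;
    pthread_setspecific(flag_key, &marker);
}

int is_runtime_thread_fast(void) { return tls_flag; }

int is_runtime_thread_portable(void) {
    pthread_once(&key_once, make_key);
    return pthread_getspecific(flag_key) != NULL;
}

static void *other_thread(void *out) {
    int *res = out;
    res[0] = is_runtime_thread_fast();     /* a fresh thread sees 0 */
    res[1] = is_runtime_thread_portable(); /* and a NULL key value */
    return NULL;
}

/* Returns 1 if both checks say "yes" here and "no" on another thread. */
int demo_tls(void) {
    int there[2] = { -1, -1 };
    mark_runtime_thread();
    int here_fast = is_runtime_thread_fast();
    int here_portable = is_runtime_thread_portable();
    pthread_t t;
    int rc = pthread_create(&t, NULL, other_thread, there);
    assert(rc == 0);
    (void)rc;
    pthread_join(t, NULL);
    return here_fast == 1 && here_portable == 1
        && there[0] == 0 && there[1] == 0;
}
```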
Also, to just check whether you can use the fast-path call-in, you
could optimise things by just checking whether the stack pointer is
in the expected range for the runtime OS thread (fast case), or not
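A stack-range check of that kind might look like the following C sketch. It rests on several assumptions that are mine, not part of any existing runtime: the stack grows downward, the runtime thread uses less than 1 MiB of stack below the recorded anchor, and converting pointers to integers yields comparable addresses. The names and the two constants are made up for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Assumed bounds; a real runtime would know its actual stack size. */
#define STACK_SPAN  ((uintptr_t)1 << 20) /* room below the anchor */
#define STACK_SLACK ((uintptr_t)1 << 12) /* tolerance above the anchor */

static uintptr_t stack_lo, stack_hi;

/* Call once on the runtime thread with the address of a local variable
   that lives near the top of that thread's stack. */
void record_runtime_stack(const void *anchor) {
    uintptr_t a = (uintptr_t)anchor;
    stack_hi = a + STACK_SLACK;
    stack_lo = a - STACK_SPAN;
}

/* Approximates the stack pointer with the address of a local and tests
   whether it lies in the recorded range: no TLS lookup, two compares. */
int on_runtime_stack(void) {
    volatile char probe = 0;
    uintptr_t sp = (uintptr_t)&probe;
    return sp >= stack_lo && sp < stack_hi;
}

static void *probe_thread(void *out) {
    *(int *)out = on_runtime_stack(); /* runs on a different stack */
    return NULL;
}

/* Returns 1 if the check succeeds here and fails on another OS thread. */
int demo_stack_check(void) {
    int anchor = 0;
    record_runtime_stack(&anchor);
    int here = on_runtime_stack();
    int there = 1;
    pthread_t t;
    int rc = pthread_create(&t, NULL, probe_thread, &there);
    assert(rc == 0);
    (void)rc;
    pthread_join(t, NULL);
    return here == 1 && there == 0;
}
```

A failed range check only means "not the fast case"; the stub would then fall back to the thread-local check or to the message path.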
All in all, I can't see a good excuse to not implement foreign import
concurrent reentrant when you've already implemented concurrent
non-reentrant.