FFI, safe vs unsafe

Mon Apr 3 14:00:33 EDT 2006

Sorry for the length of this. There are three sections: the first is
about why I don't like "nonconcurrent" being the default, the
second is about bound threads, and the third is about implementing
concurrent reentrant on top of state threads.

> no, state-threads, a la NSPR, state-threads.sf.net, or any other of a
> bunch of implementations.

Ah. I was thinking of old-style GHC or Hugs only, where there is one
C stack and only the Haskell state is per-Haskell-thread. My bad.
So now that I know of an implementation method where they don't cause  
the same problems they used to cause in GHC, I am no longer opposed  
to the existence of nonconcurrent reentrant imports.

To me, "nonconcurrent" is still nothing but a hint to the  
implementation for improving performance; if an implementation  
doesn't support concurrent reentrancy at all, that is a limitation of  
the implementation.
I think that this is a real problem for libraries; library writers  
will have to choose whether they preclude their library from being  
used in multithreaded programs or whether they want to sacrifice  
portability (unless they spend the time messing around with cpp or  
something like it).

Some foreign calls are known never to take much time; those can be  
annotated as nonconcurrent. For calls that might take nontrivial  
amounts of time, the question whether they should be concurrent or  
not *cannot be decided locally*; it depends on what other code is  
running in the same program.

Maybe the default should be "as concurrent as the implementation  
supports", with an optional "nonconcurrent" annotation for  
performance, and an optional "concurrent" annotation to ensure an  
error/warning when the implementation does not support it. Of course,  
implementations would be free to provide a flag *as a non-standard  
extension* that changes the behaviour of unannotated calls.

==== Bound Threads ====

In GHC, there is a small additional cost for each switch to and from  
a bound thread, but no additional cost for actual foreign call-outs.
For jhc, I think you could implement a similar system where there are  
multiple OS threads, one of which runs multiple state threads; this  
would have you end up in pretty much the same situation as GHC, with  
the added bonus of being able to implement foreign import  
nonconcurrent reentrant for greater performance.
If you don't want to spend the time to implement that, then you could  
go with a possibly simpler implementation involving inter-thread  
messages for every foreign call from a bound thread, which would of  
course be slow (that's the method I'd have recommended for Hugs).

If the per-call cost is an issue, we could have an annotation that  
can be used whenever the programmer knows that a foreign function  
does not access thread-local storage. This annotation, the act of  
calling a foreign import from a forkIO'ed (=non-bound) thread, and  
the act of calling a foreign import from a Haskell implementation  
that does not support bound threads, all place this proof obligation  
on the programmer. Therefore I'd want it to be an explicit  
annotation, not the default.

> "if an implementation supports haskell code running on multiple OS
> threads, it must support the bound threads proposal. if it does not,
> then all 'nonconcurrent' foreign calls must be made on the one true OS
> thread"

*) "Haskell code running on multiple OS threads" is irrelevant. Only  
the FFI allows you to observe which OS thread you are running in.  
This should be worded in terms of what kind of concurrent FFI calls  
are supported, or whether call-in from arbitrary OS threads is  
supported.

*) Note though that this makes it *impossible* to make a concurrent  
call to one of Apple's GUI libraries (both Carbon and Cocoa insist on  
being called from the OS thread that runs the C main function). So  
good-bye to calculating things in the background while a GUI is  
waiting for user input.

We could also say that a modified form of the bound threads proposal  
is actually mandatory; the implementation you have in mind would  
support it with the following exceptions:

a) Foreign calls from forkIO'ed threads can read and write (a.k.a.  
interfere with) the thread local state of the "main" OS thread;  
people are not supposed to call functions that use thread local state  
from forkIO'ed threads anyway.

b) Concurrent foreign imports might not see the appropriate thread  
local state.

c) Call-ins from OS threads other than the main thread are not  
allowed, therefore there is no forkOS and no runInBoundThread. (Or,  
alternatively, call-ins from other OS threads create unbound threads  
instead).

==== On the implementability of "concurrent reentrant" ====

>> It might not be absolutely easy to implement "concurrent reentrant",
>> but it's no harder than concurrent non-reentrant calls.
>
> it is much much harder. you have to deal with your haskell run-time
> being called into from an _alternate OS thread_ meaning you have to
> deal with the os threading primitives and locking and mutexi and in
> general pay a lot of the cost you would for a fully OS threaded
> implementation.

I don't follow your claim. The generated code for a foreign export
will have to:
a) Check a thread-local flag/the current thread id to see whether we
are being called from a nonconcurrent reentrant import or from
"elsewhere". Checking a piece of thread-local state is FAST.
b) If we are "elsewhere", send an interthread message to the runtime  
thread. The runtime thread will need to periodically check whether an  
interthread message has arrived, and if there is no work, block  
waiting for it. The fast path of checking whether something has been  
posted to the message queue is fast indeed - you just have to check a  
global flag. So no locking and mutexes -- sorry, I don't buy  
"mutexi" ;-) -- in your regular code.
What is so hard or so inefficient about this?

Remember, for concurrent non-reentrant, you will have to deal with  
inter-OS-thread messaging, too.
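
Steps a) and b) above can be sketched in C roughly as follows. This is
only an illustration under stated assumptions, not code from GHC, jhc or
Hugs: every name in it (call_in, runtime_poll, the one-slot mailbox) is
hypothetical, exported functions are simplified to int -> int, and it
uses C11 atomics plus POSIX threads. The mutex and condition variable are
touched only on the slow path and when the runtime is idle; the regular
check in runtime_poll is a single atomic load of a global flag.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical names throughout; not taken from GHC, jhc or Hugs. */
typedef int (*export_fn)(int);

/* Step a): true only on the OS thread that runs the Haskell runtime. */
static _Thread_local bool on_runtime_thread = false;

/* One-slot mailbox for step b). The lock is taken on the slow path only. */
static pthread_mutex_t box_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  box_cond = PTHREAD_COND_INITIALIZER;
static atomic_bool message_pending = false;   /* the global flag */
static bool box_busy = false, box_done = false, shutting_down = false;
static export_fn box_fn;
static int box_arg, box_result;

/* Runtime thread, at its regular check points: one atomic load if idle. */
void runtime_poll(void) {
    if (!atomic_load_explicit(&message_pending, memory_order_acquire))
        return;
    pthread_mutex_lock(&box_lock);
    box_result = box_fn(box_arg);             /* run the exported function */
    box_done = true;
    atomic_store(&message_pending, false);
    pthread_cond_broadcast(&box_cond);
    pthread_mutex_unlock(&box_lock);
}

/* What a generated foreign export stub might do. */
int call_in(export_fn fn, int arg) {
    if (on_runtime_thread)
        return fn(arg);                       /* fast path, no locking */
    pthread_mutex_lock(&box_lock);
    while (box_busy)                          /* wait for a free slot */
        pthread_cond_wait(&box_cond, &box_lock);
    box_busy = true; box_done = false;
    box_fn = fn; box_arg = arg;
    atomic_store(&message_pending, true);
    pthread_cond_broadcast(&box_cond);        /* wake an idle runtime */
    while (!box_done)
        pthread_cond_wait(&box_cond, &box_lock);
    int result = box_result;
    box_busy = false;
    pthread_cond_broadcast(&box_cond);
    pthread_mutex_unlock(&box_lock);
    return result;
}

/* Stand-in for the runtime's main loop: poll, then block when idle. */
static void *runtime_loop(void *unused) {
    (void)unused;
    on_runtime_thread = true;
    for (;;) {
        runtime_poll();
        pthread_mutex_lock(&box_lock);
        while (!atomic_load(&message_pending) && !shutting_down)
            pthread_cond_wait(&box_cond, &box_lock);
        bool quit = shutting_down && !atomic_load(&message_pending);
        pthread_mutex_unlock(&box_lock);
        if (quit) break;
    }
    return NULL;
}

static int twice(int x) { return 2 * x; }

/* Calls in from a thread that is not the runtime thread (slow path). */
int demo_call_in(void) {
    pthread_t rt;
    shutting_down = false;
    if (pthread_create(&rt, NULL, runtime_loop, NULL) != 0) return -1;
    int r = call_in(twice, 21);
    pthread_mutex_lock(&box_lock);
    shutting_down = true;
    pthread_cond_broadcast(&box_cond);
    pthread_mutex_unlock(&box_lock);
    pthread_join(rt, NULL);
    return r;
}
```

In this sketch the exported function always runs on the runtime thread,
whichever OS thread made the call; a real runtime would create a Haskell
thread for the call rather than run it inline with the lock held.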

About how fast thread-local state really is:
__thread attribute on Linux: ~ 2 memory load instructions.
__declspec(thread) in MSVC++ on Windows: about the same.
pthread_getspecific on Mac OS X/x86 and Mac OS X/G5: ~10 instructions
pthread_getspecific on Linux and TlsGetValue on Windows: ~10-20  
instructions
pthread_getspecific on Mac OS X/G4: a system call :-(.
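
For reference, the two families of API in the list above look like this
in C. This is a minimal sketch and makes no timing claims: the names
(set_thread_id, get_id_fast, get_id_portable) are made up for
illustration, _Thread_local is the C11 spelling of the __thread
attribute, and the key-based variant uses only standard POSIX calls.

```c
#include <pthread.h>
#include <stdint.h>

/* Compiler-supported TLS: an ordinary variable with one copy per thread. */
static _Thread_local int tls_slot;

/* Library TLS: a key created once, with a per-thread value behind it. */
static pthread_key_t id_key;
static pthread_once_t id_key_once = PTHREAD_ONCE_INIT;
static void make_key(void) { pthread_key_create(&id_key, NULL); }

/* Store the same (hypothetical) thread id through both mechanisms. */
void set_thread_id(int id) {
    pthread_once(&id_key_once, make_key);
    tls_slot = id;
    pthread_setspecific(id_key, (void *)(intptr_t)id);
}

/* Reads the per-thread variable directly. */
int get_id_fast(void) { return tls_slot; }

/* Goes through a library call on every read. */
int get_id_portable(void) {
    pthread_once(&id_key_once, make_key);
    return (int)(intptr_t)pthread_getspecific(id_key);
}

static void *worker(void *out) {
    int *seen = out;
    set_thread_id(2);
    seen[0] = get_id_fast();
    seen[1] = get_id_portable();
    return NULL;
}

/* Returns 1 if each thread sees only its own value through both APIs. */
int demo_tls(void) {
    int seen[2] = {0, 0};
    pthread_t t;
    set_thread_id(1);
    if (pthread_create(&t, NULL, worker, seen) != 0) return -1;
    pthread_join(t, NULL);
    return seen[0] == 2 && seen[1] == 2 &&
           get_id_fast() == 1 && get_id_portable() == 1;
}
```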

Also, to just check whether you can use the fast-path call-in, you  
could optimise things by just checking whether the stack pointer is  
in the expected range for the runtime OS thread (fast case), or not  
(slow case).
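
A rough C sketch of that stack-pointer test follows. It rests on several
assumptions that are not part of the proposal: the runtime OS thread is
given an explicitly allocated stack via pthread_attr_setstack, so its
bounds are known exactly; the address of a local variable stands in for
the stack pointer; and the names, the 1 MiB size and the 16 KiB alignment
are arbitrary choices for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed: the runtime thread runs on this explicitly allocated stack. */
#define RT_STACK_SIZE (1u << 20)
static _Alignas(16384) unsigned char rt_stack[RT_STACK_SIZE];

/* Fast test: is the current frame inside the runtime thread's stack?
   The address of a local approximates the stack pointer. */
bool on_runtime_stack(void) {
    unsigned char probe;
    uintptr_t sp = (uintptr_t)&probe;
    uintptr_t lo = (uintptr_t)rt_stack;
    return sp >= lo && sp - lo < RT_STACK_SIZE;
}

static void *runtime_thread(void *out) {
    *(bool *)out = on_runtime_stack();        /* expected: inside */
    return NULL;
}

/* Returns 1 if the check says "fast case" on the runtime thread and
   "slow case" on the calling thread, -1 if thread setup fails. */
int demo_stack_check(void) {
    pthread_attr_t attr;
    pthread_t rt;
    bool inside = false;
    pthread_attr_init(&attr);
    if (pthread_attr_setstack(&attr, rt_stack, RT_STACK_SIZE) != 0)
        return -1;
    if (pthread_create(&rt, &attr, runtime_thread, &inside) != 0)
        return -1;
    pthread_join(rt, NULL);
    pthread_attr_destroy(&attr);
    bool outside = on_runtime_stack();        /* caller's own stack */
    return inside && !outside;
}
```

The comparison needs no thread-local lookup at all, which is the point of
the optimisation; it only works if the runtime knows its own stack
bounds, and here that is arranged by allocating the stack up front.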

All in all, I can't see a good excuse to not implement foreign import  
concurrent reentrant when you've already implemented concurrent  
nonreentrant.

Cheers,

Wolfgang