FFI, safe vs unsafe

Mon Apr 3 21:55:19 EDT 2006

On Mon, Apr 03, 2006 at 02:00:33PM -0400, Wolfgang Thaller wrote:
> Sorry for the length of this. There are three sections: the first is  
> about how I don't like for "nonconcurrent" to be the default, the  
> second is about bound threads and the third is about implementing  
> concurrent reentrant on top of state threads.
> 
> >no, state-threads, a la NSPR, state-threads.sf.net, or any other of a
> >bunch of implementations.
> 
> Ah. I was thinking of old-style GHC or hugs only, where there is one  
> C stack and only the Haskell state is per-haskell-thread. My bad.
> So now that I know of an implementation method where they don't cause  
> the same problems they used to cause in GHC, I am no longer opposed  
> to the existance of nonconcurrent reentrant imports.

Any particular reason hugs and GHC didn't use the state-threads approach
out of curiosity? did it conflict with the push-enter model?  (jhc uses
the eval-apply model so I am more familier with that)

> To me, "nonconcurrent" is still nothing but a hint to the  
> implementation for improving performance; if an implementation  
> doesn't support concurrent reentrancy at all, that is a limitation of  
> the implementation.

It also implys that a function call will run on the same OS thread as
the OS thread the current haskell thread is running on. however, we need
not codify this behavior as some future implementations might not follow
it or have a well defined meaning for 'OS thread the current haskell
thread is running on' (GHC already doesn't when bound threads arn't used
I am led to believe?)

> Maybe the default should be "as concurrent as the implementation  
> supports", with an optional "nonconcurrent" annotation for  
> performance, and an optional "concurrent" annotation to ensure an  
> error/warning when the implementation does not support it. Of course,  
> implementations would be free to provide a flag *as a non-standard  
> extension* that changes the behaviour of unannotated calls.

A concurrent hint would be okay. I have a preference for including an
explicit annotation in that case. In fact, I'd support it if all
properties were explicitly annotated and we didn't allow a blank
specification.

My only real 'must-have' is that the 4 modes all can be explicitly and
unambiguously specified. I have opinions on the syntax/hints but that is
more flexable.

Another nice minor thing would be if haskell implementations were
required to ignore annotations starting with 'x-' for implementation
specific hints.

In jhc there are no such thing as compiler primitives*, there are only
FFI imports and a couple primitive imports that don't translate to code
(seq,unsafeCoerce). this means that things like 'log' and 'sin' and
every basic operation goes through the FFI mechanism so it needs to be
_fast_ _fast_. A neat side effect is that jhcs implementation of the
prelude is mostly portable to different compilers.

* almost true, for historical reasons I hope to fix there are a few
  built in numeric operators.

> ==== Bound Threads ====
> 
> In GHC, there is a small additional cost for each switch to and from  
> a bound thread, but no additional cost for actual foreign call-outs.
> For jhc, I think you could implement a similar system where there are  
> multiple OS threads, one of which runs multiple state threads; this  
> would have you end up in pretty much the same situation as GHC, with  
> the added bonus of being able to implement foreign import  
> nonconcurrent reentrant for greater performance.
> If you don't want to spend the time to implement that, then you could  
> go with a possibly simpler implementation involving inter-thread  
> messages for every foreign call from a bound thread, which would of  
> course be slow (that's the method I'd have recommended to hugs).

I am not quite sure whether you are saying something different from what
I plan for jhc or not, my current thinking for jhc is,

the single one true OS thread for all haskell threads in an EDSM loop
(epoll based). the EDSM loop has its own stack (and is, in fact, just
another haskell thread as the scheduler is implemented in haskell), each
haskell thread has its own stack.

non concurrent calls are just C 'call's. nothing more.

concurrent nonreentrant calls are desugared to haskell code that

FFI calls 'socketpair(2)'
pokes arguments into structure
FFI calls 'pthread_create'
  pthread_create is passed a function that unpacks the args, calls the
  foreign function, stores the result and writes one byte to one end of the socketpair.
calls 'Concurrent.IO.threadWaitRead' on the other end of the socket pair.
peeks the return value

(in practice, socketpair will be a much faster 'futex' on linux and OS
threads may or may not be cached)

very low overhead, probably the minimal possible for an arbitrary
unknown FFI call.

An alternate mode I'd like to experiment with one day is the complete
oposite side of the spectrum:

one OS thread per haskell thread, no guarentees about duplicated work
between threads.

However, the extra overhead will likely make this mode uncommon except
in certain specialized circumstances.

> If the per-call cost is an issue, we could have an annotation that  
> can be used whenever the programmer knows that a foreign function  
> does not access thread-local storage. This annotation, the act of  
> calling a foreign import from a forkIO'ed (=non-bound) thread, and  
> the act of calling a foreign import from a Haskell implementation  
> that does not support bound threads, all place this proof obligation  
> on the programmer. Therefore I'd want it to be an explicit  
> annotation, not the default.

well, the cost of bound threads is not the cost of the call itself, it
is that they must be serialized. foreign concurrent calls can run
concurrently to haskell code and each other. but 'bound' foreign calls
must wait for their dedicated thread to become available. I think there
needs to be an annotation as to which functions require boundness so
suddenly all foreign calls arn't serialized just because they are in a
'forkOS'ed thread.

in any case, the cost would end up on the foreign call in a cooperative
system most likely as it will have to find and wait for the specific
thread rather than just using the current one or creating a new one.

> We could also say that a modified form of the bound threads proposal  
> is actually mandatory; the implementation you have in mind would  
> support it with the following exceptions:
> 
> a) Foreign calls from forkIO'ed threads can read and write (a.k.a.  
> interfere with) the thread local state of the "main" OS thread;  
> people are not supposed to call functions that use thread local state  
> from forkIO'ed threads anyway.
> 
> b) Concurrent foreign imports might not see the appropriate thread  
> local state.
> 
> c) Call-ins from OS threads other than the main thread are not  
> allowed, therefore there is no forkOS and no runInBoundThread. (Or,  
> alternatively, call-ins from other OS threads create unbound threads  
> instead).

I'll have to think about these rules some. We will need to say something
about bound threads.

> ==== On the implementability of "concurrent reentrant" ====
> 
> >>It might not be absolutely easy to implement "concurrent reentrant",
> >>but it's no harder than concurrent non-reentrant calls.
> >
> >it is much much harder. you have to deal with your haskell run-time
> >being called into from an _alternate OS thread_ meaning you have to  
> >deal
> >with the os threading primitives and locking and mutexi and in general
> >pay a lot of the cost you would for a fully OS threaded  
> >implementation.
> 
> I don't follow your claim. The generated code for a foreign export  
> will have to
> a) check a thread-local flag/the current thread id to see whether we  
> are being called from a non-concurrent reentrant import or from  
> "elsewhere". Checking a piece of thread-local state is FAST.
> b) If we are "elsewhere", send an interthread message to the runtime  
> thread. The runtime thread will need to periodically check whether an  
> interthread message has arrived, and if there is no work, block  
> waiting for it. The fast path of checking whether something has been  
> posted to the message queue is fast indeed - you just have to check a  
> global flag.
I'd integrate it into the EDSM loop somehow (futex maybe) as I have a
moral adversion for periodic checking of anything. 

the main thing is that it is a cost paid by every foreign export.
perhaps a flag saying "this will only be called nonconcurrently on
exports" though, perhaps that can be an x-flag if other compilers can't
take advantage of it.

In my survey of when 'reentrant concurrent' was needed, I looked at all
the standard libraries and didn't find anywhere it was actually needed.
Are there some compelling examples of when it is really needed in a
setting that doesn't have OS threads to begin with? (I am not asserting
they don't exist, I just want to see some example uses of this feature
to get a better handle on the implementation cost)

> Remember, for concurrent non-reentrant, you will have to deal with  
> inter-OS-thread messaging, too.

just a wait on a pipe. which is exactly what the EDSM loop does anyway
so nothing threadspecific at all other than the call to pthread_create.

> About how fast thread-local state really is:
> __thread attribute on Linux: ~ 2 memory load instructions.
> __declspec(thread) in MSVC++ on Windows: about the same.
> pthread_getspecific on Mac OS X/x86 and Mac OS X/G5: ~10 instructions
> pthread_getspecific on Linux and TlsGetValue on Windows: ~10-20  
> instructions
> pthread_getspecific on Mac OS X/G4: a system call :-(.

how prevelant is support for __thread BTW? is it required by any
standards or an ELFism?

> Also, to just check whether you can use the fast-path call-in, you  
> could optimise things by just checking whether the stack pointer is  
> in the expected range for the runtime OS thread (fast case), or not  
> (slow case).
neat trick.

> All in all, I can't see a good excuse to not implement foreign import  
> concurrent reentrant when you've already implemented concurrent  
> nonreentrant.

I am beginning to agree perhaps. I still definitly want them to be declared
separately in the FFI declaration though. I will hapily trade some FFI
verbosity for greater future-proofness when desiging a language standard
that is meant to last.

        John

-- 
John Meacham - ⑆repetae.net⑆john⑈