[GHC] #11075: Confusing parallel spark behaviour with safe FFI calls
GHC
ghc-devs at haskell.org
Tue Nov 10 01:19:08 UTC 2015
#11075: Confusing parallel spark behaviour with safe FFI calls
-------------------------------------+-------------------------------------
         Reporter:  duncan           |              Owner:
             Type:  bug              |             Status:  new
         Priority:  normal           |          Milestone:
        Component:  Runtime System   |            Version:  7.10.2
         Keywords:                   |   Operating System:  Unknown/Multiple
     Architecture:  Unknown/Multiple |    Type of failure:  None/Unknown
        Test Case:                   |         Blocked By:
         Blocking:                   |    Related Tickets:
  Differential Rev(s):               |          Wiki Page:
-------------------------------------+-------------------------------------
This is a tricky problem, and I don't see any easy solution, but here it
is for the sake of documenting and tracking...
When we create a bunch of 'par' sparks, e.g. via a strategy, the expected
behaviour is that we get one thread per core that will churn through those
sparks, one by one.
The one-by-one part is important: each core/cap/HEC is only actively
evaluating one of these sparks at a time. It's not as if we had a thread
per spark all evaluating in parallel; the parallelism is limited to the
core count (well, the capability count).
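As a concrete (hypothetical) sketch of the expected behaviour, here is one
spark per list element created with `par` from base's GHC.Conc; compiled
with -threaded and run with +RTS -N8, the expectation is that the eight
capabilities churn through the sparks one at a time:

```haskell
import GHC.Conc (par, pseq)

-- Hypothetical example: spark one unit of work per list element.
-- Each idle capability steals a spark, evaluates it to completion,
-- and only then steals the next one.
parSquareSum :: [Int] -> Int
parSquareSum = go . map square
  where
    square n = n * n            -- stand-in for the real per-spark work
    go []     = 0
    go (y:ys) = let rest = go ys
                in y `par` (rest `pseq` y + rest)
```

Note that the program is still correct when run single-threaded: an
unconverted spark is simply evaluated in place when its value is demanded.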
However, if each of these sparks makes a safe FFI call then the one-by-one
behaviour is broken. Instead we get an OS thread per spark all running
concurrently.
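A minimal sketch of the problematic pattern, using libm's sin as a
stand-in for an expensive, CPU-bound foreign function (any safe call would
do):

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
import GHC.Conc (par, pseq)

-- Because this import is 'safe', the capability is released for the
-- duration of each call; the scheduler then sees the cap as idle,
-- converts the next spark, and can end up with one OS thread per spark.
foreign import ccall safe "math.h sin" c_sin :: Double -> Double

sparkedCalls :: [Double] -> Double
sparkedCalls = foldr step 0
  where
    step x acc = let y = c_sin x        -- forcing y performs the safe call
                 in y `par` (acc `pseq` y + acc)
```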
If one looks at the mechanisms in detail one can see why this happens, but
the overall effect can be disastrous for performance. If we have a small
number of cores e.g. 8 and a large number of sparks e.g. 1024, then
spawning 1024 OS threads each of which has plenty of work to do is not a
recipe for good performance. The OS will share the time fairly which means
each thread will get terrible CPU cache utilisation. Think of something
like big numerical calculations where the cache utilisation is critical:
running 100s of OS threads per core doing numerical calculations will be
much worse than running one thread that churns through the calculations
sequentially.
So why does this happen? Well, when a cap is idle the cap's scheduler
checks if it has any sparks in its spark pool, or in other caps' spark
pools. If there are sparks available it grabs one and starts a Haskell
thread to run it. Now the Haskell thread makes a safe FFI call. That means
the Haskell thread is put to sleep while the foreign call runs in a new OS
thread. But this now means that the scheduler thinks that this cap is idle
again (there are no runnable Haskell threads), so it grabs another spark.
And so another OS thread gets made to run the FFI call. And again and
again, until we've converted all the sparks and forked off loads of OS
threads to run these FFI calls.
The problem is that the scheduler cannot tell the difference between a
safe FFI call that is blocking and taking no CPU time (like waiting on
network IO) vs a safe FFI call that is not blocking and is fully utilising
the CPU (like big numerical calculations). In the latter case we would
really like to say that the cap is not really idle, so we don't need to
create more work to do. In the former case it's important to create more
work to do. But there's no easy way to tell the difference between these
two cases, so this is a hard problem to fix.
If a programmer knows that this is what is going on then they can work
around the problem by being very careful with the number of sparks they
create (losing some of the benefits of sparks), or they can make all
the expensive FFI calls unsafe (which can be tricky when using reusable
FFI binding libraries).
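For illustration, the second workaround is just a change of import
annotation. Both declarations below bind the same libm function; the
difference is purely in how the RTS treats the call:

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}

-- The 'safe' variant releases the capability during the call, which
-- triggers the spark behaviour described above. The 'unsafe' variant
-- keeps the capability busy, so the scheduler won't convert another
-- spark in the meantime -- but 'unsafe' is only sound if the call never
-- blocks for long and never calls back into Haskell.
foreign import ccall safe   "math.h sin" sinSafe   :: Double -> Double
foreign import ccall unsafe "math.h sin" sinUnsafe :: Double -> Double
```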
But in practice most GHC users will never figure out that this is what is
happening (though it might plausibly be spotted by studying the
ThreadScope view very carefully).
So that's the conundrum: a problem that users will find hard to identify,
and no easy automatic solution. Ho hum.
--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/11075>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler