About Haskell Thread Model

Mon Oct 13 10:28:24 EDT 2003

William Lee Irwin III <wli at holomorphy.com> asks about true SMP
>> cmpxchg and then taking a blocking lock sounds like the 2-tier locking
>> supported with Linux' new futex system calls. I wonder how they chose
>> to block in the older GHC RTS.

On Mon, Oct 13, 2003 at 12:02:52PM -0400, Jan-Willem Maessen wrote:
> This invariably proves tricky, even more so now that GHC effectively
> uses a many-to-one threading model (and where it may therefore be
> wrong to simply block the OS-level thread).  In pH, which isn't quite
> haskell but has the same syntax and runs on multiple processors with a
> shared heap, we kludged: we simply called "yield" or "sleep" and kept
> spinning.  We were actually using work stealing (with lock-free work
> queues as I recall) along with wait queues on empty locations, so we
> didn't sleep very often.  Nonetheless, this did occasionally turn out
> to cause real performance problems.

Abusing sched_yield() like this has been a performance problem with
threading engines on Linux for a while, though usually as the middle
tier of 3-tier locks as opposed to being the sole blocking primitive.

On Mon, Oct 13, 2003 at 12:02:52PM -0400, Jan-Willem Maessen wrote:
> In reality you probably want to use some sort of thin lock / fat lock
> approach a la Java.  Is the location thinly locked?  If so, CAS in a
> fat lock instead, then wait on that.  Now update requires a CAS as
> well, to see whether the thin lock was fattened.  If so, the fat lock
> must be unlocked and deallocated.
> Naturally, we'd like to keep track of unshared objects so we can avoid
> any kind of complicated update protocol in the common case.  This
> requires the implementor to establish a careful and clear contract
> between the compiler, the GC, and the thread scheduler.

This sounds very heavyweight. I'd be tempted to start looking at
lockless algorithms from the outset if you need to synchronize such
basic/frequently exercised operations.

On Mon, Oct 13, 2003 at 12:02:52PM -0400, Jan-Willem Maessen wrote:
> Finally, the complexity of thunk update pales in comparison to the
> complexity of multiprocessor GC.  In pH we punted on this
> issue---debugging a GC with per-thread allocation pools was hard
> enough.  It killed us; thread performance didn't scale because GC
> didn't scale.  Eager Haskell was designed with multithreaded
> shared-memory execution in mind, but I simply couldn't find the year
> or more it would take to build and debug a true multiprocessor GC---so
> it remains a uniprocessor implementation.

I've not really heard horror stories per se, but indirectly I've gotten
the idea that the JVM's out there spent a lot of time on this. I suspect
some of the Java userspace lock contention I've seen was related to GC.

-- wli