memory ordering

Tue Dec 31 14:04:22 UTC 2013

I was thinking about my response, and realized there was one major
misleading thing in my description.  The load reordering I described
applies to load instructions in C-- proper, i.e. things that show up
in the C-- dup as:

    W_ x = I64[...addr...]

Reads to IORefs and reads to vectors get compiled inline (as they
eventually translate into inline primops), so my admonitions are
applicable.

However, the story with *foreign primops* (which is how loadLoadBarrier
in atomic-primops is defined, how you might imagine defining a custom
read function as a primop) is a little different.  First, what does a
call to an foreign primop compile into? It is *not* inlined, so it will
eventually get compiled into a jump (this could be a problem if you're
really trying to squeeze out performance!)  Second, the optimizer is a
bit more conservative when it comes to primop calls (internally referred
to as "unsafe foreign calls"); at the moment, the optimizer assumes
these foreign calls clobber heap memory, so we *automatically* will not
push loads/stores beyond this boundary. (NB: We reserve the right to
change this in the future!)

This is probably why atomic-primops, as it is written today, seems to
work OK, even in the presence of the optimizer.  But I also have a hard
time believing it gives the speedups you want, due to the current
design. (CC'd Ryan Newton, because I would love to be wrong here, and
maybe he can correct me on this note.)

Cheers,
Edward

P.S. loadLoadBarrier compiles to a no-op on x86 architectures, but
because it's not inlined I think you will still end up with a jump (LLVM
might be able to eliminate it).

Excerpts from John Lato's message of 2013-12-31 03:01:58 +0800:
> Hi Edward,
> 
> Thanks very much for this reply, it answers a lot of questions I'd had.
>  I'd hoped that ordering would be preserved through C--, but c'est la vie.
>  Optimizing compilers are ever the bane of concurrent algorithms!
> 
> stg/SMP.h does define a loadLoadBarrier, which is exposed in Ryan Newton's
> atomic-primops package.  From the docs, I think that's a general read
> barrier, and should do what I want.  Assuming it works properly, of course.
>  If I'm lucky it might even be optimized out.
> 
> Thanks,
> John
> 
> On Mon, Dec 30, 2013 at 6:04 AM, Edward Z. Yang <ezyang at mit.edu> wrote:
> 
> > Hello John,
> >
> > Here are some prior discussions (which I will attempt to summarize
> > below):
> >
> >     http://www.haskell.org/pipermail/haskell-cafe/2011-May/091878.html
> >     http://www.haskell.org/pipermail/haskell-prime/2006-April/001237.html
> >     http://www.haskell.org/pipermail/haskell-prime/2006-March/001079.html
> >
> > The guarantees that Haskell and GHC give in this area are hand-wavy at
> > best; at the moment, I don't think Haskell or GHC have a formal memory
> > model—this seems to be an open research problem. (Unfortunately, AFAICT
> > all the researchers working on relaxed memory models have their hands
> > full with things like C++ :-)
> >
> > If you want to go ahead and build something that /just/ works for a
> > /specific version/ of GHC, you will need to answer this question
> > separately for every phase of the compiler.  For Core and STG, monads
> > will preserve ordering, so there is no trouble.  However, for C--, we
> > will almost certainly apply optimizations which reorder reads (look at
> > CmmSink.hs).  To properly support your algorithm, you will have to add
> > some new read barrier mach-ops, and teach the optimizer to respect them.
> > (This could be fiendishly subtle; it might be better to give C-- a
> > memory model first.)  These mach-ops would then translate into
> > appropriate arch-specific assembly or LLVM instructions, preserving
> > the guarantees further.
> >
> > This is not related to your original question, but the situation is a
> > bit better with regards to reordering stores: we have a WriteBarrier
> > MachOp, which in principle, prevents store reordering.  In practice, we
> > don't seem to actually have any C-- optimizations that reorder stores.
> > So, at least you can assume these will work OK!
> >
> > Hope this helps (and is not too inaccurate),
> > Edward
> >
> > Excerpts from John Lato's message of 2013-12-20 09:36:11 +0800:
> > > Hello,
> > >
> > > I'm working on a lock-free algorithm that's meant to be used in a
> > > concurrent setting, and I've run into a possible issue.
> > >
> > > The crux of the matter is that a particular function needs to perform the
> > > following:
> > >
> > > > x <- MVector.read vec ix
> > > > position <- readIORef posRef
> > >
> > > and the algorithm is only safe if these two reads are not reordered (both
> > > the vector and IORef are written to by other threads).
> > >
> > > My concern is, according to standard Haskell semantics this should be
> > safe,
> > > as IO sequencing should guarantee that the reads happen in-order.  Of
> > > course this also relies upon the architecture's memory model, but x86
> > also
> > > guarantees that reads happen in order.  However doubts remain; I do not
> > > have confidence that the code generator will handle this properly.  In
> > > particular, LLVM may freely re-order loads of NotAtomic and Unordered
> > > values.
> > >
> > > The one hope I have is that ghc will preserve IO semantics through the
> > > entire pipeline.  This seems like it would be necessary for proper
> > handling
> > > of exceptions, for example.  So, can anyone tell me if my worries are
> > > unfounded, or if there's any way to ensure the behavior I want?  I could
> > > change the readIORef to an atomicModifyIORef, which should issue an
> > mfence,
> > > but that seems a bit heavy-handed as just a read fence would be
> > sufficient
> > > (although even that seems more than necessary).
> > >
> > > Thanks,
> > > John L.
> >