<div dir="ltr"><div dir="ltr"><div><div dir="auto">Awesome summary by Ben!</div></div><div><br></div><div>theres also some crazy hijinks needed currently to do compare and swap style stuff <br></div><div><br></div><div>see <a href="https://hackage.haskell.org/package/atomic-primops-0.8.2/docs/Data-Atomics.html">https://hackage.haskell.org/package/atomic-primops-0.8.2/docs/Data-Atomics.html</a> for some discussion. (i'm not sure what the current state of the world is on CAS tech)<br></div></div></div><div><br><div class="gmail_quote"><div dir="ltr">On Thu, Nov 29, 2018 at 9:45 AM Ben Gamari <<a href="mailto:ben@smart-cactus.org" target="_blank">ben@smart-cactus.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Travis Whitaker <<a href="mailto:pi.boy.travis@gmail.com" target="_blank">pi.boy.travis@gmail.com</a>> writes:<br>

<br>

> Hello GHC Devs,<br>

><br>

> I'm trying to get my head around ticket #15449 (<br>

> <a href="https://ghc.haskell.org/trac/ghc/ticket/15449" rel="noreferrer" target="_blank">https://ghc.haskell.org/trac/ghc/ticket/15449</a>). This gist of things is that<br>

> GHC generates incorrect aarch64 code that causes memory corruption in<br>

> multithreaded programs run on out-of-order machines. User trommler<br>

> discovered that similar issues are present on PowerPC, and indeed ARMv7 and<br>

> PowerPC support the same types of load/store reorderings. The LLVM code<br>

> emitted by GHC may be incorrect with respect to LLVM's memory model, but<br>

> this isn't a problem on architectures with minimal reordering like x86.<br>

><br>

Thank you for picking this up!<br>

<br>

> I had initially thought that GHC simply wasn't emitting the appropriate<br>

> LLVM fences; there's an elephant-gun-approach here (<br>

> <a href="https://github.com/TravisWhitaker/ghc/commits/ghc843-wip/T15449" rel="noreferrer" target="_blank">https://github.com/TravisWhitaker/ghc/commits/ghc843-wip/T15449</a>) that<br>

> guards each atomic operation with a full barrier. I still believe that GHC<br>

> is omitting necessary LLVM fences, but this change is insufficient to fix<br>

> the behavior of the test case (which is simply GHC itself compiling a test<br>

> package with '-jN', N > 1).<br>

><br>

> It seems there's a long and foggy history of the Cmm memory model. Edward<br>

> Yang discusses this a bit in his post here (<br>

> <a href="http://blog.ezyang.com/2014/01/so-you-want-to-add-a-new-concurrency-primitive-to-ghc/" rel="noreferrer" target="_blank">http://blog.ezyang.com/2014/01/so-you-want-to-add-a-new-concurrency-primitive-to-ghc/</a>)<br>

> and issues similar to #15449 have plagued GHC in the past, like #12469 (<br>

> <a href="https://ghc.haskell.org/trac/ghc/ticket/12469" rel="noreferrer" target="_blank">https://ghc.haskell.org/trac/ghc/ticket/12469</a>). Worryingly, GHC only has<br>

> MO_WriteBarrier, whereas PowerPC and ARMv7 really need read, write, and<br>

> full memory barriers. On ARM an instruction memory barrier might be<br>

> required as well, but I don't know enough about STG/Cmm to say for sure,<br>

> and it'd likely be LLVM's responsibility to emit that anyway.<br>

><br>

In my opinion GHC's current memory model story is quite unsustainable. As<br>

you point out, we currently have a limited selection of plain barriers<br>

and a few atomic operations, none of which are adequately documented<br>

given the subtlety they involve.<br>

<br>

I think we would at this point be foolish not to take advantage of the<br>

research that has been done in this area by moving to C11-style<br>

acquire/release semantics wherever possible. While following through on<br>

this is not a small task, these semantics are both well defined and<br>

easier to reason about intuitively. We might also be able to benefit<br>

from the wealth of model checking tools that are now available for this<br>

formalism.<br>

<br>

Ulan Degenbaev (CC'd) and I discussed him possibly picking up the task<br>

of moving to C11 atomics a few weeks ago.<br>

<br>

> I'm hoping that someone with more tribal knowledge than I might be able to<br>

> give me some pointers with regards to the following areas:<br>

><br>

><br>

>    - Does STG itself have anything like a memory model? My intuition says<br>

>    'definitely not', but given that STG expressions may contain Cmm operations<br>

>    (via StgCmmPrim), there may be STG-to-STG transformations that need to care<br>

>    about the target machine's memory model.<br>

<br>

As far as I know, it has nothing formally, or even informally, defined.<br>

That being said, relatively few of our primops make any guarantees about<br>

their operation in a concurrent setting. The cases that I can think of<br>

include `casArray#`, `atomicModifyIORef#`, `atomic{Read,Write}*Array#`<br>

and the STM operations.<br>

<br>

>    - With respect to Cmm, what reorderings does GHC perform? What are the<br>

>    relevant parts of the compiler to begin studying?<br>

<br>

As far as I know, very few. We only perform a handful of optimizations<br>

on C--. These include,<br>

<br>

 * Sinking of assignments (see compiler/cmm/CmmSink.hs); see<br>

   CmmSink.conflicts for what commutation we allow. We notably don't<br>

   allow sinking of any memory assignments past calls (including<br>

   primops)<br>

<br>

 * Simple constant folding (see CmmOpt.cmmMachOpFoldM)<br>

<br>

 * Common block elimination (see CmmCommonBlockElim)<br>

<br>

 * Some simple control-flow optimisation (see CmmContFlowOpt)<br>

<br>

>    - Are the LLVM atomics that GHC emits correct with respect to the LLVM<br>

>    memory model? As it stands now LLVM fences are only emitted for<br>

>    MO_WriteBarrier. Without fences accompanying the atomics, it seems the LLVM<br>

>    compiler could float dependent loads/stores past atomic operations.<br>

<br>

Frankly, I would be surprised if they are correct. Few people have<br>

really looked at GHC's memory ordering properties and even fewer have<br>

looked at those of the LLVM backend.<br>

<br>

>    - Why is MO_WriteBarrier the only provided memory barrier? My hunch is<br>

>    that it's because this is the only sort of barrier required on x86, which<br>

>    only allows loads to be reordered with older stores, but perhaps I'm<br>

>    missing something? Is it plausible that Cmm simply needs additional barrier<br>

>    primitives to target these weaker memory models? Conversely, is there some<br>

>    property of Cmm that let's us get away without read barriers at all?<br>

><br>

Note that there are a few others defined in SMP.h (namely<br>

store_load_barrier and load_load_barrier). However, see the discussion<br>

above. My understanding is that there were just very few formalisms for<br>

reasoning about weak memory models when this code was written. The<br>

literature now contains a much better understanding of this field but<br>

GHC has yet to catch up.<br>

<br>

<br>

Cheers,<br>

<br>

- Ben<br>

<br>

_______________________________________________<br>

ghc-devs mailing list<br>

<a href="mailto:ghc-devs@haskell.org" target="_blank">ghc-devs@haskell.org</a><br>

<a href="http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs" rel="noreferrer" target="_blank">http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs</a><br>

</blockquote></div></div>