Bad news: apparent bug in casMutVar going back to 7.2

Ryan Newton rrnewton at gmail.com
Sat Feb 1 14:10:53 UTC 2014


Hi Carter & others,

Carter, yes, this is CAS on pointers and in my next mail I'll try to come
up with some hypotheses as to why we may have (remaining) problems there.

But first, I have been assured that on x86 there is no failure mode in
which doing a comparison on the value read by CAS should not correctly
diagnose success or failure (same as directly reading the Zero Flag) [1].

And yet, there's this discrepancy, where the modified casMutVar that I
linked to does not have the failure.  As for reproducing the failure,
either of the two following tests will currently show problems:

   - Two threads try to casIORef False->True, both succeed
   - 120 threads try to read, increment, CAS until they succeed.  The total
   is often not 120 because multiple threads think the successfully
   incremented, say, 33->34.

Here's a specific recipe for the latter test on GHC 7.6.3 Mac or Linux:


*git clone git at github.com:rrnewton/haskell-lockfree-queue.git *
*cd haskell-lockfree-queue/AtomicPrimops/*

*git checkout 1a1e7e55f6706f9e5754*

*cabal sandbox init*

*cabal install -f-withTH -fforeign ./ ./testing --enable-tests*

*./testing/dist/dist-sandbox-*/build/test-atomic-primops/test-atomic-primops
-t n_threads*

You may have to run the last line several times to see the failure.

Best,
  -Ryan

[1] I guess the __sync_bool_compare_and_swap intrinsic which reads ZF is
there just to avoid the extra comparison.

[2] P.S. I'd like to try this on GHC head, but the RHEL 6 machine I usually
use to build it is currently not validating (below error, commit
65d05d7334).  After I debug this gmp problem I'll confirm that the bug
under discussion applies on the 7.8 branch.

    ./sync-all checkout ghc-7.8
    sh validate
...
/usr/bin/ld: libraries/integer-gmp/gmp/objs/aors.o: relocation R_X86_64_32
against `__gmpz_sub' can not be used when making a shared object; recompile
with -fPIC
libraries/integer-gmp/gmp/objs/aors.o: could not read symbols: Bad value




On Sat, Feb 1, 2014 at 2:55 AM, Carter Schonwald <carter.schonwald at gmail.com
> wrote:

> Ryan, is your benchmark using CAS on pointers, or immediate words? trying
> to get atomic primops to build on my 7.8 build on my mac
>
>
> On Sat, Feb 1, 2014 at 2:44 AM, Carter Schonwald <
> carter.schonwald at gmail.com> wrote:
>
>> https://ghc.haskell.org/trac/ghc/ticket/8724#ticket is the ticket
>>
>> when i'm more awake i'll experiment some more
>>
>>
>> On Sat, Feb 1, 2014 at 2:33 AM, Carter Schonwald <
>> carter.schonwald at gmail.com> wrote:
>>
>>> i have a ticket for tracking this, though i'm thinking my initial
>>> attempt at a patch generates the same object code as it did before.
>>>
>>> @ryan, what CPU variant are you testing this on? is this on a NUMA
>>> machine or something?
>>>
>>>
>>> On Sat, Feb 1, 2014 at 1:58 AM, Carter Schonwald <
>>> carter.schonwald at gmail.com> wrote:
>>>
>>>> woops, i mean cmpxchgq
>>>>
>>>>
>>>> On Sat, Feb 1, 2014 at 1:36 AM, Carter Schonwald <
>>>> carter.schonwald at gmail.com> wrote:
>>>>
>>>>> ok, i can confirm that on my 64bit mac, both clang and gcc
>>>>> use cmpxchgl rather than cmpxchg
>>>>> i'll whip up a strawman patch on head that can be cherrypicked /
>>>>> tested out by ryan et al
>>>>>
>>>>>
>>>>> On Sat, Feb 1, 2014 at 1:12 AM, Carter Schonwald <
>>>>> carter.schonwald at gmail.com> wrote:
>>>>>
>>>>>> Hey Ryan,
>>>>>> looking at this closely
>>>>>> Why isn't CAS using CMPXCHG8B on 64bit architectures?  Could that be
>>>>>> the culprit?
>>>>>>
>>>>>> Could the issue be that we've not had a good stress test that would
>>>>>> create values that are equal on the 32bit range, but differ on the 64bit
>>>>>> range, and you're hitting that?
>>>>>>
>>>>>> Could you try seeing if doing that change fixes things up?
>>>>>> (I may be completely wrong, but just throwing this out as a naive
>>>>>> "obvious" guess)
>>>>>>
>>>>>>
>>>>>> On Sat, Feb 1, 2014 at 12:58 AM, Ryan Newton <rrnewton at gmail.com>wrote:
>>>>>>
>>>>>>> Then again... I'm having trouble seeing how the spec on page 3-149
>>>>>>> of the Intel manual would allow the behavior I'm seeing:
>>>>>>>
>>>>>>>
>>>>>>> http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf
>>>>>>>
>>>>>>> Nevertheless, this is exactly the behavior we're seeing with the
>>>>>>> current Haskell primops.  Two threads simultaneously performing the same
>>>>>>> CAS(p,a,b) can both think that they succeeded.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sat, Feb 1, 2014 at 12:33 AM, Ryan Newton <rrnewton at gmail.com>wrote:
>>>>>>>
>>>>>>>> I commented on the commit here:
>>>>>>>>
>>>>>>>>
>>>>>>>> https://github.com/ghc/ghc/commit/521b792553bacbdb0eec138b150ab0626ea6f36b
>>>>>>>>
>>>>>>>> The problem is that our "cas" routine in SMP.h is similar to the C
>>>>>>>> compiler intrinsic __sync_val_compare_and_swap, in that it returns the old
>>>>>>>> value.  But it seems we cannot use a comparison against that old value to
>>>>>>>> determine whether or not the CAS succeeded.  (I believe the CAS may fail
>>>>>>>> due to contention, but the old value may happen to look like our old value.)
>>>>>>>>
>>>>>>>> Unfortunately, this didn't occur to me until it started causing
>>>>>>>> bugs [1] [2].  Fixing casMutVar# fixes these bugs.  However, the way I'm
>>>>>>>> currently fixing CAS in the "atomic-primops" package is by using
>>>>>>>> __sync_bool_compare_and_swap:
>>>>>>>>
>>>>>>>>
>>>>>>>> https://github.com/rrnewton/haskell-lockfree/commit/f9716ddd94d5eff7420256de22cbf38c02322d7a#diff-be3304b3ecdd8e1f9ed316cd844d711aR200
>>>>>>>>
>>>>>>>> What is the best fix for GHC itself?   Would it be ok for GHC to
>>>>>>>> include a C compiler intrinsic like __sync_val_compare_and_swap?  Otherwise
>>>>>>>> we need another big ifdbef'd function like "cas" in SMP.h that has the
>>>>>>>> architecture-specific inline asm across all architectures.  I can write the
>>>>>>>> x86 one, but I'm not eager to try the others.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>>    -Ryan
>>>>>>>>
>>>>>>>> [1] https://github.com/iu-parfunc/lvars/issues/70
>>>>>>>> [2] https://github.com/rrnewton/haskell-lockfree/issues/15
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> ghc-devs mailing list
>>>>>>> ghc-devs at haskell.org
>>>>>>> http://www.haskell.org/mailman/listinfo/ghc-devs
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.haskell.org/pipermail/ghc-devs/attachments/20140201/715694dd/attachment.html>


More information about the ghc-devs mailing list