RTS SMP: Use compiler built-ins on all platforms.
ClosedPublic

Authored by trommler on May 17 2016, 1:08 PM.

Details

Summary

Use C compiler builtins for atomic SMP primitives. This saves a lot
of CPP ifdefs.

Add test for atomic xchg:
Test if __sync_lock_test_and_set() builtin stores the second argument.
The gcc manual says the actual value stored is implementation defined.

Test Plan

validate and eyeball generated assembler code

Diff Detail

Repository
rGHC Glasgow Haskell Compiler
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
trommler updated this revision to Diff 7626.May 17 2016, 1:08 PM
trommler retitled this revision from to RTS SMP: Use compiler built-ins on all platforms..
trommler updated this object.
trommler edited the test plan for this revision. (Show Details)
simonmar requested changes to this revision.May 17 2016, 2:25 PM
simonmar edited edge metadata.
simonmar added inline comments.
includes/stg/SMP.h
115–130

I found this comment a bit hard to follow, I suggest: first say what the function's semantics are, then mention that we have a test for it, and finally note that the primitive is not implemented with the intended semantics on all architectures by gcc, quoting the gcc manual.

127

This is a weaker barrier than we had before. I presume since you've eyeballed the code that it doesn't affect us, but it's worth noting in a comment.

This revision now requires changes to proceed.May 17 2016, 2:25 PM

Add x86_64 assembler output.

includes/stg/SMP.h
127

Here is the code for x86_64 compiled by gcc 4.3.4.
This is the old assembly:

xchg (%rdi),%rsi
movq    %rsi, %rax
ret

This is the new assembly (64 bit word):

movq    %rsi, %rax
xchgq   (%rdi), %rax
ret

If I understand this correctly the barrier is not weaker on X86_64.

On PowerPC we issue no barrier before the load and reserve and an import barrier (isync) after the store. That isync was missing before Trac #12070.

I have not looked at ARM and SPARC.

OTOH, the comment at the top of this file talks about an "atomic exchange operation" and does not mention other barriers. Am I missing something?

simonmar added inline comments.May 19 2016, 2:33 AM
includes/stg/SMP.h
127

It's a weaker barrier according to the gcc docs: https://gcc.gnu.org/onlinedocs/gcc-4.4.3/gcc/Atomic-Builtins.html

"This builtin is not a full barrier, but rather an acquire barrier. This means that references after the builtin cannot move to (or be speculated to) before the builtin, but previous memory stores may not be globally visible yet, and previous memory loads may not yet be satisfied. "

The code we had before had the constraint "+m" which I think is a full barrier (for the compiler; xchg is already a full barrier for the CPU).

I think it's probably ok, but it would be a good idea to look at some of the asm to check. Since this is an inline function we need to look at the assembly in the places it's called, not just the extern copy of the code. It's used in lockClosure which itself is inlined and used in a few places.

hvr edited edge metadata.May 19 2016, 4:34 AM

would it make any sense to combine __sync_synchronize() with __sync_lock_test_and_set()?

In D2233#64807, @hvr wrote:

would it make any sense to combine __sync_synchronize() with __sync_lock_test_and_set()?

No, that would only hurt performance.

I will investigate @simonmar 's comment some more. So far my gut feeling is that the weaker barrier is sufficient. I still need to look at the code that unlocks a closure some more. If it has the memory semantics of a lock release then we are good to go. (Note: We cannot use __sync_lock_release() because it would write 0 to the memory location but we want a different value to be written).

FWIW: A build of ghc 8.0.1 (where we use the builtins already) on a POWER8 machine which has way more relaxed memory consistency by default than an Intel machine did not show any SMP-related failures.

trommler added inline comments.May 25 2016, 12:08 PM
includes/stg/SMP.h
127

There are two questions:

  • Is the barrier emitted by the compiler strong enough? Yes, an acquire barrier is sufficient because the release operation (unlockClosure) issues a store-store barrier (write_barrier) before it releases the lock.
  • Does the compiler move the lock operation around in optimizations, i.e. is the software barrier too weak? The x86_64 assembler code generated for rts/RaiseAsync.c is fine. The xchg operation was not moved around.

BTW: +m is not a memory barrier, but it says that the respective parameter (here: *p) is input /output (+) and in memory (m). See gcc asm modifiers and gcc asm constraints.

simonmar accepted this revision.May 26 2016, 3:02 AM
simonmar edited edge metadata.

Thanks! Could you move the relevant bits of explanation to comments in the code please?

This revision is now accepted and ready to land.May 26 2016, 3:02 AM
trommler updated this object.May 27 2016, 3:33 AM
trommler edited edge metadata.
trommler updated this revision to Diff 7752.May 27 2016, 5:00 AM
  • update file path in comment
  • Improve comments after code review.
This revision was automatically updated to reflect the committed changes.