NUMA support
ClosedPublic

Authored by simonmar on May 11 2016, 3:32 PM.

Details

Summary

The aim here is to reduce the number of remote memory accesses on
systems with a NUMA memory architecture, typically multi-socket servers.

Linux provides a NUMA API for doing two things:

  • Allocating memory local to a particular node
  • Binding a thread to a particular node

When given the +RTS --numa flag, the runtime will

  • Determine the number of NUMA nodes (N) by querying the OS
  • Assign capabilities to nodes, so cap C is on node C%N
  • Bind worker threads on a capability to the correct node
  • Keep a separate free lists in the block layer for each node
  • Allocate the nursery for a capability from node-local memory
  • Allocate blocks in the GC from node-local memory

For example, using nofib/parallel/queens on a 24-core 2-socket machine:

$ ./Main 15 +RTS -N24 -s -A64m
  Total   time  173.960s  (  7.467s elapsed)

$ ./Main 15 +RTS -N24 -s -A64m --numa
  Total   time  150.836s  (  6.423s elapsed)

The biggest win here is expected to be allocating from node-local
memory, so that means programs using a large -A value (as here).

According to perf, on this program the number of remote memory accesses
were reduced by more than 50% by using --numa.

TODO: yes it needs some docs too.

Test Plan
  • validate
  • There's a new flag --debug-numa=<n> that pretends to do NUMA without actually making the OS calls, which is useful for testing the code on non-NUMA systems.
  • TODO: I need to add some unit tests

Diff Detail

Repository
rGHC Glasgow Haskell Compiler
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
simonmar updated this revision to Diff 7529.May 11 2016, 3:32 PM
simonmar retitled this revision from to NUMA support.
simonmar updated this object.
simonmar edited the test plan for this revision. (Show Details)
simonmar updated this object.May 11 2016, 3:33 PM
simonmar edited edge metadata.
bgamari edited edge metadata.May 12 2016, 3:52 AM

Here's a quick pass through the patch. Looks pretty sensible on the whole, although it would be great to have a Note somewhere explaining what we are trying to do (e.g. assigning capabilities to NUMA nodes, each node gets its own set of blocks) and some of the implementation details (why the maximum node limit)?

includes/rts/Constants.h
301

It would be nice if you described why this limit exists (or do so in a note with a reference here).

rts/Capability.c
719

For what it's worth I find the C99 declaration-of-accumulator-in-the-for-clause style to be clearer than placing declarations at the beginning of the function.

rts/RtsFlags.c
223

I'm happy to see this renamed; it always seemed a bit inconsistent.

781

Perhaps this 4 should instead be MAX_NUMA_NODES?

806

Use MAX_NUMA_NODES here as well.

rts/Task.h
118

It might be clearer to say "The NUMA node this Task belongs to"

rts/posix/OSMem.c
319

Might it be worth adding the token TODO here or creating a Trac ticket so this can be found later?

simonmar updated this revision to Diff 7576.May 13 2016, 1:31 PM
simonmar edited edge metadata.
  • add docs
  • allow external threads to set affinity via rts_setInCallCapability()
  • improve comments
  • fix some suggestions
simonmar updated this revision to Diff 7577.May 13 2016, 1:33 PM
simonmar marked an inline comment as done.
simonmar edited edge metadata.

typo

simonmar marked 3 inline comments as done.May 13 2016, 1:36 PM
simonmar added inline comments.
rts/Capability.c
719

I don't think we're doing that anywhere else, and I'm not sure about the status of C99 support. Didn't we have a thread about that at some point?

erikd added inline comments.May 15 2016, 5:32 AM
rts/Capability.c
719

I thik we have pretty much decided that C99 is OK and we'll be introducing it as we go. New code is a prime candidate for using C99 from the start.

rts/Task.c
515

Is there a way to do this without a #ifdef? Even something at the top of the file doing:

#ifdef DEBUG
#define DEBUG_FAKE_NUMA  1
#else
#define DEBUG_FAKE_NUMA  0
#endif

then the if statement can be written as:

if (DEBUG_FAKE_NUMA  && !RtsFlags.DebugFlags.numa) // faking NUMA
rts/sm/BlockAlloc.c
281

I think we replaced nat with uint32_t :).

hvr added inline comments.May 15 2016, 5:35 AM
rts/Capability.c
719

In fact, GHC HEAD will currently fail to build if the C compiler doesn't advertise C99 or better support!

bgamari requested changes to this revision.May 24 2016, 4:28 AM
bgamari edited edge metadata.

Here's another pass over this.

docs/users_guide/runtime_control.rst
664

Perhaps it would be worth mentioning that we only support four nodes?

682

How does this explicitly provided mask interact with the determined number of NUMA nodes (called N above)? Is N taken to be the number of set bits in the mask or the true number of nodes indicated by the operating system?

rts/Capability.c
719

We should make sure that this fact gets written down somewhere.

733

It seems to me that a comment explaining the reason for the stride here (nNumaNodes) would be helpful.

rts/RtsFlags.c
781

What if the OS says we have more than four nodes yet only a subset are selected in the user-provided mask?

792

If we are going to have this it seems it should really be documented somewhere (even if just in the debug RTS --help message).

rts/Task.c
434

Hmm, this seems a bit confusing but fair enough. Does DEBUG not have a 0/1 value? If so, we could simply write,

if (!DEBUG || (DEBUG && !RtsFlags.DebugFlags.numa))
     setThreadNode(task->node);
rts/sm/BlockAlloc.c
650

Presumably this means you could simply put the assert in an if (THREADED) block, no?

813–831

Why isn't this fair. I'm confused...

This revision now requires changes to proceed.May 24 2016, 4:28 AM
simonmar updated this revision to Diff 7722.May 24 2016, 10:30 AM
simonmar edited edge metadata.
  • various cleanups & fixes, address some comments
  • increase MAX_NUMA_NODES to 16
simonmar planned changes to this revision.May 24 2016, 11:13 AM
simonmar marked 2 inline comments as done.

I've addressed some of the comments (thanks!) and I'll address some more in the next revision.

docs/users_guide/runtime_control.rst
682

The mask is currently ignored, I haven't implemented this bit yet. It should really be a TODO. The intention is that N will be equal to the number of bits set in the mask.

rts/sm/BlockAlloc.c
650

No, it means the assertion is not true *during* GC. I'll update it.

813–831

because we should really free an even amount of memory on each node

simonmar updated this revision to Diff 7893.Jun 7 2016, 8:57 AM
simonmar edited edge metadata.
  • implement the node mask parameter to --numa
  • doc improvements
  • test --debug_numa
  • with nursery chunks, grab unused chunks from other nodes
simonmar marked 11 inline comments as done.Jun 7 2016, 2:40 PM

Ok, I think this is just about ready to go. It needs testing on more platforms though, I almost certainly broke Windows & Mac.

@simonmar, are you planning on taking care of this additional testing?

If anyone could help that would be great. It'll take me a while to get set up for testing on the other platforms.

Phyx added a subscriber: Phyx.Jun 9 2016, 1:43 PM

Validate results on Windows:

E:/msys32/home/Tamar/ghc/rts/dist/build/libHSrts.a(MBlock.o): In function `getMBlocksOnNode':
E:\msys32\home\Tamar\ghc/rts/sm/MBlock.c:597: undefined reference to `osBindMBlocksToNode'
E:\msys32\home\Tamar\ghc/rts/sm/MBlock.c:597: undefined reference to `osBindMBlocksToNode'
collect2.exe: error: ld returned 1 exit status
`gcc.exe' failed in phase `Linker'. (Exit code: 1)
utils/compare_sizes/ghc.mk:8: recipe for target 'utils/compare_sizes/dist-install/build/tmp/compareSizes.exe' failed
make[1]: *** [utils/compare_sizes/dist-install/build/tmp/compareSizes.exe] Error 1
make[1]: *** Waiting for unfinished jobs....
E:/msys32/home/Tamar/ghc/rts/dist/build/libHSrts.a(MBlock.o): In function `getMBlocksOnNode':
E:\msys32\home\Tamar\ghc/rts/sm/MBlock.c:597: undefined reference to `osBindMBlocksToNode'
E:\msys32\home\Tamar\ghc/rts/sm/MBlock.c:597: undefined reference to `osBindMBlocksToNode'
collect2.exe: error: ld returned 1 exit status
`gcc.exe' failed in phase `Linker'. (Exit code: 1)
utils/hsc2hs/ghc.mk:16: recipe for target 'utils/hsc2hs/dist-install/build/tmp/hsc2hs.exe' failed
make[1]: *** [utils/hsc2hs/dist-install/build/tmp/hsc2hs.exe] Error 1
E:/msys32/home/Tamar/ghc/rts/dist/build/libHSrts.a(MBlock.o): In function `getMBlocksOnNode':
E:\msys32\home\Tamar\ghc/rts/sm/MBlock.c:597: undefined reference to `osBindMBlocksToNode'
E:\msys32\home\Tamar\ghc/rts/sm/MBlock.c:597: undefined reference to `osBindMBlocksToNode'
collect2.exe: error: ld returned 1 exit status
`gcc.exe' failed in phase `Linker'. (Exit code: 1)
utils/runghc/ghc.mk:30: recipe for target 'utils/runghc/dist-install/build/tmp/runghc.exe' failed
make[1]: *** [utils/runghc/dist-install/build/tmp/runghc.exe] Error 1
E:/msys32/home/Tamar/ghc/rts/dist/build/libHSrts.a(MBlock.o): In function `getMBlocksOnNode':
E:\msys32\home\Tamar\ghc/rts/sm/MBlock.c:597: undefined reference to `osBindMBlocksToNode'
E:\msys32\home\Tamar\ghc/rts/sm/MBlock.c:597: undefined reference to `osBindMBlocksToNode'
collect2.exe: error: ld returned 1 exit status
`gcc.exe' failed in phase `Linker'. (Exit code: 1)
utils/hpc/ghc.mk:21: recipe for target 'utils/hpc/dist-install/build/tmp/hpc.exe' failed
make[1]: *** [utils/hpc/dist-install/build/tmp/hpc.exe] Error 1
E:/msys32/home/Tamar/ghc/rts/dist/build/libHSrts.a(MBlock.o): In function `getMBlocksOnNode':
E:\msys32\home\Tamar\ghc/rts/sm/MBlock.c:597: undefined reference to `osBindMBlocksToNode'
E:\msys32\home\Tamar\ghc/rts/sm/MBlock.c:597: undefined reference to `osBindMBlocksToNode'
collect2.exe: error: ld returned 1 exit status
`gcc.exe' failed in phase `Linker'. (Exit code: 1)
utils/ghc-pkg/ghc.mk:83: recipe for target 'utils/ghc-pkg/dist-install/build/tmp/ghc-pkg.exe' failed
make[1]: *** [utils/ghc-pkg/dist-install/build/tmp/ghc-pkg.exe] Error 1
E:/msys32/home/Tamar/ghc/rts/dist/build/libHSrts.a(MBlock.o): In function `getMBlocksOnNode':
E:\msys32\home\Tamar\ghc/rts/sm/MBlock.c:597: undefined reference to `osBindMBlocksToNode'
E:\msys32\home\Tamar\ghc/rts/sm/MBlock.c:597: undefined reference to `osBindMBlocksToNode'
collect2.exe: error: ld returned 1 exit status
`gcc.exe' failed in phase `Linker'. (Exit code: 1)
utils/ghc-cabal/ghc.mk:89: recipe for target 'utils/ghc-cabal/dist-install/build/tmp/ghc-cabal.exe' failed
make[1]: *** [utils/ghc-cabal/dist-install/build/tmp/ghc-cabal.exe] Error 1
E:/msys32/home/Tamar/ghc/rts/dist/build/libHSrts_thr.a(RtsFlags.thr_o): In function `procRtsOpts':
E:\msys32\home\Tamar\ghc/rts/RtsFlags.c:772: undefined reference to `osNumaAvailable'
E:\msys32\home\Tamar\ghc/rts/RtsFlags.c:779: undefined reference to `osNumaNodes'
E:\msys32\home\Tamar\ghc/rts/RtsFlags.c:786: undefined reference to `osNumaMask'
E:/msys32/home/Tamar/ghc/rts/dist/build/libHSrts_thr.a(MBlock.thr_o): In function `getMBlocksOnNode':
E:\msys32\home\Tamar\ghc/rts/sm/MBlock.c:597: undefined reference to `osBindMBlocksToNode'
E:\msys32\home\Tamar\ghc/rts/sm/MBlock.c:597: undefined reference to `osBindMBlocksToNode'
E:/msys32/home/Tamar/ghc/rts/dist/build/libHSrts_thr.a(Task.thr_o): In function `workerStart':
E:\msys32\home\Tamar\ghc/rts/Task.c:432: undefined reference to `setThreadNode'
E:/msys32/home/Tamar/ghc/rts/dist/build/libHSrts_thr.a(Task.thr_o): In function `rts_setInCallCapability':
E:\msys32\home\Tamar\ghc/rts/Task.c:513: undefined reference to `setThreadNode'
collect2.exe: error: ld returned 1 exit status
`gcc.exe' failed in phase `Linker'. (Exit code: 1)
iserv/ghc.mk:55: recipe for target 'iserv/stage2/build/tmp/ghc-iserv.exe' failed
simonmar updated this revision to Diff 7924.Jun 10 2016, 3:31 AM
simonmar edited edge metadata.
  • Fix a bug in resizeNurseriesFixed
  • Turn the remaining allocBlock() calls into allocBlockOnNode()
  • Add NUMA stubs for win32, might make it compile
  • Update some comments
simonmar updated this revision to Diff 7925.Jun 10 2016, 4:15 AM
simonmar edited edge metadata.

More win32 stubs

@Phyx thanks for testing! I've added some stubs to the win32 code so that it should now compile. If you're able to re-test that would be really helpful.

Once win32 is done I think we should be ready to go.

Phyx edited edge metadata.Jun 10 2016, 7:28 AM

No problem,

Just two issues left:

rts\win32\OSThreads.c:324:6: error:
     error: no previous prototype for 'setThreadNode' [-Werror=missing-prototypes]
     void setThreadNode (uint32_t node STG_UNUSED) { /* nothing */ }
          ^

rts\win32\OSThreads.c:325:6: error:
     error: no previous prototype for 'releaseThreadNode' [-Werror=missing-prototypes]
     void releaseThreadNode (void) { /* nothing */ }
          ^
cc1.exe: all warnings being treated as errors
`gcc.exe' failed in phase `C Compiler'. (Exit code: 1)
rts/ghc.mk:255: recipe for target 'rts/dist/build/win32/OSThreads.l_o' failed
make[1]: *** [rts/dist/build/win32/OSThreads.l_o] Error 1
simonmar updated this revision to Diff 7929.EditedJun 10 2016, 9:00 AM
simonmar edited edge metadata.
  • Hopefully fix win32
  • Rebase
Phyx accepted this revision.Jun 10 2016, 11:26 AM
Phyx edited edge metadata.

Cheers, validates now.

Thanks @simonmar , Out of curiosity, do you have any idea what it would take to support this on Windows as well?

Would it just be filling in the proper stubs in OSMem and OSThread files?

Yes, implementing the new OSMem and OSThread APIs would be enough, but I don't know how easy that is - it will depend on how well those map to the equivalent Windows APIs.

This revision was automatically updated to reflect the committed changes.