The aim here is to reduce the number of remote memory accesses on
systems with a NUMA memory architecture, typically multi-socket servers.
Linux provides a NUMA API for doing two things:
- Allocating memory local to a particular node
- Binding a thread to a particular node
When given the +RTS --numa flag, the runtime will:
- Determine the number of NUMA nodes (N) by querying the OS
- Assign capabilities to nodes, so cap C is on node C%N
- Bind worker threads on a capability to the correct node
- Keep separate free lists in the block layer for each node
- Allocate the nursery for a capability from node-local memory
- Allocate blocks in the GC from node-local memory
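The capability-to-node assignment above (cap C on node C%N) can be
sketched as follows. This is only an illustration of the round-robin
rule, not the runtime's actual code: the names cap_to_node and
caps_on_node are hypothetical, and the zero-node guard is an assumption
added for safety.

```c
#include <stdint.h>

/* Hypothetical helper: map capability number `cap` onto one of
 * `n_nodes` NUMA nodes using the round-robin rule C % N.
 * Assumption: n_nodes == 0 is treated as "everything on node 0"
 * so that we never divide by zero. */
uint32_t cap_to_node(uint32_t cap, uint32_t n_nodes)
{
    return n_nodes == 0 ? 0 : cap % n_nodes;
}

/* Hypothetical helper: count how many of the first `n_caps`
 * capabilities are assigned to `node` under the rule above. */
uint32_t caps_on_node(uint32_t n_caps, uint32_t n_nodes, uint32_t node)
{
    uint32_t count = 0;
    for (uint32_t cap = 0; cap < n_caps; cap++) {
        if (cap_to_node(cap, n_nodes) == node) {
            count++;
        }
    }
    return count;
}
```

With -N24 on a 2-node machine, this rule puts even-numbered
capabilities on node 0 and odd-numbered ones on node 1, so each node
gets 12 capabilities.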
For example, using nofib/parallel/queens on a 24-core 2-socket machine:
    $ ./Main 15 +RTS -N24 -s -A64m
      Total time 173.960s ( 7.467s elapsed)
    $ ./Main 15 +RTS -N24 -s -A64m --numa
      Total time 150.836s ( 6.423s elapsed)
The biggest win is expected to come from allocating from node-local
memory, so programs using a large -A value (as here) should benefit most.
According to perf, on this program the number of remote memory accesses
was reduced by more than 50% by using --numa.
TODO: yes it needs some docs too.