Support more than 64 logical processors on Windows

Authored by Phyx on Sep 17 2016, 10:25 AM.



Windows support for more than 64 logical processors is implemented
using processor groups.

Essentially it keeps the existing maximum of 64 processors per group and keeps
the affinity mask a 64-bit value, but adds a hierarchy above that.

This support was added in Windows 7, so we need to detect at runtime
whether the APIs are available, since our minimum supported version is
Windows Vista.

The maximum number of groups supported at this time is 4, so 256 logical cores.
The group indices are 0-based. A thread can only have affinity with one group
at a time, though a process may span multiple groups.

Particularly helpful is the whitepaper
'Supporting Systems That Have More Than 64 Processors'.

Processor groups are not guaranteed to be uniformly distributed, nor guaranteed
to be filled before the next group is used. The OS assigns processors to
groups based on physical proximity and will never split cores from one
physical CPU across more than one group. If you have two 48-core CPUs you'd
end up with two groups of 48 logical CPUs. Now add a third CPU with 10 cores and
the group it is assigned to depends on where the socket is on the board.

Test Plan

./validate or make test -c . in the rts test folder.

This tests for regressions; to test this particular functionality itself, run:

<program> +RTS -N -qa -RTS

Test is detailed in description.

Diff Detail

rGHC Glasgow Haskell Compiler
Automatic diff as part of commit; lint not applicable.
Automatic diff as part of commit; unit tests not applicable.
Phyx updated this revision to Diff 8731.Sep 17 2016, 10:25 AM
Phyx retitled this revision from to Support more than 64 logical processors on Windows.
Phyx updated this object.
Phyx edited the test plan for this revision. (Show Details)
Phyx added a reviewer: simonmar.
Phyx updated the Trac tickets for this revision.
Phyx added a subscriber: Restricted Project.
bgamari edited edge metadata.Sep 18 2016, 9:36 AM

Great work @Phyx! Perhaps we can find a Rackspace machine which is suitable for testing this.

Phyx added a comment.Sep 18 2016, 1:54 PM

@bgamari, it seems that the user who reported the bug has a machine on which he can test both this and the NUMA support, so I'm looking into producing two builds for him, or rather merging the two together.

simonmar requested changes to this revision.Sep 18 2016, 9:28 PM
simonmar edited edge metadata.

Great! Just a few nits.


x86_HOST_ARCH, not x86_BUILD_ARCH (in other places too)


Did you want a \n at the end of this string?


Should free(mask) here too

This revision now requires changes to proceed.Sep 18 2016, 9:28 PM
Phyx updated this revision to Diff 8776.Sep 25 2016, 11:20 AM
Phyx edited edge metadata.
Phyx marked 3 inline comments as done.
  • T11054: Update code based on review
  • T11054: Added non-uniform cpu groups support.
  • T11054: final cleanup.
Phyx updated this object.Sep 25 2016, 11:28 AM
Phyx edited the test plan for this revision. (Show Details)
Phyx edited edge metadata.
Phyx added a comment.EditedSep 25 2016, 12:11 PM

@simonmar @bgamari Thanks for looking things over. I have updated the code because it turns out the OS can distribute the groups a lot differently than I expected. I have updated the description to indicate this.

I have also been able to test this properly. As it turns out, the Windows kernel allows you to emulate CPU groups using the groupsize boot parameter, so I detail my test below:

My system is a quad core i7 with Hyper-threading. To test this, I asked the OS to restrict processor groups to 2 logical cores by changing my kernel boot parameters:

bcdedit.exe /set groupsize 2

Before this my core distribution was

Logical Processor to Group Map:
********  Group 0

and afterwards

Logical Processor to Group Map:
Group 0:
Group 1:

Logical to Physical Processor Map:
Physical Processor 0 (Hyperthreaded):
Physical Processor 1 (Hyperthreaded):

Logical Processor to Socket Map:
Socket 0:
Socket 1:

So it correctly split the cores and is emulating a dual socket system. (I seem to have lost some cores in the process, but that's fine).

my test program is a simple prime number generator:

import Control.Concurrent
import Control.Monad

main = do
  ids <- replicateM 9 (forkIO $ print $ length $ primesToG 100000000000)
  forever spin

spin = threadDelay 1000

primesToG m = 2 : sieve [3,5..m]
  where
    sieve (p:xs)
      | p*p > m   = p : xs
      | otherwise = p : sieve (xs `minus` [p*p, p*p+2*p..])

minus (x:xs) (y:ys) = case compare x y of
  LT -> x : minus xs (y:ys)
  EQ ->     minus xs ys
  GT ->     minus (x:xs) ys
minus xs _ = xs

Testing this with GHC 8.0.1 and +RTS -N -qa -RTS I see the problem reported in Trac #11054. The program is restricted to 1 group and never breaches the 50% CPU mark even though all 9 threads are saturated.

Now testing with the GHC in this patch: first off, it correctly detects the number of cores:

[*] Total number of active processors detected: 4
[*] Number of processor groups detected: 2
[*] Number of active processors in group 0 detected: 2
[*] Number of active processors in group 1 detected: 2
[*] Processor group map created
[*] Cumulative active processors for group 0: 0
[*] Cumulative active processors for group 1: 2

And CPU utilization goes up to the 75% mark (one core is left for the main thread and other work).

After that, the configuration can be cleaned up

bcdedit.exe /deletevalue groupsize
Phyx updated this revision to Diff 8777.Sep 25 2016, 12:17 PM
  • T11054: Added release notes.
simonmar accepted this revision.Sep 26 2016, 8:11 AM
simonmar edited edge metadata.

Just a few small changes, I'll accept and let you fix the nits.


Will sensible things happen if this is null?


please rename.. getProcessorsCumulativeSum?


Please add a short comment to say what the function does.


It is better not to have a memory leak like this, instead move cpuGroupCache out to the top level (still static) and arrange to free it later from hs_exit()

This revision is now accepted and ready to land.Sep 26 2016, 8:11 AM
Phyx updated this revision to Diff 8811.Sep 28 2016, 3:34 PM
Phyx marked an inline comment as done.
Phyx edited edge metadata.
  • T11054: Free static resources on exit.
Phyx marked 3 inline comments as done.Sep 28 2016, 3:39 PM

@simonmar thanks for the review. It's much appreciated. I've made the changes requested. Will just let it sit here a bit while I wait for feedback from the user who reported it.


Yes, the cache is initialized with the constant MAXIMUM_PROCESSORS. It should only be NULL on systems which don't support processor groups. On such systems the defaults kick in, e.g. 1 processor group and a maximum of 64 processors. These systems don't distinguish between active and inactive cores, so defaulting to the maximum allowed is fine (for the mask; the actual number of processors available is still returned correctly by the other functions).

If it's NULL here it defaults to the exact same behavior the previous code had.

Phyx updated this revision to Diff 8815.Sep 28 2016, 7:55 PM
Phyx edited edge metadata.
  • Hopefully fix linux
This revision was automatically updated to reflect the committed changes.
simonmar added inline comments.Oct 3 2016, 6:51 AM

What is this #ifdef for? CMINUSMINUS should only be needed in header files that are included by both .c and .cmm source files.

Phyx added inline comments.Oct 3 2016, 4:08 PM

In includes/rts/OSThreads.h, functions such as getNumberOfProcessors are defined within such an #ifdef. Since this call only makes sense when you can call those functions, I added it there, which means I can also only call it in that case.

When I didn't have this #ifdef here I got a compile error. I can of course remove it by moving the definition outside the block.

Phyx added a comment.Oct 5 2016, 3:35 PM

I think I can refactor out the #ifdef if I don't group the functions, which might be desirable. I'll do that in a bit.

Phyx added inline comments.Oct 8 2016, 11:33 AM

@simonmar it seems that rts/PrimOps.cmm and rts/Exception.cmm include this header, so the compile fails without the guard.