RTS: Auto-size the allocation area depending on CPU cache size
AbandonedPublic

Authored by sjakobi on May 9 2018, 3:40 AM.

Details

Reviewers
bgamari
erikd
simonmar
Trac Issues
#13362
Summary

Looks for L2 and L3 caches and sets the size of the allocation area
to the size of the largest cache found.
If no cache can be detected, fall back to the existing default of 1MB.

Implemented for Linux, Windows, and Darwin.
So far it has been tested only on Linux and 64-bit Windows.

sjakobi created this revision. May 9 2018, 3:40 AM

Have you tested the performance implications?
I remember some programs in nofib actually getting slower with larger allocation sizes.

So it doesn't seem obvious to me that having the allocation area match a cache size would be better in general.

Phyx added a subscriber: Phyx. Edited May 9 2018, 12:38 PM

@sjakobi I'll start the 32-bit build for you to check the APIs, but some of these new allocation numbers are off the chart...

peak_megabytes_allocated value is too high:
    Expected    T4334(normal) peak_megabytes_allocated:  2 +/-1%
    Lower bound T4334(normal) peak_megabytes_allocated:  1
    Upper bound T4334(normal) peak_megabytes_allocated:  3
    Actual      T4334(normal) peak_megabytes_allocated: 24
    Deviation   T4334(normal) peak_megabytes_allocated: 1100.0 %

So these would have to be resolved. I would have expected an increase due to the larger allocation size, but not this much; I doubt the machine Harbormaster is running on has an L2 or L3 cache that big.

edit: Actually, maybe it does... if Rackspace is using high-end Xeons or i7s. But then you'll need a way to force a particular value, or make this new behavior not the default, as it'll make the testsuite too hardware-dependent.

Phyx added inline comments. May 9 2018, 1:32 PM
rts/RtsFlags.c
149

Why not return 1MB from minAllocAreaSize instead of 0? Then you don't need this check here.

Phyx added a comment. May 9 2018, 5:36 PM

Ok, the 32-bit Windows build currently seems broken, though not due to your changes. I'll have to fix that before we can validate yours.

dfeuer added a subscriber: dfeuer. May 9 2018, 6:03 PM

Have you tested the performance implications?
I remember some programs in nofib actually getting slower with larger allocation sizes.

So it doesn't seem obvious to me that having the allocation area match a cache size would be better in general.

We certainly want to make sure the nursery is not larger than the largest cache. I believe the correct choice almost certainly depends on the amount of parallelism in the program. In a highly-parallel program, the synchronization cost of GC is very high, so we probably want to approach the largest cache. In a mostly-serial program, it may be better to limit the nursery to a smaller, faster cache so the GC mostly hits objects that are already in cache. In the mostly-serial case, however, there will be considerable variation depending on program allocation behavior.

The only way to decide whether this is a good idea is to get data (and plenty of it): nofib on multiple different hardware types, and the nofib/parallel suite too.

There are many factors that go into GC performance, many of them dependent on the characteristics of the program, so it's really hard to come up with a strategy that works well in general. Increasing the nursery size reduces the cost of GC overall (caches notwithstanding), so the numbers might look good even if you set the default to 20MB on a machine with a 1MB cache. However, we don't know how much memory the user is willing to give up to pay for this - that's why GHC has quite frugal settings by default.

The way nurseries are sized for parallel programs is arguably wrong: the -A size should perhaps be divided amongst the capabilities rather than multiplied (with sensible defaults, of course; we probably don't want to divide 1MB amongst 8 capabilities). I suppose we would have to introduce a new option for this, though.

Thanks for the insightful comments! I agree that this method is very likely too naive to result in good performance in the majority of scenarios.

Could someone still push this to a branch, so we can get nofib results for it?

rts/RtsFlags.c
149

I guess I wanted to have some separation of concerns. largestCpuCacheSize() is about investigating the platform, and IMHO shouldn't need to know about some RTS configuration defaults.

I've pushed this to the arcpatch-D4679 branch.

I don't see how perf.haskell.org will help you here. It just counts
instructions, so it's not going to tell you anything about the cache
effects you're after.

I don't see how perf.haskell.org will help you here. It just counts
instructions, so it's not going to tell you anything about the cache
effects you're after.

Oops, I misremembered that. Fewer instructions may mean fewer GCs, though!? I'll try to get some timings on my local machine when I get around to it!

Wow, I'm not sure whether this is real but the perf.haskell.org changes are quite remarkable: https://perf.haskell.org/ghc/#compare/875b61ea38aa912d153a30027b51a4f12508bb9a/a98f8a7dd2bae870c45570525182690122666e90.

Wow, I'm not sure whether this is real but the perf.haskell.org changes are quite remarkable: https://perf.haskell.org/ghc/#compare/875b61ea38aa912d153a30027b51a4f12508bb9a/a98f8a7dd2bae870c45570525182690122666e90.

This doesn't entirely surprise me. Trading memory for fewer GCs can often have a dramatic effect; the question is how much memory the user is willing to give up to get it. We don't know that tradeoff up front. Perhaps GHC is too frugal by default, but I think I'd like to see a lot of data before making that decision.

Yes, fair enough. I do wish it reported how much the GC count changed by.

Anyway, we clearly need more data.

bgamari requested changes to this revision. Oct 16 2018, 8:36 AM

@sjakobi, what is your plan for this?

This revision now requires changes to proceed. Oct 16 2018, 8:36 AM
sjakobi abandoned this revision. Oct 18 2018, 2:17 PM

@sjakobi, what is your plan for this?

I probably won't work further on this. I hope someone else can do the necessary performance investigations and come up with a more solid plan.