Two step allocator for 64-bit systems
ClosedPublic

Authored by simonmar on Nov 24 2014, 3:23 PM.

Details

Summary

The current OS memory allocator conflates the concepts of allocating
address space and allocating memory, which makes the HEAP_ALLOCED()
implementation excessively complicated (as the only thing it cares
about is address space layout) and slow. Instead, what we want
is to allocate a single insanely large contiguous block of address
space (to make HEAP_ALLOCED() checks fast), and then commit subportions
of that in 1MB blocks as we did before.
This is currently behind a flag, USE_LARGE_ADDRESS_SPACE, that is only enabled for
certain OSes.

Test Plan

validate

Diff Detail

Repository
rGHC Glasgow Haskell Compiler
Branch
master
Lint
Lint WarningsExcuse: fix this later
SeverityLocationCodeMessage
Warningincludes/rts/storage/MBlock.h:88TXT3Line Too Long
Unit
No Unit Test Coverage
Build Status
Buildable 2267
Build 2277: GHC Patch Validation (amd64/Linux)
There are a very large number of changes, so older changes are hidden. Show Older Changes
simonmar added inline comments.Nov 25 2014, 3:14 PM
includes/rts/storage/MBlock.h
85

I think we want this to be a private symbol, so move it into rts/sm/MBlock.h, and the related macros too. That should mean you get one fewer memory references when the RTS is a shared library.

We can always provide an external API for this if people want it, but I seriously doubt anyone is using HEAP_ALLOCED outside of the RTS.

rts/posix/OSMem.c
95–96

Ha.

364

I'm less convinced about moving block descriptors, so it's good this patch is separate.

369

Does this work on OS X where we reserve the memory using vm_allocate?

370

these should be sysErrorBelch

373

ditto

gcampax added inline comments.Nov 25 2014, 9:01 PM
rts/posix/OSMem.c
95–96

I see what you mean. I tried to keep the convention to reserve/commit following the windows naming, but yes, it is sort of confusing that to get the reserve behavior we need to pass noreserve to mmap.
I'll add a comment to the enum.

369

Good question, I don't have an OS X machine to test on.
I believe it does though, because the previous code would also allocate with vm_allocate() and free with munmap().
I think munmap() is just a library call over the free equivalent of vm_allocate, but I'm not sure.

rts/sm/MBlock.c
88

Yes, allocation is O(n), but only for allocation of large (>= 2 MB) contiguous blocks, which should not happen often.
The worst case (for which I'm working on a test case) is a fully fragmented space, with 1 MB of allocation alternated to 1 MB of uncommitted memory, but hitting that is hard, because even for large objects the GC will evacuate the objects one after another, so objects "that are created at the same time" tend to be consecutive in memory, and this consecutive range of memory, even if the objects themselves are not freed at the same time, will be likely freed in one go at the next major GC.
If we find that O(n) is hit often and gets in the way, we could have different free lists for different sizes, that would point into the global free list (free O(n), allocate O(log size)), or go all the way to a binary tree.

On the issue of code duplication, one possibility would be to remove the free list in the block allocator altogheter, and immediately release the memory to the mblock allocator in free_mega_group. The mblock layer would then decide whether to wait for returnMemoryToOS, decommit or decommit+release free.
It should also consolidate the code between the different configure options - at the expense of having the free list always out of line.

188

Ugh, you're right, thanks for spotting this.

simonmar added inline comments.Nov 26 2014, 6:11 AM
rts/sm/MBlock.c
88

On the issue of code duplication, one possibility would be to remove the free list in the block allocator altogheter, and immediately release the memory to the mblock allocator in free_mega_group. The mblock layer would then decide whether to wait for returnMemoryToOS, decommit or decommit+release free. It should also consolidate the code between the different configure options - at the expense of having the free list always out of line.

Yes - if you were to create a benchmark that stresses this code path and show that there's little difference with the malloc-based free list then I'd be happy to go with this.

gcampax updated this revision to Diff 1743.Nov 26 2014, 7:14 PM

Updated with some of the review comments - now builds fine,
passes validate and should be bug free. I'll work on addressing
the other comments soon, but I wanted to give out a working
version in case some one else wants to play with it.

gcampax retitled this revision from RFC: two step allocator for 64-bit posix systems to Two step allocator for 64-bit systems.Nov 28 2014, 5:49 PM
gcampax updated this revision to Diff 1850.Dec 3 2014, 8:57 PM

Ready for merge, I hope.

Ok, all review comments have been addressed now, and this
passes validate locally on Linux and Windows.
I have two benchmarks for the O(n) allocation behavior
(in which the run time does not change significantly),
but I did not find how to submit nofib patches.

ezyang accepted this revision.Dec 3 2014, 11:04 PM

CR is all comment fixes, and one minor UI nit. But we need to test on Darwin before we can push this.

rts/posix/OSMem.c
96

but heh...?

134–135

We need to validate this on Darwin, which hasn't been done yet.

rts/sm/MBlock.c
39

The new state parameter is a bit funny, it would be best to say exactly why we need it (i.e. the specific use-case). I feel like there should be a better way to structure this.

46

together

47

iteration

48

space

95

say what this function does

122

And here, you should definitely explain what the state is for, in this concrete case.

196

I think it would be better for us to specifically report that we ran out of preallocated virtual address space, because this is not a strictly "out of memory" condition.

rts/win32/OSMem.c
437

MB = megablock, not megabyte!

testsuite/tests/rts/outofmem.stderr-ws-64
1 ↗(On Diff #1850)

Hm, is it inevitable that we lose the request size?

ezyang added a comment.Dec 5 2014, 1:37 AM

Build error on Darwin:

rts/posix/OSMem.c: In function ‘my_mmap’:

rts/posix/OSMem.c:149:0:
     error: ‘MAP_RESERVE’ undeclared (first use in this function)

rts/posix/OSMem.c:149:0:
     error: (Each undeclared identifier is reported only once

I'll try and fix it.

ezyang added a comment.Dec 5 2014, 2:27 AM

After fixing the typo, darwin validate errors

Unexpected failures:
   ../../libraries/unix/tests               user001 [bad stdout] (normal)
   cabal/cabal01                            cabal01 [bad exit code] (normal)
   cabal/cabal06                            cabal06 [bad stdout] (normal)
   cabal/sigcabal01                         sigcabal01 [bad stderr] (normal)
   ffi/should_run                           T2276_ghci [bad stdout or stderr] (ghci)
   ghc-api                                  T8628 [bad exit code] (normal)
   ghc-api                                  T8639_api [bad exit code] (normal)
   ghci/prog003                             prog003 [bad stderr] (ghci)
   ghci/scripts                             T8696 [bad stderr] (ghci)
   ghci/scripts                             ghci058 [bad stderr] (ghci)
   llvm/should_run/subsections_via_symbols  subsections_via_symbols [bad exit code] (normal)
   rts                                      T5435_dyn_asm [bad stdout] (normal)
   rts                                      linker_unload [bad exit code] (normal)

Unexpected stat failures:
   perf/compiler  T4801 [stat too good] (normal)
   perf/compiler  T9675 [stat not good enough] (optasm)

But I think a good chunk of these are unrelated.

ezyang added a comment.Dec 5 2014, 3:33 AM

Here is a baseline validate for comparison:

Unexpected failures:
   ../../libraries/unix/tests               user001 [bad stdout] (normal)
   cabal/cabal01                            cabal01 [bad exit code] (normal)
   cabal/cabal06                            cabal06 [bad stdout] (normal)
   cabal/sigcabal01                         sigcabal01 [bad stderr] (normal)
   ffi/should_run                           T2276_ghci [bad stdout or stderr] (ghci)
   ghci/prog003                             prog003 [bad stderr] (ghci)
   ghci/scripts                             T8696 [bad stderr] (ghci)
   ghci/scripts                             ghci058 [bad stderr] (ghci)
   llvm/should_run/subsections_via_symbols  subsections_via_symbols [bad exit code] (normal)
   rts                                      T5435_dyn_asm [bad stdout] (normal)

Unexpected stat failures:
   perf/compiler  T4801 [stat too good] (normal)
   perf/compiler  T9675 [stat not good enough] (optasm)
austin accepted this revision.Dec 8 2014, 9:54 AM

OK, LGTM then. @ezyang, do you want to lead this into HEAD and fix the Darwin failure with your patch?

ezyang requested changes to this revision.Dec 10 2014, 2:13 AM

OK, I've verified that these failures are specific to this patch. We better fix them.

=====> T8639_api(normal) 953 of 4339 [0, 0, 0] 
cd ./ghc-api && $MAKE -s --no-print-directory T8639_api    </dev/null >T8639_api.run.stdout 2>T8639_api.run.stderr
Wrong exit code (expected 0 , actual 2 )
Stdout:

Stderr:
T8639_api: internal error: stg_ap_v_ret
    (GHC version 7.9.20141209 for x86_64_apple_darwin)
    Please report this as a GHC bug:  http://www.haskell.org/ghc/reportabug
make[3]: *** [T8639_api] Abort trap: 6

*** unexpected failure for T8639_api(normal)
=====> T8628(normal) 954 of 4339 [0, 1, 0] 
cd ./ghc-api && $MAKE -s --no-print-directory T8628    </dev/null >T8628.run.stdout 2>T8628.run.stderr
Wrong exit code (expected 0 , actual 2 )
Stdout:

Stderr:
T8628: internal error: stg_ap_v_ret
    (GHC version 7.9.20141209 for x86_64_apple_darwin)
    Please report this as a GHC bug:  http://www.haskell.org/ghc/reportabug
make[3]: *** [T8628] Abort trap: 6

*** unexpected failure for T8628(normal)
This revision now requires changes to proceed.Dec 10 2014, 2:13 AM

Good news: the error tickles an ASSERT:

relocateSection: length = 2, thing = 0, baseValue = 0x10105803fc2  
relocateSection: looking up external symbol _exp  
               : type  = 1  
               : sect  = 0  
               : desc  = 0  
               : value = 0x0  
lookupSymbol: looking up _exp  
lookupSymbol: symbol not found  
lookupSymbol: looking up _exp with dlsym  
relocateSection: external symbol _exp, address 0x7fff94b667f0  
relocateSection: value = 0x7fff94b667f0  
T8628: internal error: ASSERTION FAILED: file rts/Linker.c, line 6626  
  
    (GHC version 7.9.20141209 for x86_64_apple_darwin)  
    Please report this as a GHC bug:  http://www.haskell.org/ghc/reportabug

OK, it looks like we should set USE_CONTIGUOUS_MMAP to be true for Darwin, because Mach-O 32 bit displacements are occasionally not enough, it would seem...

Status report: this patch is still causing a panic on Mac OS X (stg_ap_v_ret), even when contiguous mmap is always enabled. I've been trying to tease out the bug but it is slow going.

ezyang added a comment.Mar 9 2015, 6:34 PM

We have a key insight: the bug does not occur with dynamic linking, so we think it's a bug in the Mach-O static linker where relocations are being resolved improperly when they are too big.

gcampax and I figured out how to fix this: the problem is PIC'd code still needs to mapped be in the low 8GB segment, but you run out of space due to the terabyte reservation. gcampax is cooking up a fix.

gcampax updated this revision to Diff 2849.Apr 26 2015, 7:46 PM

This is the updated patch to fix the aforementioned bug.
It validates fine on OS X and Linux, unfortunately after
rebasing it picked up a problem in Windows and now it
does not work.

I wanted to fix it before updating again, but I have realized
I do not have time at the moment, so here's the patch if
you want to take a look.
The error I get is "not enough swap space to perform
the operation", as if it was trying to commit 1TB of
memory. Debugging is unreliable though: it appears that
VirtualReserve succeeds, but then the program fails in
a way the debugger does not understand.
Hope anyone can figure this out, otherwise it will be
necessary to disable this code on Windows.

gcampax updated this revision to Diff 2850.Apr 26 2015, 7:49 PM

Fixed lint warning

I skipped it in the previous diff because I
did not want to lose the comment I had written.

simonmar added inline comments.May 1 2015, 3:07 AM
rts/posix/OSMem.c
392

*above?

396

I don't quite follow this - I understand about the 4GB limit (because we still use the small memory model, and PIC doesn't change this), but where does the 8GB come from?

austin requested changes to this revision.Jun 30 2015, 12:09 PM

(Marking as 'Request Changes' due to lack of Windows support atm it seems)

This revision now requires changes to proceed.Jun 30 2015, 12:09 PM

Alright, it looks like I need this. Going to work on getting it into shape.

simonmar commandeered this revision.Jul 17 2015, 8:49 AM
simonmar edited reviewers, added: gcampax; removed: simonmar.

imcommandeeringthisdiff

simonmar updated this revision to Diff 3578.Jul 17 2015, 9:53 AM

A little formatting and refactoring

simonmar updated this revision to Diff 3579.Jul 17 2015, 9:54 AM

Add a test

simonmar updated this revision to Diff 3580.Jul 17 2015, 9:55 AM

Check the high-water mark when coalescing

I fixed the bug on Windows (update coming soon), but unfortunately I think this scheme might be doomed on Windows. If I run a make -j4, I get this:

VirtualAlloc MEM_RESERVE 1099512676352 bytes failed: The paging file is too small for this operation to complete.

So it looks like MEM_RESERVE has some limit that we're running into when there are multiple processes running simultaneously.

simonmar updated this revision to Diff 3583.Jul 17 2015, 2:39 PM
  • Fix Windows implementation (piggybacking on the old support for memory allocation didn't work, and it's simpler to just call VirtualAlloc directly).
  • disable use_large_address_space on Windows.

I say ship it. It's a bit disappointing that Windows can't handle this, but a confirmation that this was a Windows problem was enough to hook me up with Go having similar issues (since the reason I thought MEM_RESERVE would work on Windows was because this was what Go was doing):

	// Number of bits in page to span calculations (4k pages).
	// On Windows 64-bit we limit the arena to 32GB or 35 bits.
	// Windows counts memory used by page table into committed memory
	// of the process, so we can't reserve too much memory.
	// See https://golang.org/issue/5402 and https://golang.org/issue/5236.
	// On other 64-bit platforms, we limit the arena to 512GB, or 39 bits.
	// On 32-bit, we don't bother limiting anything, so we use the full 32-bit address.
	// On Darwin/arm64, we cannot reserve more than ~5GB of virtual memory,
	// but as most devices have less than 4GB of physical memory anyway, we
	// try to be conservative here, and only ask for a 2GB heap.

The links are good reading: https://golang.org/issue/5402 and https://golang.org/issue/5236 (scroll down).

Note that there is a little bit about Go relying on overcommitting causing problems for people on systems that don't overcommit; this is not relevant to us since we're not actually committing memory and relying on the OS to lazily initialize it if they don't actually use it.

ezyang accepted this revision.Jul 17 2015, 4:27 PM

Has this successfully validated on Darwin yet?

@ezyang, yes so the problem appears to be that Windows counts the size of page tables as committed memory, so the amount of memory you can reserve is limited by this.

We will need to find another solution on Windows though - the current HEAP_ALLOCED() implementation is much worse than I thought when using multiple cores with a large heap size, due to contention for the lock in the HEAP_ALLOCED_miss() path. We've been suffering pretty badly from this in our system.

Could someone validate on OS X for me? I don't have a machine set up to do this.

The last patch I submitted was successfully tested on Darwin.

I can make sure this one passes tests too, I have the build environment set up.

Well, we could always switch back to my old patch which copied all static data onto the dynamic heap >:-)

I observed the following test failures that might be related (but might also be a problem in the test or the driver, because the test program behaves reasonably, just the stdout doesn't match)

rts/T9579         T9579_outofheap_rtsall [bad stdout] (normal)
rts/T9579         T9579_outofheap_rtsall_no_suggestions [bad stdout] (normal)
rts/T9579         T9579_outofheap_rtsnone [bad stdout] (normal)
rts/T9579         T9579_outofheap_rtssome [bad stdout] (normal)
rts/T9579         T9579_stackoverflow_rtsall [bad stdout] (normal)
rts/T9579         T9579_stackoverflow_rtsall_no_suggestions [bad stdout] (normal)
rts/T9579         T9579_stackoverflow_rtsnone [bad stdout] (normal)
rts/T9579         T9579_stackoverflow_rtssome [bad stdout] (normal)

IIRC, this is because after this patch we can't easily say how many bytes were requested, is that right?

Well these tests are about heap exhaustion, not address space, so our behavior should not change...

OK nevermind, I checked again without the patch and I get the same failures (on 214596de224afa576a9c295bcf53c6941d6892e0, aka today's master, on OS X Yosemite, bootstrapped with ghc 7.10.1 from homebrew / clang 6.1.0).

Looks like this patch has no regressions on validate, so it should be good to go.

gcampax accepted this revision.Jul 21 2015, 12:48 AM
simonmar updated this revision to Diff 3640.Jul 22 2015, 11:50 AM

Update Windows comment

This revision was automatically updated to reflect the committed changes.