Change runtime linker to perform lazy loading of symbols/sections
ClosedPublic

Authored by Phyx on Jan 20 2016, 12:16 AM.

Details

Summary

The Runtime Linker is currently eagerly loading all object files on all
platforms which do not use the system linker for GHCi.

The problem with this approach is that it requires all symbols to be found.
Even those of functions never used/called. This makes the number of
libraries required to link things like mingwex quite high.

To work around this the rts was relying on a trick. It itself was compiled
with MingW64-w's GCC. So it was already linked against mingwex.
As such, it re-exported the symbols from itself.

While this worked it made it impossible to link against mingwex in user
libraries. And with this means no C99 code could ever run in GHCi on
Windows without having the required symbols re-exported from the rts.

Consequently this rules out a large number of packages on Windows.
SDL2, HMatrix etc.

After talking with @rwbarton I have taken the approach of loading entire
object files when a symbol is needed instead of doing the dependency
tracking on a per symbol basis. This is a lot less fragile and a lot
less complicated to implement.

The changes come down to the following steps:

  1. modify the linker to and introduce a new state for ObjectCode: Needed. A Needed object is one that is required for the linking to succeed. The initial set consists of all Object files passed as arguments to the link.
  1. Change ObjectCode's to be indexed but not initialized or resolved. This means we know where we would load the symbols, but haven't actually done so.
  1. Mark any ObjectCode belonging to .o passed as argument as required: ObjectState NEEDED.
  1. During Resolve object calls, mark all ObjectCode containing the required symbols as NEEDED
  1. During lookupSymbol lookups, (which is called from linkExpr and linkDecl in GHCI.hs) is the symbol is in a not-yet-loaded ObjectCode then load the ObjectCode on demand and return the address of the symbol. Otherwise produce an unresolved symbols error as expected.
  1. On unloadObj we then change the state of the object and remove it's symbols from the reqSymHash table so it can be reloaded.

This change affects all platforms and OSes which use the runtime linker.
It seems there are no real perf tests for GHCi, but performance shouldn't
be impacted much. We gain a lot of time not loading all obj files, and
we lose some time in lookupSymbol when we're finding sections that have
to be loaded. The actual finding itself is O(1) (Assuming the hashtnl is perfect)

It also consumes slighly more memory as instead of storing just the Address of
a symbol I also store some other information, like if the symbol is weak or not.

This change will break any packages relying on renamed POSIX functions
that were re-named and re-exported by the rts. Any packages following
the proper naming for functions as found on MSDN will work fine.

Test Plan

./validate on all platforms which use the Runtime linker.

Diff Detail

Repository
rGHC Glasgow Haskell Compiler
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
There are a very large number of changes, so older changes are hidden. Show Older Changes
Phyx updated this object.Feb 18 2016, 2:47 PM
Phyx edited edge metadata.
In D1805#53370, @thomie wrote:

validate --fast passes for me on OSX 10.11.2

You'll have to set DYNAMIC_GHC_PROGRAMS = NO in mk/validate.mk (that you create yourself), for this to be an effective test. By default ghc uses the system linker on Mac.

solaris is on the NoSharedLibsPlatformList though, maybe @kgardas should do a validate there.

@thomie solaris is in NoSharedLibs only in case of SOLARIS_BROKEN_SHLD being set to YES, which is only in case of Solaris 10. So far on all Solaris 11 updates shared linker was working well...

Phyx updated this revision to Diff 6774.Feb 25 2016, 3:36 PM
  • T11223: Removed white spaces
  • T11223: Fixing linux build
  • T1223: Fixed linux tests
Phyx added a comment.Feb 25 2016, 3:38 PM

It's ready for another review whenever you guys have time @simonmar and @rwbarton :)

erikd added a comment.Feb 25 2016, 6:56 PM

I'll be testing this on linux/powerpc and linux/armhf today.

Phyx added a comment.Feb 26 2016, 1:26 AM

Thanks @erikd :)
@RyanGlScott it should be ok to test SDL2 and logfloat using this version too.

The rebase won't add anything major, just fix package deps. Will commit that tonight.

Phyx edited edge metadata.Feb 26 2016, 1:26 AM
Phyx updated the Trac tickets for this revision.
erikd added a comment.EditedFeb 26 2016, 2:29 AM

Linux/powerpc worked fine, but linux/armhf is currently busted due to https://ghc.haskell.org/trac/ghc/ticket/11649 .

Phyx updated this revision to Diff 6782.Feb 27 2016, 1:38 AM
  • T11223: Fixed GHC Prim for use new linker
Phyx added a comment.Feb 27 2016, 1:41 AM

Thanks @erikd !
Diff has been rebased on master

@Phyx, do you mind rebasing on top of master again? Unfortunately, the commit you rebased on doesn't include a fix that is needed to compile the free library (a transitive dependency of sdl2).

I did notice something which might have broken with this change. A simple GLFW-b example now fails to interpret with the error Segmentation fault/access violation in generated code. Here is a gist with the code and runghc output.

Phyx updated this revision to Diff 6794.Feb 28 2016, 1:19 AM
Phyx edited edge metadata.
  • Rebased to current master HEAD
Phyx added a comment.EditedFeb 28 2016, 1:27 AM

@RyanGlScott Rebased, hmm that may be the issue with the static initializers that I've been trying to track down.
Haven't changed anything there but it seems to not like it when the object file has a .ctors section with no static constructors.

I'll take a look, could you also try commenting out the code on line 4337 in Linker.c? If it's the same bug then there wasn't a static initializer anyway and should run.

I see this same error with HEAD: Segmentation fault/access violation in generated code, when running ./validate --slow. Mostly in tests for the prof_ways.

Phyx added a comment.Feb 28 2016, 4:41 AM

HEAD fails a validation build here,

rts\ProfHeap.c:748:24: error:
     error: format '%lu' expects argument of type 'long unsigned int', but argument 3 has type 'long long unsigned int' [-Werror=format=]
           fprintf(hp_file, "VOID\t%lu\n", (unsigned long)(census->void_total) * sizeof(W_));

I didn't re-run validate after the rebase :/

Set WERROR= in mk/validate.mk to work around that, and continue with ./validate --no-clean --slow.

Phyx added a comment.Feb 28 2016, 8:50 AM
In D1805#57530, @thomie wrote:

I see this same error with HEAD: Segmentation fault/access violation in generated code, when running ./validate --slow. Mostly in tests for the prof_ways.

Hmm I don't have a segfault, only assert failures and failures due to test output changes. Can you give me a test that's failing for you?

To keep better track of which libraries are failing on which versions of GHC, I've started a gist here.

@Phyx, by line 4337, do you mean this one?

4337                 (*init)(argc, argv, envv);

I commented that out, but still experienced the same error when running the GLFW-b example in GHCi.

$ make TEST=T5313 WAY=profasm
... 
Segmentation fault/access violation in generated code
...

If you can't reproduce that, it might be caused by the old msys2 I'm using:

$ uname -a
... 2.0.0(0.275/5/3) ...

Ryan: which msys2 version are you using?

According to uname -a:

MINGW64_NT-10.0 T450s-Win64 2.4.1(0.294/5/3) 2016-02-03 10:57 x86_64 Msys
Phyx added a comment.Mar 3 2016, 3:08 PM

Sorry for the disappearance, have been a bit busy.

I'm on MINGW64_NT-10.0 Kino 2.4.0(0.292/5/3) 2015-11-23 08:33 x86_64 Msys but I can't reproduce the profasm failure.
But the build isn't entirely clean so I'll start a rebuild before bed. In the mean time I'll take a look at the GLFW-b failure.

bgamari requested changes to this revision.Mar 21 2016, 9:39 AM
bgamari edited edge metadata.

How is this looking @Phyx?

This revision now requires changes to proceed.Mar 21 2016, 9:39 AM
Phyx added a comment.Mar 21 2016, 3:54 PM

@bgamari I had thought the problem was related to a similar bug in GCC but the updated binaries still give the same issue. So since the patch isn't that big I will just try to do a manual bisect and see what triggers it. Hopefully I can find the cause of the weird segfault. Have a long weekend coming up so more time, I am hoping to put this to bed this week.

Phyx updated this revision to Diff 7045.Mar 25 2016, 9:21 AM
Phyx edited edge metadata.
  • T11223: Fixed segfault by forcing atexit from msvcrt to be used instead of libmingw32

Dishearteningly, I still get the error Segmentation fault/access violation in generated code when trying to interpret either this GLFW example via runghc.

Phyx added a comment.Mar 27 2016, 3:52 AM

@RyanGlScott The segfault doesn't seem to be the fault of the runtime linker.

The compiled version of the example fails too.

In fact the smallest reproducible example is:

module Main where

import qualified Graphics.UI.GLFW as G

main :: IO ()
main = do
  successfulInit <- G.init
  return ()

The example also seems to *sometimes* work and I see the OpenGL Window. Suggesting that this is some kind of timing issue and not linking..

The offending function comes from bindings-GLFW and is the _glfwPlatformGetMonitors function.

I'll keep looking into it, but I don't think this patch is the issue. Was this working before?

Hm, I could have sworn it was working in GHCi 7.10.3, but if I try running it multiple times in succession, it does seem to unpredictably fail sometimes with an error. What is different depending on the version of GHC you're using is the exact error message.

  • On GHCi 7.10.3, if it fails when running on MSYS2, it will simply exit with the message Segmentation fault. On PowerShell, a window will open with the error message ghc.exe has stopped working (which is probably Windows's way of dealing with segfaults).
  • On GHCi 8.1 with your changes, running it on either PowerShell or MSYS2 will make it simply exit with the message Segmentation fault/access violation in generated code.

I don't know if the difference in error message is significant or not. It could just be that your changes improved exception handling and caused this line of code to be run before the whole program came screeching to a halt. Either way, it sounds like this isn't a GHCi regression.

One thing that is troubling, though, is that you claim the compiled executables don't work either. On my system, the compiled executables work reliably on both GHC 7.10.3 and 8.1 (with your changes).

Phyx added a comment.Mar 29 2016, 5:10 PM

Hm, I could have sworn it was working in GHCi 7.10.3, but if I try running it multiple times in succession, it does seem to unpredictably fail sometimes with an error. What is different depending on the version of GHC you're using is the exact error message.

  • On GHCi 7.10.3, if it fails when running on MSYS2, it will simply exit with the message Segmentation fault. On PowerShell, a window will open with the error message ghc.exe has stopped working (which is probably Windows's way of dealing with segfaults).
  • On GHCi 8.1 with your changes, running it on either PowerShell or MSYS2 will make it simply exit with the message Segmentation fault/access violation in generated code. I don't know if the difference in error message is significant or not. It could just be that your changes improved exception handling and caused this line of code to be run before the whole program came screeching to a halt. Either way, it sounds like this isn't a GHCi regression.

Yeah I'm not sure if that patch was ever back ported to 7.10, but essentially without it segfaults and other OS faults are unhandled in x86_64 builds. Because they are you get the standard Watson crash screen.

One thing that is troubling, though, is that you claim the compiled executables don't work either. On my system, the compiled executables work reliably on both GHC 7.10.3 and 8.1 (with your changes).

The compiled full example does seem to work, but the minimal example I posted, e.g. just calling init and stopping crashes. The linking should be pretty deterministic as the link order doesn't change between invocations, so because it does sometimes works it does seem like it's more of a program issue. Do any of your other libraries you use still have problems?

In D1805#60245, @Phyx wrote:

The compiled full example does seem to work, but the minimal example I posted, e.g. just calling init and stopping crashes. The linking should be pretty deterministic as the link order doesn't change between invocations, so because it does sometimes works it does seem like it's more of a program issue. Do any of your other libraries you use still have problems?

Huh, that's quite odd. In any event, it sounds like this is a issue specific to the GLFW library itself, not to GHC.

I haven't noticed any other C-based Haskell libraries breaking in GHCi (other than ones which were already broken), so you now have my unequivocal approval :)

Phyx added a comment.Mar 30 2016, 12:58 AM
In D1805#60245, @Phyx wrote:

The compiled full example does seem to work, but the minimal example I posted, e.g. just calling init and stopping crashes. The linking should be pretty deterministic as the link order doesn't change between invocations, so because it does sometimes works it does seem like it's more of a program issue. Do any of your other libraries you use still have problems?

Huh, that's quite odd. In any event, it sounds like this is a issue specific to the GLFW library itself, not to GHC.

I think it's still might be a GHC issue, just not one related to this patch. Looking more into this, the bindings-GLFW library heavily uses two global variables that it declares as extern

a Boolean _glfwInitialized and a state value _glfw .

The linker correctly initializes and returns an address for the extern values, but they seem to sometimes lose the value pointed to by the pointer. The pointer itself doesn't change so the linker is still returning the correct address. _glfwInitialized seems to no longer be dereference able once it leaves the function that sets it. Not really sure what could be the cause.

I haven't noticed any other C-based Haskell libraries breaking in GHCi (other than ones which were already broken), so you now have my unequivocal approval :)

Great! just needs someone to review the patch then :)

It looks like @rwbarton and you have done a great job working through this tricky area. Thanks to you both!

Here are some comments from my read-through; they are mostly addressing readability.

libraries/base/System/Posix/Internals.hs
404

A comment here describing the reason for the underscores would be helpful.

libraries/base/base.cabal
339–347

A brief comment here describing what these libraries are and what they respectively provide would be nice.

libraries/ghc-prim/ghc-prim.cabal
58

Something similar to the suggested base.cabal comment would be nice as well.

rts/Linker.c
157

Thanks for this note!

173

s/their/they are/ presumably.

203

Some discussion motivating the reason for this scheme would be great, along with a reference to Trac #11223.

545

Might I suggest another wording?
The existing symbol is weak with a zero value; replace it with the new symbol.

569–570

Indeed this does sound a bit dangerous. Perhaps we should simply fail in this case to be on the safe side.

2718

I can avoid storing the SymbolInfo I think if I move that information to ObjectCode. e.g. hold a set with all symbols in that ObjectCode that are weak. Then I can revert the SymbolInfo. Would that be better? memory wise I suppose it would since the majority of the symbols aren't weak.

If we land this patch without performing this refactoring could you open a ticket to ensure we don't forget about it, @Phyx?

testsuite/tests/rts/T11223/all.T
63

I don't understand; if this is so then why is there so much logic above dealing with weak symbols? If this really is true then we should be sure to note it in Linker.c with a reference to these tests.

@austin, you might also want to have a look at this; the more eyes on this linker business the better.

erikd accepted this revision.Mar 30 2016, 1:59 PM
erikd edited edge metadata.

WIth @bgamari 's suggested clarifications, this LGTM.

This was obviously a lot of work Thanks!

Phyx updated this revision to Diff 7151.Mar 31 2016, 12:53 AM
Phyx marked 8 inline comments as done.
Phyx edited edge metadata.
  • T11223: Added notes to release notes and POSIX Internals
  • T11223: Added comments to cabal files about new libraries
  • T11223: Updates notes and comments
  • T11223: Added weak symbols note.
Phyx added a comment.Mar 31 2016, 12:53 AM

Thanks for looking things over @bgamari and @erikd :)

I've updated the diff based on the comments.

rts/Linker.c
203

Hmm I'm afraid I don't quite get what you mean here :) Did you mean something on the way the runtime linker works on a whole or just how it's controlled by GHCi? Or something else?

545

ah yeah, the old one was rather confusing hehe

569–570

This has been changed, it does currently error out, it's the test case t_11223_simple_duplicate. The tests were made by comparing the behavior against ld's

2718

Sure, that's no problem. I don't mind doing it now though either, so if prefer it done now then I can do it :)

testsuite/tests/rts/T11223/all.T
63

Yeah, so what happened was, in the earlier review we noted that the behavior of the weak symbols were incorrect and hence all the code to correct it. only afterwards did I notice that weak symbols isn't actually supported at the moments. symbols with value 0 would for instance be ignored by the current linker.

It turns out, what was implemented was a partial COMDAT support which supported only part of what's required for full weak symbols support.

I initially set out to correct this, but then I couldn't see clearly the form the patch should take. Basically I was conflicted between if the patch should only support weak symbols in terms of linking C code with other C code (this should be mostly working now), or if weak symbols should be useable from Haskell too. In which case a foreign import should be able to return a maybe or something.

But I figured this should be another ticket.

Phyx added inline comments.Mar 31 2016, 1:04 AM
libraries/base/System/Posix/Internals.hs
404

Since this impacts user libraries as well, I added a note to docs/users_guide/8.0.1-notes.rst as well. @thomie suggested adding an entry to the migration guide wiki earlier, is that still needed?

Phyx updated this revision to Diff 7231.Apr 10 2016, 8:33 AM
Phyx edited edge metadata.
  • T11223: changed issue number in tests for weak symbols
Phyx added a comment.Apr 10 2016, 8:33 AM

Rebased on master and updated tests with new issue numbers.

rts/Linker.c
2718

Issue Trac #11816 has been created for this task.

testsuite/tests/rts/T11223/all.T
63

Ticket Trac #11817 has been created for this.

Almost there!

Thank you so much for your patience, @Phyx!

rts/Linker.c
203

I just mean explain why we want lazy loading behavior. You articulate this quite nicely in the diff description; it would be nice to rehash it here.

636

Thanks for the comment!

2718

Thanks! Now can you add a comment here mentioning the ticket to ensure that it is seen later?

This revision was automatically updated to reflect the committed changes.
Phyx added a comment.Apr 11 2016, 12:16 AM

Ah no problem :)

I fell asleep before finishing the notes. I'll commit those separately.

lukexi added a subscriber: lukexi.EditedJul 8 2016, 9:47 PM

@Phyx @RyanGlScott oh wow, so nice to hear of this investigation! The non-deterministic GLFW crash has been plaguing Rumpus (https://github.com/lukexi/rumpus) for months and driving me totally mad. I only managed to release at all because I found a super black magic workaround that seems to make it happen much more rarely but I barely understand it.

So first off it would add years to my life to track that particular issue down and I'd love to help.

Secondly, I'm now trying to upgrade Rumpus to GHC 8 to take advantage of all these wonderful improvements but am hitting an issue: the initial port seemed to go fine and I can run rumpus from the MSYS2 command line without issue. But when I try to run the standalone EXE it fails with

rumpus.exe: addDLL: mingw32 (Win32 error 126): The specified module could not be found.

I tried copying in libmingw32.a to be a rumpus.exe sibling like the rest of the required DLLs but it didn't help.

I guess this dependency comes from the lines you added in ghc-prim.cabal? But I have no idea how to fix it.

Phyx added a comment.Jul 8 2016, 11:43 PM
In D1805#69313, @lukexi wrote:

So first off it would add years to my life to track that particular issue down and I'd love to help.

I think we've fixed it in the version in HEAD that hasn't been released yet https://ghc.haskell.org/trac/ghc/ticket/12031 GLFW should run without issues now.
If it doesn't, let me know :)

Secondly, I'm now trying to upgrade Rumpus to GHC 8 to take advantage of all these wonderful improvements but am hitting an issue: the initial port seemed to go fine and I can run rumpus from the MSYS2 command line without issue. But when I try to run the standalone EXE it fails with

rumpus.exe: addDLL: mingw32 (Win32 error 126): The specified module could not be found.

Oh this is interesting, I hadn't considered this. I'm guessing Rumpus uses TH? Which would explain the error... We usually ask GCC for the location of libmingw32.a but since it's already been compiled gcc is no longer there to be asked. Weird that the testsuite didn't catch this..

I tried copying in libmingw32.a to be a rumpus.exe sibling like the rest of the required DLLs but it didn't help.

This could work, but findArchive doesn't have the same semantics as findDll, so it never considers the program directory as a search path.
Without gcc there, it defaults to looking for it in *only* paths it received via -L commands.

I guess this dependency comes from the lines you added in ghc-prim.cabal? But I have no idea how to fix it.

Yes, it's there for the C99 support base now needs. I don't think it actually needs it anymore for TH since all the symbol base needs are already there,
but I'll think about it some more after I wake up...

As a workaround, if you have it in the same folder as the exe, in the library part of your cabal file add extra-lib-dirs: . which should get passsed on
to findArchive.

You are a hero Tamar!! Thanks so much. Will try your workaround right now and let you know.

lukexi added a comment.Jul 9 2016, 2:15 AM

Hmm, no luck I'm afraid. I have libmingw32.a and libmingwex.a as siblings of rumpus.exe, added the extra-lib-dirs: . to both the Rumpus library and executable, and also added it to the libraryPaths parameter of DynFlags where I initialize the GHC API (Rumpus integrates a GHC interpreter, which is likely crucial information : ) - and yes, it uses Template Haskell in both its own code and when interpreting)

Phyx added a comment.Jul 9 2016, 3:08 AM
In D1805#69317, @lukexi wrote:

Where I initialize the GHC API (Rumpus integrates a GHC interpreter, which is likely crucial information : ) - and yes, it uses Template Haskell in both its own code and when interpreting)

Oh, that is indeed interesting. Are you on IRC by chance? would go a bit faster, otherwise i'll create a new ticket and we can move the discussion there :). Also where in the Rumpus do you use the API?
Can I install it easily? What I'm wondering, if you're using the API to actually compile code then it needs a GCC (unless i'm mistaken..)

lukexi added a comment.Jul 9 2016, 3:20 AM

Yes, good idea, I'll hop on now! I'll answer your questions there : )

lukexi added a comment.Jul 9 2016, 5:12 AM

And just to record the solution @Phyx found for me on IRC for future searchers, the solution was to both copy the newly-needed-as-of-8.0.1 libmingwex.a and libmingw32.a to be siblings of the exe, and to change the gcc impersonator (which is a trick by Manuel Chakravarty to avoid including a real gcc in a relocatable GHC API implementation, since it's just used to locate libraries) to return absolute paths for those particular files: https://github.com/lukexi/rumpus/blob/ghc8/util/fakeGCC.hs