Make Windows linker more robust to unknown sections
ClosedPublic

Authored by Phyx on Sep 14 2015, 4:37 PM.

Details

Summary

The Windows Linker has 3 main parts that this patch changes.

  1. Identification and classification of sections
  2. Adding of symbols to the symbols tables
  3. Reallocation of sections

1

Previously section identification used to be done on a whitelisted basis. It was also exclusively being done based on the names of the sections. This meant that there was a bit of a cat and mouse game between GCC and GHC. Every time GCC added new sections there was a good chance GHC would break. Luckily this hasn't happened much in the past because the GCC versions GHC used were largely unchanged.

The new code instead treats all new section as CODE or DATA sections, and changes the classifications based on the Characteristics flag in the PE header. By doing so we no longer have the fragility of changing section names. The one exception to this is the .ctors section, which has no differentiating flag in the PE header, but we know we need to treat it as initialization data.

The check to see if the sections are aligned by 4 has been removed. The reason is that debug sections often time are 1 aligned but do have relocation symbols. In order to support relocations of .debug sections this check needs to be gone. Crucially this assumption doesn't seem to be in the rest of the code. We only check if there are at least 4 bytes to realign further down the road.

2

The second loop is iterating of all the symbols in the file and trying to add them to the symbols table.
Because the classification of the sections we did previously are (currently) not available in this phase we still have to exclude the sections by hand. If they don't we will load in symbols from sections we've explicitly ignored the in # 1. This whole part should rewritten to avoid this. But didn't want to do it in this commit.

3

Finally the sections are relocated. But for some reason the PE files contain a Linux relocation constant in them 0x0011 This constant as far as I can tell does not come from GHC (or I couldn't find where it's being set). I believe this is probably a bug in GAS. But because the constant is in the output we have to handle it. I am thus mapping it to the constant I think it should be 0x0003.

Finally, static linking *should* work, but won't. At least not if you want to statically link libgcc with exceptions support. Doing so would require you to link libgcc and libstd++ but also libmingwex. The problem is that libmingwex also defines a lot of symbols that the RTS automatically injects into the symbol table. Presumably because they're symbols that it needs. like coshf. The these symbols are not in a section that is declared with weak symbols support. So if we ever want to get this working, we should either a) Ask mingw to declare the section as such, or b) treat all a imported symbols as being weak. Though this doesn't seem like it's a good idea..

Test Plan

Running ./validate for both x86 and x86_64

Also running the specific test case for Trac #10672

make TESTS="T10672_x86 T10672_x64"

Diff Detail

Repository
rGHC Glasgow Haskell Compiler
Branch
rework-windows-pe-linker
Lint
Lint OK
Unit
No Unit Test Coverage
Build Status
Buildable 5626
Build 6101: GHC Patch Validation (amd64/Linux)
Build 6100: arc lint + arc unit
Phyx updated this revision to Diff 4158.Sep 14 2015, 4:37 PM
Phyx retitled this revision from to Make Windows linker more robust to unknown sections.
Phyx updated this object.
Phyx edited the test plan for this revision. (Show Details)
Phyx added reviewers: ezyang, thomie.
Phyx updated the Trac tickets for this revision.
Phyx added a subscriber: Restricted Project.
ezyang accepted this revision.Sep 14 2015, 4:58 PM
ezyang edited edge metadata.

I haven't checked the details but I quite like it. Can we get the test case added?

This revision is now accepted and ready to land.Sep 14 2015, 4:58 PM
Phyx added a comment.Sep 15 2015, 9:43 AM

@ezyang yeah that should be possible. I'll add it.

Phyx updated this revision to Diff 4163.Sep 15 2015, 12:08 PM
Phyx edited edge metadata.
  • PE: Added test for T10672
Phyx edited the test plan for this revision. (Show Details)Sep 15 2015, 12:11 PM
Phyx edited edge metadata.
Phyx updated this revision to Diff 4164.Sep 15 2015, 1:52 PM
  • PE: renamed test cases to fix harbormaster
Phyx edited the test plan for this revision. (Show Details)Sep 15 2015, 1:53 PM
Phyx edited edge metadata.
ezyang added inline comments.Sep 15 2015, 1:56 PM
rts/Linker.c
3741

I'm just thinking... could we link to a URL of the PE spec doc somewhere? :)

4380

Are these... mutually exclusive? Should we complain if multiple of these bits are set?

4721

This is 0x11 right? A more in-depth comment (c.f. your commit message) would be good here, for people who are just reading the code.

Phyx updated this revision to Diff 4166.Sep 15 2015, 5:23 PM
Phyx marked 3 inline comments as done.
  • PE: updated links and documentation on PE spec
  • PE: added note for linux relocation constant found in .obj files.
Phyx added a comment.Sep 15 2015, 5:24 PM

Updated based on feedback comments

rts/Linker.c
3741

There actually is a section at the start of the windows specific section of the Linker with links :). I have updated that section and replaced the dead links.

4380

The specs is a bit vague on this, but based on the characteristics of the reserved sections and different relocations for these I would say the _CNT_ flags are mutually exclusive.

But for us, it doesn't matter, because basically at the very end we just check to see if the section is anything other than an SECTIONKIND_OTHER and so any value of the _CNT_ flags would qualify it. This classification is then thrown away, which is a shame, if I still had it I wouldn't need the check on the section names on lines 4586-4594.

I want to rewrite these two loop so I keep the information around and can re-use it, but I thought it best to make a separate task for that, or perhaps it's better to do that now?

4721

Yeah this is the constant I was referring to. I've added a note for it.

ezyang added inline comments.Sep 15 2015, 5:28 PM
rts/Linker.c
4380

I'd save it for later.

austin requested changes to this revision.Sep 18 2015, 9:55 AM
austin edited edge metadata.

Nice! A few nits here and there.

rts/Linker.c
3745–3752

Consistent alignment FTW!

4415

goodone

4771–4782

mfw

confusedjackie

testsuite/tests/rts/T10672/CxxyLib.hs
1 ↗(On Diff #4166)

I'm sure the answer is obvious, but couldn't we just roll this foreign import into Main.hs? Or are the 3 modules each required together for the regression?

Just wondering. That's one less file and a few deleted lines in the Makefile...

testsuite/tests/rts/T10672/LICENSE
1 ↗(On Diff #4166)

This is the same 3-clause license we already use - could you just move this Copyright into the actual .hs file, as a comment at the top?

testsuite/tests/rts/T10672/Main.hs
1 ↗(On Diff #4166)

-- Copyright (C) 2015, Luke Iannini

testsuite/tests/rts/T10672/Printf.hs
4 ↗(On Diff #4166)

Paper? :)

testsuite/tests/rts/T10672/all.T
12 ↗(On Diff #4166)

Newline.

testsuite/tests/rts/T10672/cxxy.cpp
5 ↗(On Diff #4166)

This is redundant, so can we just get rid of all the #if's completely?

26 ↗(On Diff #4166)

Add a newline.

This revision now requires changes to proceed.Sep 18 2015, 9:55 AM
Phyx updated this revision to Diff 4203.Sep 19 2015, 9:35 AM
Phyx edited edge metadata.
  • PE: cleaned up test files
Phyx marked 5 inline comments as done.Sep 19 2015, 9:42 AM

Removed the CxxLib and LICENSE files and updated the rest with the feedback. :)

testsuite/tests/rts/T10672/Printf.hs
5 ↗(On Diff #4203)

Seems to be the SPJ paper "Template Meta-programming for Haskell" :)

Phyx planned changes to this revision.Sep 19 2015, 3:23 PM
Phyx marked an inline comment as done.

Just making sure the tests all still pass.

Phyx requested a review of this revision.Sep 19 2015, 11:38 PM

Tests passes, does it pass for you as well @thomie ? If so must be something in conjunction with D1242

Phyx added a comment.EditedSep 20 2015, 4:15 AM

@thomie it also passes together with D1242 for me, so I'm guessing it's something with the test if it does fail for you.

My guess is it can't find one of the libraries, but that shouldn't be the case since the GCC wrapper we make will point it to the right folder.
If you can reproduce the failure can you let me know the error?

thomie requested changes to this revision.Sep 20 2015, 4:36 AM
thomie edited edge metadata.
Makefile:6: recipe for target 'T10672_x64' failed

Stderr:
<command line>: user specified .o/.so/.DLL could not be loaded (addDLL: could not load DLL)
Whilst trying to load:  (dynamic) gcc_s_seh-1
Additional directories searched: (none)
ghc-stage2.exe: gcc_s_seh-1: The specified module could not be found.
make[1]: *** [T10672_x64] Error 1

*** unexpected failure for T10672_x64(normal)
This revision now requires changes to proceed.Sep 20 2015, 4:36 AM
Phyx added a comment.Sep 21 2015, 11:18 PM

Ok, this seems to be not passing for @thomie because the Haskell Linker can't load one of the dlls for some reason.
This logic is mostly in Linker.hs and addDll in Linker.c which I haven't touched yet.

@thomie is looking into the failure as I can't reproduce it locally.

Phyx added a comment.Sep 24 2015, 7:47 AM

I'll set up a VM this weekend to see if I can reproduce the issue so I can fix it.

Maybe someone on the #wtf can run this test and see if it fails for them too?

olsner added a subscriber: olsner.Sep 24 2015, 1:03 PM
Phyx updated this revision to Diff 4338.Sep 27 2015, 4:26 PM
Phyx edited edge metadata.
  • Testsuite: add mingw/bin to path on Windows
Phyx added a comment.Sep 27 2015, 4:27 PM

OK, So the test should be running now.
The reason this was failing for @thomie was that I had various GCC bin folder on my path. When Linker.hs fails to find the libraries via their short name it will pass the short name to Linker.c where addDLL will try various variants including lib%s.DLL. But try means just call LoadLibrary. LoadLibrary will then use the standard Windows loader way of searching for dependencies. Ultimately it checks the path and finds it.

That Linker.hs does not find libs using short name without them being on your path is a bug and I'll fix that in Trac #1883.
Another issue is that the Linker does not tell Windows how to resolve any dependencies of the library that you pass in, as LoadLibrary will automatically try to resolve any DLLs in the library's import table. As it happens libwinpthread-1.dll is a dependency to one of the libraries. Unless the`GCC` bin folder is on the path this will fail as well.

This has been patched by @thomie's patch which adds the inplace GCC's bin folder to the path when the tests are run. This also has the added bonus of allowing us to actually run the linked executable since the dependencies should be found now.

I will create a new ticket to properly solve this.

thomie added inline comments.Sep 28 2015, 10:22 AM
rts/Linker.c
4720

Table 4.10: Relocation Types (page 71) from http://www.x86-64.org/documentation/abi.pdf says:

NameValueCalcuation
R_X86_64_PC322S + A - P
R_X86_64_GOT323G + A
R_X86_64_3210S + A
R_X86_64_32S11S + A
R_X86_64_DTPOFF6417
  • You change the comment for relocation type 2 to R_X86_64_PC32. I don't think that is correct, because the table mentions calculation S + A - P for relocation type 2 (R_X86_64_PC32), whereas the original comment (R_X86_64_32, ELF relocation type 10) seems to match the code (S + A). It should just be made much more clear what is going on here with these relocation type renumberings.
  • The comment for case 3 seems correct (R_X86_64_32S), but the table says R_X86_64_GOT32 for relocation type 3. Please make it clear in a comment or note that our relocation type 3 corresponds to ELF relocation type 11 (if I understand it correctly).
  • The same question for case 17.
  • The values in the table are not in hexadecimal notation. Is it just a coincidence that relocation type 17 (0x11) corresponds to ELF relocation type 11 (R_X86_64_32S)?
  • Why do they all go into the same code block. What is the difference between IMAGE_REL_AMD64_ADDR32 and IMAGE_REL_AMD64_ADDR32NB? Note that in the case of ELF, R_X86_64_32 and R_X86_64_32S don't share the same code.
4787

I think this comment was actually correct. The code seems to match the calculation for R_X86_64_PC32 (S + A - P) (I don't know about the -4), not the one for R_X86_64_PLT32 (L + A - P) (although I don't know what L is here).

Phyx added a comment.Sep 30 2015, 4:01 PM

Sorry, haven't had the time to respond yet. Will make some time tomorrow.

Phyx updated this revision to Diff 4404.Oct 3 2015, 4:48 AM
Phyx marked 2 inline comments as done.
Phyx edited edge metadata.
  • PE: Updated comments in Linker.c
Phyx added inline comments.Oct 3 2015, 4:55 AM
rts/Linker.c
4720

You change the comment for relocation type 2 to R_X86_64_PC32. I don't think that is correct, because the table mentions calculation S + A - P for relocation type 2 (R_X86_64_PC32), whereas the original comment (R_X86_64_32, ELF relocation type 10) seems to match the code (S + A). It should just be made much more clear what is going on here with these relocation type renumberings.

Agreed, looking at the two numbers (ELF + PE) confused me, I tried to clear it up by adding both constants but the comment was already wrong and I didn't revisit them.

The comment for case 3 seems correct (R_X86_64_32S), but the table says R_X86_64_GOT32 for relocation type 3. Please make it clear in a comment or note that our relocation type 3 corresponds to ELF relocation type 11 (if I understand it correctly).

Yes this is correct, the switch values are the PE constant values not the ELF ones. I have added both constant values to the comments.

The same question for case 17.

17 doesn't have a PE equivalent. It's getting leaked into the PE file. I couldn't find where, but I did not find anything in the assembly file we generate.

The values in the table are not in hexadecimal notation. Is it just a coincidence that relocation type 17 (0x11) corresponds to ELF relocation type 11 (R_X86_64_32S)?

I must admit I don't really know here. The constant just doesn't exist in the PE spec so it's unclear to me what it was intended to be. But from looking at the code it was mapped to before (which seems to have been working before) It should be one of the R_X86_64_32[S] values.

Looking at the output of objdump and dumpbin also seem to confirm this:

dumpbin gets confused:

>dumpbin /relocations "testsuite\tests\rts\T10672\Main.o"
Microsoft (R) COFF/PE Dumper Version 12.00.40629.0
Copyright (C) Microsoft Corporation.  All rights reserved.


Dump of file testsuite\tests\rts\T10672\Main.o

File Type: COFF OBJECT

RELOCATIONS #1
                                                Symbol    Symbol
 Offset    Type              Applied To         Index     Name
 --------  ----------------  -----------------  --------  ------
 00000026  ADDR32                     00000000        1B  talkToCxx
 0000002E  Unknown (0x0011)                            A
 0000007B  REL32                      00000000        1C  suspendThread

but looking at the output of objdump it shows that it's supposed to be R_X86_64_32S

$ objdump -r "testsuite\tests\rts\T10672\Main.o"

testsuite\tests\rts\T10672\Main.o:     file format pe-x86-64

RELOCATION RECORDS FOR [.text]:
OFFSET           TYPE              VALUE
0000000000000026 R_X86_64_32       talkToCxx
000000000000002e R_X86_64_32S      .text
000000000000007b R_X86_64_PC32     suspendThread

Why do they all go into the same code block. What is the difference between IMAGE_REL_AMD64_ADDR32 and IMAGE_REL_AMD64_ADDR32NB?

IMAGE_REL_AMD64_ADDR32 is a VA (including base address) and IMAGE_REL_AMD64_ADDR32NB is an RVA (without the base address).
My understanding is that since we're only doing relocation of object files (and not dynamic runtime relocations) that since the base is always going to be 0 that the VA and RVA will both be relative to 0. That's why they're mapped to the same thing.

Note that in the case of ELF, R_X86_64_32 and R_X86_64_32S don't share the same code.

They don't share the same code, but the code does the same thing. The only difference is an extra overflow check on R_X86_64_32S.

4787

You're right, this does not seem to be going through the PLT, so R_X86_64_PC32 seems like the correct value. It was erroneously changed because I looked at the constant in the ELF relocation table instead of the code.

The displacement of 4 is explained in the x86 relocations:

Tricky.  We have to insert a displacement at
pP which, when added to the PC for the _next_
insn, gives the address of the target (S).
Problem is to know the address of the next insn
when we only know pP.  We assume that this
literal field is always the last in the insn,
so that the address of the next insn is pP+4
-- hence the constant 4.
Also I don't know if A should be added, but so
far it has always been zero.

SOF 05/2005: 'A' (old contents of *pP) have been observed
to contain values other than zero (the 'wx' object file
that came with wxhaskell-0.9.4; dunno how it was compiled..).
So, add displacement to old value instead of asserting
A to be zero. Fixes wxhaskell-related crashes, and no other
ill effects have been observed.

Update: the reason why we're seeing these more elaborate
relocations is due to a switch in how the NCG compiles SRTs
and offsets to them from info tables. SRTs live in .(ro)data,
while info tables live in .text, causing GAS to emit REL32/DISP32
relocations with non-zero values. Adding the displacement is
the right thing to do.
thomie accepted this revision.Oct 3 2015, 5:50 AM
thomie edited edge metadata.

Ok, let's go!

austin accepted this revision.Oct 3 2015, 12:58 PM
austin edited edge metadata.

I give you +1 seal.

sealofapproval

This revision is now accepted and ready to land.Oct 3 2015, 12:58 PM
This revision was automatically updated to reflect the committed changes.