[WIP] Add support for SIMD operations in the NCG
Needs Review · Public

Authored by Abhiroop on Jun 7 2018, 3:42 PM.

Details

Summary

This adds support for constructing vector types from Float#, Double#, etc. and performing arithmetic operations on them.

Diff Detail

Repository
rGHC Glasgow Haskell Compiler
Branch
simd
Lint
Lint Warnings (Excuse: It's okay)
Severity  Location                                            Code  Message
Warning   compiler/cmm/CmmExpr.hs:501                         TXT3  Line Too Long
Warning   compiler/cmm/PprC.hs:710                            TXT3  Line Too Long
Warning   compiler/codeGen/CgUtils.hs:60                      TXT3  Line Too Long
Warning   compiler/codeGen/StgCmmPrim.hs:1694                 TXT3  Line Too Long
Warning   compiler/codeGen/StgCmmPrim.hs:1697                 TXT3  Line Too Long
Warning   compiler/nativeGen/X86/CodeGen.hs:861               TXT3  Line Too Long
Warning   compiler/nativeGen/X86/CodeGen.hs:871               TXT3  Line Too Long
Warning   compiler/nativeGen/X86/CodeGen.hs:872               TXT3  Line Too Long
Warning   compiler/nativeGen/X86/CodeGen.hs:976               TXT3  Line Too Long
Warning   compiler/nativeGen/X86/CodeGen.hs:977               TXT3  Line Too Long
Warning   compiler/nativeGen/X86/CodeGen.hs:978               TXT3  Line Too Long
Warning   compiler/nativeGen/X86/CodeGen.hs:982               TXT3  Line Too Long
Warning   compiler/nativeGen/X86/CodeGen.hs:983               TXT3  Line Too Long
Warning   compiler/nativeGen/X86/CodeGen.hs:984               TXT3  Line Too Long
Warning   compiler/nativeGen/X86/CodeGen.hs:988               TXT3  Line Too Long
Warning   compiler/nativeGen/X86/CodeGen.hs:989               TXT3  Line Too Long
Warning   compiler/nativeGen/X86/CodeGen.hs:990               TXT3  Line Too Long
Warning   compiler/nativeGen/X86/CodeGen.hs:994               TXT3  Line Too Long
Warning   compiler/nativeGen/X86/CodeGen.hs:995               TXT3  Line Too Long
Warning   compiler/nativeGen/X86/CodeGen.hs:996               TXT3  Line Too Long
Warning   compiler/nativeGen/X86/CodeGen.hs:1096              TXT3  Line Too Long
Warning   compiler/nativeGen/X86/CodeGen.hs:1120              TXT3  Line Too Long
Warning   compiler/nativeGen/X86/CodeGen.hs:1147              TXT3  Line Too Long
Warning   compiler/nativeGen/X86/CodeGen.hs:1183              TXT3  Line Too Long
Warning   testsuite/tests/codeGen/should_run/cgrun079.hs:26   TXT3  Line Too Long
Unit
No Unit Test Coverage
Build Status
Buildable 22899
Build 53939: [GHC] Linux/amd64: Continuous Integration
Build 53938: [GHC] OSX/amd64: Continuous Integration
Build 53937: [GHC] Windows/amd64: Continuous Integration
Build 53936: arc lint + arc unit
Abhiroop added inline comments.Jun 18 2018, 2:42 PM
compiler/nativeGen/X86/CodeGen.hs
82

This is just a helper datatype, used at line 960 of this file, to avoid rewriting nearly the same function four times, once per arithmetic instruction. It will also be used later for other datatypes like Int32, Word32, etc.
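A minimal sketch of the shape of that helper (the constructor names follow later comments in this review; the exact definition in the diff may differ):

-- The four arithmetic operations as a small enumeration, so a single
-- code-generation function can be parameterised over them instead of
-- being written out four times:
data VectorArithInstns = VADD | VSUB | VMUL | VDIV

A single code-generation function can then case on this type to select the concrete instruction (VADDPS, VSUBPS, ...) rather than duplicating the whole function per operation.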

Abhiroop added inline comments.Jun 18 2018, 2:56 PM
compiler/cmm/CmmExpr.hs
619

> how is register size used?

I mentioned the use case in the comments above. For Cmm like this

_B1::Fx4V128 = XMM1; //CmmAssign

Before my change, the above would break in the Cmm linter. To understand why this would break before, we need to see the notion of weak equality defined for the datatype CmmType: https://github.com/ghc/ghc/blob/master/compiler/cmm/CmmType.hs#L84

> because single precision "Float" is 32-bit, and double precision "Double" is 64-bit ... what happens if xmm has one of those?

At the level at which the linter is working, it is not checking the width. It just takes the width, whether W32 for Float or W64 for Double, and generates a FloatCat: https://github.com/ghc/ghc/blob/master/compiler/cmm/CmmType.hs#L106

The linter works using the weak-equality notion mentioned in the previous paragraph; it doesn't actually check the width.
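For reference, that weak equality looks roughly like this (paraphrased from CmmType.hs; the links above have the authoritative definitions):

cmmEqType_ignoring_ptrhood :: CmmType -> CmmType -> Bool
cmmEqType_ignoring_ptrhood (CmmType c1 w1) (CmmType c2 w2)
  = c1 `weak_eq` c2 && w1 == w2
  where
    FloatCat         `weak_eq` FloatCat         = True
    FloatCat         `weak_eq` _other           = False
    _other           `weak_eq` FloatCat         = False
    (VecCat l1 cat1) `weak_eq` (VecCat l2 cat2) = l1 == l2 && cat1 `weak_eq` cat2
    (VecCat {})      `weak_eq` _other           = False
    _other           `weak_eq` (VecCat {})      = False
    _word1           `weak_eq` _word2           = True  -- GcPtrCat and BitsCat coincide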

carter added inline comments.Jun 18 2018, 5:51 PM
compiler/cmm/CmmExpr.hs
619

wait ... WHY are you setting the register?

what do you mean by XMM1?

Abhiroop added inline comments.Jun 18 2018, 9:27 PM
compiler/cmm/CmmExpr.hs
619

I am not sure what you are asking, but I have specifically not added anything extra.

For the simple lifting of an unlifted type in the code below:

{-# LANGUAGE MagicHash, UnboxedTuples #-}
import GHC.Exts

data FloatX4 = FX4# FloatX4#

instance Show FloatX4 where
  show (FX4# f) = case (unpackFloatX4# f) of
    (# a, b, c, d #) -> show ((F# a), (F# b), (F# c), (F# d))

main :: IO ()
main = print (FX4# (plusFloatX4# (broadcastFloatX4# 1.3#) (broadcastFloatX4# 2.2#)))

The output Cmm looks like this:

[Main.FX4#_entry() //  [XMM1]
         { info_tbls: [(c1NK,
                        label: Main.FX4#_info
                        rep: HeapRep static { Fun {arity: 1 fun_type: ArgSpec 9} }
                        srt: Nothing)]
           stack_info: arg_space: 8 updfr_space: Just 8
         }
     {offset
       c1NK: // global
           _B1::Fx4V128 = XMM1;   // CmmAssign
           goto c1NM;   // CmmBranch
       c1NM: // global
           Hp = Hp + 24;   // CmmAssign
           if (Hp > HpLim) (likely: False) goto c1NO; else goto c1NN;   // CmmCondBranch
       c1NO: // global
           HpAlloc = 24;   // CmmAssign
           goto c1NL;   // CmmBranch
       c1NL: // global
           XMM1 = _B1::Fx4V128;   // CmmAssign
           R1 = Main.FX4#_closure;   // CmmAssign
           call (stg_gc_fun)(XMM1, R1) args: 8, res: 0, upd: 8;   // CmmCall
       c1NN: // global
           // allocHeapClosure
           I64[Hp - 16] = Main.FX4#_con_info;   // CmmStore
           Fx4V128[Hp - 8] = _B1::Fx4V128;   // CmmStore
           _c1NJ::P64 = Hp - 15;   // CmmAssign
           R1 = _c1NJ::P64;   // CmmAssign
           call (P64[Sp])(R1) args: 8, res: 0, upd: 8;   // CmmCall
     }
 },
 section ""data" . Main.FX4#_closure" {
     Main.FX4#_closure:
         const Main.FX4#_info;
 }]

As you can see, the generated Cmm _B1::Fx4V128 = XMM1; // CmmAssign would not pass the Cmm linter before. This change specifically fixes that.

Abhiroop updated this revision to Diff 17011.Jun 19 2018, 8:40 AM
  • Modify unpack to use shuffle operation and add tests for it

Why do we have a normal Cmm expression changing how we do code gen? This seems deeply wrong to me.

compiler/cmm/CmmExpr.hs
619

Ok, you *may* want to also look into inspecting the Cmm before register allocation as well as after.

It sounds like you're only looking at the post-register-allocation Cmm; both should be looked at, I think.

compiler/nativeGen/X86/CodeGen.hs
1154

wait wait wait ...

why do we have "(CmmLit lit)"

What's this argument for? This seems wrong (I don't think we have anything preventing literals from being in a register or something ...), and even once we add / work out support for compile-time constants, they won't / shouldn't be normal Cmm expressions in the Cmm layer; they should be fields in the primop constructor!

Good progress.

Make it a habit to mark comments you consider dealt with as done. That way it's clear to everyone involved what we should still look at!

compiler/cmm/CmmCallConv.hs
123

This conflicts with the comment above!

compiler/nativeGen/Format.hs
108

^

compiler/nativeGen/X86/CodeGen.hs
82

Now we know. But every new person looking at it will wonder the same thing.

Add a short comment to explain this. I would also suggest moving it closer to where it is used. @carter what do you think?

312

^

811

Update the error message here.
Add should now fail because it requires two operands, not because it requires LLVM, I assume.

1098

Assuming the pattern match is exhaustive, why not make this a function that just fills in the instruction?

1149

VPSHUFD requires AVX. IMO it is overly restrictive to require AVX to unpack floats.

The AVX version is probably faster (but verify that!). If so, we should generate different code depending on what the CPU supports.

Also, why is it VMOVUPS but VPSHUFD?
Shouldn't that be dependent on the Format, or at least match between the two?

Abhiroop marked 19 inline comments as done.Jun 19 2018, 10:48 AM
Abhiroop marked an inline comment as done.
Abhiroop marked an inline comment as done.Jun 19 2018, 10:50 AM

Not sure about some bits, but here are some more thoughts.

compiler/cmm/CmmCallConv.hs
123

I thought the ABI was changed for GHC 8.4 or 8.6?! Hrmmm, this looks stale and needs more thought.

compiler/nativeGen/X86/CodeGen.hs
959–1001

negate should be easy ... don't we just do a floating-point bitwise XOR with a mask that's 1 on the sign bit and zero everywhere else?

except ... one issue we're going to need to think about is how all these operations behave for NaN! (I suppose I can help with that after we get everything else working)
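For concreteness, the scalar analogue of that sign-bit trick (a sketch only; negateViaXor is a made-up name, and the vector version would apply the same 0x80000000 mask per lane with XORPS/VXORPS):

import Data.Bits (xor)
import GHC.Float (castFloatToWord32, castWord32ToFloat)

-- Negate a Float by flipping only its sign bit: the per-lane
-- operation that a vector XOR-with-mask would perform.
negateViaXor :: Float -> Float
negateViaXor f = castWord32ToFloat (castFloatToWord32 f `xor` 0x80000000)

main :: IO ()
main = print (negateViaXor 1.5, negateViaXor (-2.25))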

1149

we are going to have both SSE and AVX encodings of most things; the current work simplifies in the following ways:

  1. only handling 4xFloat# vectors
  2. first getting the AVX encoding working (I think the ideal end state is to have the instruction subset toggled by -march-style flags)

later we should figure out how to have per-function-definition "march / ISA flag" pragmas, so end users can toggle per definition (though per module wouldn't be that bad, I guess?)

Abhiroop added inline comments.Jun 19 2018, 9:45 PM
compiler/nativeGen/X86/CodeGen.hs
1149

@AndreasK The PS in VMOVUPS refers to single-precision float and the D in VPSHUFD refers to double word (or 32 bits). So currently it works for 32-bit Floats.

I want this to be dependent on the Format type, but vector instructions seem to have a lot of variations, like VBROADCASTSS, where I have yet to figure out what the SS stands for. For now I am hardcoding and not using the Format info in the pretty printer either, but I would eventually need to think about doing that.

Abhiroop added inline comments.Jun 19 2018, 9:50 PM
compiler/cmm/CmmCallConv.hs
123

I added this change many months back when I started to port with AVX2(256 bits ) instructions and didn't have a full understanding of the problems at hand. I can remove this.

AndreasK added inline comments.Jun 20 2018, 5:43 AM
compiler/nativeGen/X86/CodeGen.hs
1149

> the D in VPSHUFD refers to double word (or 32 bits).

I guess that makes sense as it can also be used for ints.

> VBROADCASTSS, where I have yet to figure out what the SS stands for.

As far as I know, for SIMD instructions the PS postfix usually indicates packed single-precision, while SS stands for scalar single-precision.

It's pretty clear that this holds up for VBROADCAST too if you look at the Intel instruction reference.

  • SS -> Broadcast the low single-precision floating-point element in the source operand to four locations in xmm1.
  • SD -> ... double precision ...
  • F128 -> Broadcast 128 bits of floating-point data ...

As soon as it isn't a scalar being broadcast, the first S is dropped.

Abhiroop added inline comments.Jun 20 2018, 8:24 AM
compiler/nativeGen/X86/CodeGen.hs
1098

I am not sure I follow what you meant here.

Do you mean that instead of using a separate datatype with VADD, VSUB, etc. constructors, we should straightaway have constructors called VADDPS, VSUBPS, and modify the Instr type to accept a type like VectorArithInstns?

AndreasK added inline comments.Jun 20 2018, 9:37 AM
compiler/nativeGen/X86/CodeGen.hs
1098

I was thinking along the lines of this:

code dst = case op of
    -- opcode src2 src1 dst <==> dst = src1 `opcode` src2
    VADD -> arithInstr VADDPS
    VSUB -> arithInstr VSUBPS
    -- etc. for VMUL, VDIV
  where
    arithInstr instr = unitOL (instr format (OpReg reg2) reg1 dst)
Abhiroop added inline comments.Jun 20 2018, 10:33 AM
compiler/nativeGen/X86/CodeGen.hs
312

I have been avoiding making this change because of the getRegisterReg function:

getRegisterReg :: Platform -> Bool -> CmmReg -> Reg

Its second argument is a Bool checking the SSE2 flag. Most of my functions work with the AVX flag, in which case I would have to add another parameter to the function and modify getRegisterReg almost everywhere, because it is frequently used.

I can obviously work around it by sending True as the second argument and adding some hacks around this, but I actually want to extend the getVecRegisterReg function in the future.

Keeping a separate getVecRegisterReg function (which currently takes a parameter for just the AVX flag) allows me to extend just this function later to include all the other families like SSE, SSE2, AVX2, etc.

I was thinking of creating a separate type:

data MicroArch = MicroArch { sse   :: Bool
                           , sse2  :: Bool
                           , avx   :: Bool
                           .... }

and modifying the signature of getVecRegisterReg to:

getVecRegisterReg :: Platform -> MicroArch -> CmmReg -> Reg

Abhiroop updated this revision to Diff 17033.Jun 20 2018, 11:27 AM
  • Add vector negate operation and tests
  • Fix other code review comments
Abhiroop marked 15 inline comments as done.Jun 20 2018, 11:37 AM
Abhiroop updated this revision to Diff 17034.Jun 20 2018, 11:48 AM
  • Add some explanatory comments
Abhiroop marked 7 inline comments as done.Jun 20 2018, 11:50 AM
AndreasK added inline comments.Jun 20 2018, 3:59 PM
compiler/nativeGen/X86/CodeGen.hs
312

> and modify getRegisterReg almost everywhere, because it is frequently used.

There are only about 30 uses which concern us. I get that it's annoying, but most of these call it with True/False, so it should be trivial to update. We might have spent more time discussing this by now than it would take to change it.

> I was thinking of creating a separate type:

That sounds like a great idea! Do you see issues stopping us from just making getRegisterReg take a MicroArch argument instead?
Seems like a much cleaner solution to me.

Maybe we can even get away with just MicroArch = SSE2 | SSE4 | SSE4_2 | ...?
I think there is no architecture which supports newer SIMD extensions but drops old ones. So we can define Ord and just do
(if arch >= sse2) then cleanSolution else fpStackHack and the like.
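A minimal self-contained sketch of that idea (the constructor set and the dispatch function are illustrative, not from the patch):

-- Newer extension levels are declared later, so the derived Ord
-- directly encodes "supports at least this level":
data MicroArch = SSE | SSE2 | SSE4 | SSE4_2 | AVX | AVX2
  deriving (Eq, Ord, Show)

-- Purely illustrative dispatch on the minimum required level:
describeCodePath :: MicroArch -> String
describeCodePath arch
  | arch >= SSE2 = "clean SSE2 float code"
  | otherwise    = "x87 FP-stack fallback"

main :: IO ()
main = putStrLn (describeCodePath AVX)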

For SSE we often have this pattern:

use_sse2 <- sse2Enabled
....
getRegisterReg ... use_sse2

Then we could do something like below instead:

sse_level <- sseSupport :: m MicroArch
...
getRegisterReg pf sse_level reg

But I also understand the desire/need to get this working before you think about clean code.
I suggest leaving it as is for now, and once it's time to extend getVecRegisterReg, replacing it with an extended version of getRegisterReg instead.

I've been playing around with code on this branch. If I want to merge code into it, how might I go about it?

Also, I've discovered my CPU doesn't actually have AVX512, but it has AVX2, so I'm looking to be able to write code against AVX2 flags.

Abhiroop added inline comments.Jun 21 2018, 8:39 AM
compiler/nativeGen/X86/CodeGen.hs
312

> Maybe we can even get away with just MicroArch = SSE2 | SSE4 | SSE4_2 | ...?

I had initially thought of something along almost exactly the same lines, defining an Ord instance for the datatype. But in an old discussion, @carter mentioned that it is not always true that each new family of microarchitectures is a superset of the older families; they might not always support the previous microarchitectures. I recall him mentioning that perhaps not Intel or AMD machines, but certain custom x86 architectures, have that property.

Also, making MicroArch a sum type would mean checking or working with one flag at a time. I thought of a record type because there will be situations where multiple flags are switched on, like MicroArch{ sse4 = True, avx = True, sse1 = False, ..... }, and the register allocation would have fancier logic taking this into account. Although this might be me overthinking. :)

I put a test PR up to see if I can make a PR. Looks like the base is different from that in D4813:

https://github.com/Abhiroop/ghc-1/pull/1

Abhiroop updated this revision to Diff 17098.Jun 26 2018, 5:02 PM
  • Rewrite the pretty printer to use the Format information and generalize the instructions
  • Add support for packing, unpacking, broadcasting doubles
  • Add support for arithmetic operations on doubles
  • Add tests for the above and minor changes in the linter and other places to accommodate doubles
Abhiroop retitled this revision from [WIP] Add support for broadcastFloatX4# and unpackFloatX4# operations to [WIP] Add support for SIMD operations in the NCG.Jun 26 2018, 6:09 PM
Abhiroop edited the summary of this revision. (Show Details)
Abhiroop added inline comments.Jun 26 2018, 6:38 PM
compiler/cmm/CmmExpr.hs
618

This part currently works, but it is not fully correct. This function maps an STG register to the respective CmmType. It initially stated cmmVec 4 (cmmBits W32), which was incorrect because the register can possibly hold 16 Int8s or 8 Int16s, etc.

Even now it states that an XMM register can hold only two 64-bit doubles. While this change makes the code run, we need to somehow modify the GlobalReg type to capture more information, or at least pass more info (like length and width) to the function globalRegType, so that a correct CmmType can be generated which covers all the cases.

compiler/nativeGen/X86/CodeGen.hs
1154

> What's this argument for?

@carter, when you mentioned that this should be part of a primop constructor, note that in this operation I am using CmmLit just to represent an offset value. So are you suggesting that the offset value should be part of the primop constructor, making it MO_VF_Extract Length Width Offset? Don't you think that is leaking details of the assembly into the constructors of Cmm, the very thing we were discussing that we want to avoid?

Maybe I didn't get what you meant, but I think your comment was just for the shuffle operations, where we should have something along the lines of MO_V_Shuffle Length Width **Lit**, where the third field of the constructor is the compile-time literal.

> This seems wrong (I don't think we have anything preventing literals from being in a register or something ...),

If you are asking about the utility of CmmLit, which is a constructor of the CmmExpr datatype, I am not sure what purpose it serves. It existed before and is used in a number of places in the code generator. Also, nothing prevents literals from being in registers. Maybe someone with a better idea of the code generator can answer what purpose CmmLit serves.
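For readers following the thread, the types in question look roughly like this (paraphrased and trimmed from compiler/cmm/CmmExpr.hs):

data CmmExpr
  = CmmLit CmmLit              -- a compile-time literal
  | CmmLoad CmmExpr CmmType    -- read a memory location
  | CmmReg CmmReg              -- contents of a register
  | CmmMachOp MachOp [CmmExpr] -- machine operation (+, -, *, ...)
  | CmmRegOff CmmReg Int       -- register value plus offset
  -- (plus a stack-slot constructor)

data CmmLit
  = CmmInt Integer Width
  | CmmFloat Rational Width
  | CmmLabel CLabel
  -- (plus label-offset and stack-mark constructors)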

Abhiroop updated this revision to Diff 17115.Jun 28 2018, 4:35 PM
  • Disable the liveness analysis for vector operations when running with -O2
  • Add tests for packing and unpacking with -O2
Abhiroop updated this revision to Diff 17134.Jun 29 2018, 10:24 AM
  • Re-enable the liveness analysis of vector operations
  • Change the constructors and other handlers of packing operations
  • Temporarily disable inlining for vector operations
Abhiroop added inline comments.Jun 29 2018, 10:42 AM
compiler/cmm/CmmSink.hs
454 ↗(On Diff #17134)

@carter @AndreasK

I modified the arguments of MO_VF_Insert from something like %MO_VF_Insert_4_W32(1.5 :: W32, 0 :: W32); to something like %MO_VF_Insert_4_W32(_c1LB::Fx4V128, 9.2 :: W32, 0 :: W32); where the first argument refers to the Cmm register where the packing is taking place.

While this allows me to remove my liveness hack, it gives rise to another issue: the Cmm liveness analysis has a noLiveOnEntry check which doesn't allow us to enter a block with a live register like _c1LB::Fx4V128. The noLiveOnEntry check is switched on only for -O2. So for -O2 to work I have to emit something like

_c1LB::Fx4V128 = <0.0 :: W32, 0.0 :: W32, 0.0 :: W32,
                  0.0 :: W32>;   // CmmAssign
_c1LB::Fx4V128 = %MO_VF_Insert_4_W32(_c1LB::Fx4V128, 9.2 :: W32,
                                     0 :: W32);   // CmmAssign
_c1LB::Fx4V128 = %MO_VF_Insert_4_W32(_c1LB::Fx4V128, 8.15 :: W32,
                                     80 :: W32);   // CmmAssign
_c1LB::Fx4V128 = %MO_VF_Insert_4_W32(_c1LB::Fx4V128, 7.0 :: W32,
                                     160 :: W32);   // CmmAssign
_c1LB::Fx4V128 = %MO_VF_Insert_4_W32(_c1LB::Fx4V128, 6.4 :: W32,
                                     240 :: W32);   // CmmAssign

The first line means I have to end up XORing the register in assembly anyway. And it raises another complication with -O2: it tries to inline, and the above Cmm becomes:

_c1LB::Fx4V128 = %MO_VF_Insert_4_W32(<0.0 :: W32, 0.0 :: W32, 0.0 :: W32,
                  0.0 :: W32>, 9.2 :: W32,
                                     0 :: W32);   // CmmAssign
_c1LB::Fx4V128 = %MO_VF_Insert_4_W32(_c1LB::Fx4V128, 8.15 :: W32,
                                     80 :: W32);   // CmmAssign

_c1LB::Fx4V128 = %MO_VF_Insert_4_W32( %MO_VF_Insert_4_W32(_c1LB::Fx4V128, 7.0 :: W32,
                                     160 :: W32);, 6.4 :: W32,
                                     240 :: W32);   // CmmAssign

It not only inlines <0.0 :: W32, 0.0 :: W32, 0.0 :: W32, 0.0 :: W32> but also starts inlining the constructors, which means the pattern matches in the code generator start failing.

So the above change temporarily disables inlining for vector operations. Let me know if you have any better suggestions.

AndreasK added inline comments.Jun 30 2018, 8:11 AM
compiler/cmm/CmmSink.hs
454 ↗(On Diff #17134)

> which doesn't allow us to enter a block with a live register

You can avoid this by replacing one insert with a MOV instruction, like we talked about on IRC.

> the pattern matches in the code generator start failing.

Is this because you expect the arguments to the MachOp to already be registers? You should be able to handle that case using getSomeReg. I don't think we should special-case this so much in the Cmm stage; rather, just evaluate the expression into a register during X86 codegen when it's not already one.
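Concretely, the usual pattern in X86/CodeGen.hs is a fragment like the one below (a sketch; getSomeReg is the existing helper that evaluates an arbitrary CmmExpr into some register, while valExpr, valReg, and valCode are hypothetical names):

-- Inside NatM: force the operand into a register first, then emit
-- the instruction against that register.
(valReg, valCode) <- getSomeReg valExpr
let code dst = valCode `snocOL` MOV format (OpReg valReg) (OpReg dst)
return (Any format code)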

Abhiroop updated this revision to Diff 17237.Jul 8 2018, 3:28 PM
  • Enrich STG vector registers with length, width, and register type information
  • Add support for the SSE family of vector instructions
carter added a comment.Jul 9 2018, 2:02 PM

just did a first pass on the latest changes, though I only made my way through the first 50% :)

compiler/cmm/CmmCallConv.hs
78

you can definitely define a helper function instead of duplicating the code here

compiler/cmm/CmmExpr.hs
446

the same issue applies to YMM and ZMM;

after all, XMM registers are just the lower 128 bits of those!

557–558

THIS LOOKS WRONG

compiler/codeGen/StgCmmPrim.hs
1653

a TODO we can address some other time, because there are C intrinsics per platform

Abhiroop marked 7 inline comments as done.Jul 9 2018, 8:11 PM
Abhiroop added inline comments.
compiler/cmm/CmmExpr.hs
446

As I was experimenting with this representation, I wanted to keep the surface area of the experiment as small as possible. But yes, I will eventually extend this to YMM and ZMM registers.

557–558

Yes, I didn't understand what this comparison means!

Abhiroop updated this revision to Diff 17250.Jul 9 2018, 8:25 PM
  • Remove the inlining hack and allow -O2 to fully work for vectors
Abhiroop updated this revision to Diff 17313.Jul 14 2018, 12:01 AM
  • Modify all pack/broadcast variants to accept any CmmExpr argument and rewrite codegen for that
  • Modify the StgCmmPrim definition of broadcast
  • Modify CmmLint condition for broadcast
bgamari updated this revision to Diff 18170.Oct 1 2018, 9:46 PM

Rebase and address comments:

  • testsuite: Add tests for vector load/store operations
  • testsuite: Clean all codeGen/should_run/all.T
  • Refactor INSERT
  • Add bangs
  • Rename VectorArithInstns
  • Add type signatures and make names more consistent
  • Add a note documenting our treatment of SIMD registers
  • Refactor treatment of XMM registers
  • Remove redundant Maybes
compiler/cmm/CmmExpr.hs
447

I agree with @AndreasK: It's quite unclear why this is split across two Maybes.

I'm also generally skeptical of this notion of attaching further information to GlobalRegs. This is rather inconsistent with the current story of attaching formats to Instrs.

501

The Ord instance ignores the two Maybes whereas the Eq does not. This is really surprising and generally a sign that we have some design issues here.

compiler/nativeGen/Format.hs
53

It wouldn't hurt to put bangs here.

compiler/nativeGen/X86/CodeGen.hs
520

Let's make these names more descriptive.

547

These names are a bit terse and could perhaps use type annotations.

554

Here too.

725

Perhaps let's improve this panic a bit: "The offset passed to insert operations must be statically known."

857

We should probably rename this to include an _avx suffix.

960

_avx suffix here.

1155

It sounds like we could do this without spilling the value out to memory. However, let's just add a TODO to avoid holding up the patch.

1235

Let's factor out this pattern instead of duplicating the expression.

bgamari updated this revision to Diff 18176.Oct 2 2018, 10:46 AM
  • Another small cleanup
bgamari added inline comments.Oct 2 2018, 11:03 AM
compiler/cmm/CmmExpr.hs
447

@AndreasK, @carter, and I discussed this at ICFP. The design is certainly a bit of a compromise, but I've largely made my peace with it; ultimately I don't see a better way. I have, however, cleaned up the story a bit and have tried to put the scheme on a more principled footing.

Essentially, we view SIMD registers like we view floating point registers: Every possible format the register might hold (e.g. length/width/number type in the case of a SIMD register; float/double in the case of an FP register) is considered to be a distinct register. That is to say that the machine's %xmm1 register has several names in GHC's NCG: XmmReg 1 1 W128 Integer, XmmReg 1 1 W128 Float, XmmReg 1 2 W64 Integer, XmmReg 1 2 W64 Float, etc. During register allocation these are all considered to be the same register, as described in Note [Overlapping global registers]. I have described this in greater detail in Note [SIMD registers].

The reason that this is necessary is that the NCG generally assumes that each GlobalReg has a single, known CmmType (observable via cmmRegType).
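In code, the scheme reads roughly as follows (the constructor shape is inferred from the register names above; GlobalVecRegTy is a stand-in name, and this is not necessarily the final patch):

-- One machine register, several GHC-level names: the register number
-- is shared, while lane count/width/type pick the current view.
data GlobalReg
  = -- ... existing constructors (VanillaReg, FloatReg, DoubleReg, ...)
    XmmReg !Int            -- machine register number: %xmm1 ~ XmmReg 1 ...
           !Length         -- number of lanes
           !Width          -- width of one lane
           !GlobalVecRegTy -- lane type

data GlobalVecRegTy = Integer | Float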

compiler/nativeGen/X86/CodeGen.hs
312

I think a simple MicroArch product would already be a great improvement over the status quo.

Harbormaster edited this Differential Revision.Oct 2 2018, 8:04 PM