[WIP] Add support for SIMD operations in the NCG
Needs ReviewPublic

Authored by Abhiroop on Jun 7 2018, 3:42 PM.



This adds support for constructing vector types from Float#, Double#, etc. and performing arithmetic operations on them.

  • Modify unpack to use shuffle operation and add tests for it

Why do we have a normal Cmm expression changing how we do code gen? This seems deeply wrong to me.

589 ↗(On Diff #16815)

Ok, you *may* want to also look into inspecting the Cmm before register allocation as well as after.

It sounds like you're only looking at the post-register-allocation Cmm; both should be looked at, I think.


wait wait wait ...

why do we have "(CmmLit lit)"?

What's this argument for? This seems wrong (I don't think we have anything preventing literals from being in a register or something ...), and even once we add / work out support for compile-time constants, they won't / shouldn't be normal CmmExprs in the Cmm layer; they should be fields in the primop constructor!

Good progress.

Make it a habit to mark comments you consider dealt with as done. That way it's clear to everyone involved what we should still look at!


This conflicts with the comment above!




Now we know. But every new person looking at it will wonder the same thing.

Add a short comment to explain this. I would also suggest moving it closer to where it is used. @carter what do you think?




Update the error message here.
Add should now fail because it requires two operands, not because it requires LLVM, I assume.


Assuming the pattern match is exhaustive, why not make this a function that just fills in the instruction?


VPSHUFD requires AVX. IMO it is overly restrictive to require AVX to unpack floats.

The AVX version is probably faster (but verify that!). If so, we should generate different code depending on what the CPU supports.

Also, why is it VMOVUPS but VPSHUFD?
Shouldn't that be dependent on the Format, or at least match between the two?

Abhiroop marked 19 inline comments as done.Jun 19 2018, 10:48 AM
Abhiroop marked an inline comment as done.
Abhiroop marked an inline comment as done.Jun 19 2018, 10:50 AM

Not sure about some bits, but here are some more thoughts.


I thought the ABI was changed for GHC 8.4 or 8.6?! Hrmmm, this looks stale and needs more thought.


Negate should be easy ... don't we just do a floating-point bitwise XOR with a mask that's 1 on the sign bit and zero everywhere else?

Except ... one issue we're going to need to think about is how all these operations behave for NaN! (I suppose I can help with that after we get everything else working.)
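As a sanity check of the sign-bit trick, here is a small, hedged sketch in plain Haskell (negateViaXor and the 0x80000000 mask are illustrative, not the NCG's actual code). It mirrors lane-wise what a vectorised XORPS against a sign-bit mask would do, and also shows the NaN behaviour: XOR flips the NaN's sign bit but the result is still a NaN.

```haskell
import Data.Bits (xor)
import GHC.Float (castFloatToWord32, castWord32ToFloat)

-- Negate a Float by XORing its sign bit; the other 31 bits are untouched.
negateViaXor :: Float -> Float
negateViaXor x = castWord32ToFloat (castFloatToWord32 x `xor` 0x80000000)

main :: IO ()
main = do
  print (negateViaXor 1.5)            -- -1.5
  print (negateViaXor (-2.0))         -- 2.0
  print (isNaN (negateViaXor (0/0)))  -- True: XOR keeps a NaN a NaN
```

Note that, unlike `0 - x`, this approach also negates signed zeros and NaN payloads bit-exactly, which is why SIMD backends prefer it.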


We are gonna have both SSE and AVX encodings of most things; the current work is simplifying in the following ways:

  1. only handling 4xFloat# vectors
  2. first getting the AVX encoding working (I think the ideal end state is to have the instruction subset toggled by -march style flags)

Later we should figure out how to have per-function-definition "march / ISA flag" pragmas, so end users can toggle per definition (though per module wouldn't be that bad, I guess?).

Abhiroop added inline comments.Jun 19 2018, 9:45 PM

@AndreasK The PS in VMOVUPS refers to packed single-precision float, and the D in VPSHUFD refers to double word (i.e. 32 bits). So currently it works for 32-bit Floats.

I want this to be dependent on the Format type, but vector instructions seem to have a lot of variations, like VBROADCASTSS, whose SS suffix I have yet to figure out. For now I am hardcoding and not using the Format info in the pretty printer either, but I will eventually need to think about doing that.

Abhiroop added inline comments.Jun 19 2018, 9:50 PM

I added this change many months back when I started to work with AVX2 (256-bit) instructions and didn't have a full understanding of the problems at hand. I can remove this.

AndreasK added inline comments.Jun 20 2018, 5:43 AM

VPSHUFD refers to double word (or 32 bits).

I guess that makes sense as it can also be used for ints.

VBROADCASTSS which I am yet to figure out what the SS stands for.

As far as I know for SIMD instructions the PS postfix usually indicates packed single-precision while SS stands for scalar single-precision.

It's pretty clear that this holds up for VBROADCAST too if you look at the Intel instruction reference.

  • SS -> Broadcast the low single-precision floating-point element in the source operand to four locations in xmm1.
  • SD -> ... double precision ...
  • F128 -> Broadcast 128 bits of floating-point data ...

As soon as it isn't a scalar being broadcast the first S is dropped.
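To make the suffix convention concrete, here is a small illustrative sketch (simdSuffix and this two-constructor Width are hypothetical helpers, not the pretty-printer's actual code) of how the mnemonic suffix could be derived from packedness and element width instead of being hardcoded:

```haskell
data Width = W32 | W64 deriving (Eq, Show)

-- First letter: P (packed) vs S (scalar); second: S (single) vs D (double).
simdSuffix :: Bool -> Width -> String
simdSuffix packed w = [if packed then 'P' else 'S', elemLetter]
  where
    elemLetter = case w of
      W32 -> 'S'  -- single precision
      W64 -> 'D'  -- double precision

-- simdSuffix True  W32 == "PS"  (e.g. VADDPS)
-- simdSuffix False W64 == "SD"  (e.g. VADDSD)
```

A Format-driven helper like this is roughly what would let VMOVUPS and VPSHUFD-style choices fall out of the Format rather than being spelled per instruction.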

Abhiroop added inline comments.Jun 20 2018, 8:24 AM

I am not sure I follow what you mean here.

Do you mean that instead of using a separate datatype with VADD, VSUB, etc. constructors, we straightaway have constructors called VADDPS, VSUBPS and modify the Instr type to accept a type like VectorArithInstns?

AndreasK added inline comments.Jun 20 2018, 9:37 AM

I was thinking along these lines:

code dst = case op of
    -- opcode src2 src1 dst <==> dst = src1 `opcode` src2
    VADD -> arithInstr VADDPS
    VSUB -> arithInstr VSUBPS
  where
    arithInstr instr = unitOL (instr format (OpReg reg2) reg1 dst)
Abhiroop added inline comments.Jun 20 2018, 10:33 AM

I have been avoiding this change because of the getRegisterReg function:

getRegisterReg :: Platform -> Bool -> CmmReg -> Reg

Its second argument is a Bool checking the SSE2 flag. Most of my functions work with the AVX flag, in which case I would have to add another parameter to the function and modify getRegisterReg almost everywhere, because it is frequently used.

I can obviously work around it by passing True as the second argument and adding some hacks around this, but I actually want to extend the getVecRegisterReg function in the future.

Keeping a separate getVecRegisterReg function (which currently has a parameter for just the AVX flag) allows me to extend just this function later to include all the other families like SSE, SSE2, AVX2, etc.

I was thinking of creating a separate type:

data MicroArch = MicroArch { sse   :: Bool
                           , sse2  :: Bool
                           , avx   :: Bool
                           .... }

and modifying the signature of getVecRegisterReg to:

getVecRegisterReg :: Platform -> MicroArch -> CmmReg -> Reg

Abhiroop updated this revision to Diff 17033.Jun 20 2018, 11:27 AM
  • Add vector negate operation and tests
  • Fix other code review comments
Abhiroop marked 15 inline comments as done.Jun 20 2018, 11:37 AM
Abhiroop updated this revision to Diff 17034.Jun 20 2018, 11:48 AM
  • Add some explanatory comments
Abhiroop marked 7 inline comments as done.Jun 20 2018, 11:50 AM
AndreasK added inline comments.Jun 20 2018, 3:59 PM

and modify getRegisterReg almost everywhere, because it is frequently used.

There are only about 30 uses which concern us.
I get that it's annoying, but most of these call it with True/False, so it should be trivial to update. We might have spent more time discussing this by now than it would take to change it.

I was thinking of creating a separate type:

That sounds like a great idea! Do you see issues stopping us from just making getRegisterReg take a MicroArch argument instead?
Seems like a much cleaner solution to me.

Maybe we can even get away with just MicroArch = SSE2 | SSE4 | SSE4_2 | ...?
I think there is no architecture which supports newer SIMD extensions but drops old ones. So we can define Ord and just do
(if arch >= sse2 then cleanSolution else fpStackHack) and the like.
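A minimal sketch of that Ord idea (constructor set and function names are illustrative, and it assumes the extension sets really are totally ordered, an assumption questioned later in this thread):

```haskell
-- Assumes every CPU supporting a newer extension also supports the older ones.
data MicroArch = SSE | SSE2 | SSE4 | SSE4_2 | AVX | AVX2
  deriving (Eq, Ord, Show, Bounded, Enum)

-- Pick a scalar floating-point strategy from the supported level.
scalarFpStrategy :: MicroArch -> String
scalarFpStrategy arch
  | arch >= SSE2 = "cleanSolution"  -- use SSE2 scalar FP instructions
  | otherwise    = "fpStackHack"    -- fall back to the x87 FP stack
```

The derived Ord gives the `arch >= SSE2` comparison for free, which is the whole appeal of the sum-type encoding over a bag of Bools.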

We often have the pattern for sse of:

use_sse2 <- sse2Enabled
getRegisterReg ... use_sse2

Then we could do something like this instead:

sse_level <- sseSupport :: m MicroArch
getRegisterReg pf sse_level reg

But I also understand the desire/need to get this working before you think about clean code.
I suggest leaving it as is for now, and once it's time to extend getVecRegisterReg, replacing it with an extended version of getRegisterReg instead.

I've been playing around with code on this branch. If I want to merge code into it, how might I go about it?

Also, I've discovered my CPU doesn't actually have AVX512, but it has AVX2, so I'm looking to be able to write code against AVX2 flags.

Abhiroop added inline comments.Jun 21 2018, 8:39 AM

Maybe we can even get away with just MicroArch = SSE2 | SSE4 | SSE4_2 | ...?

I had initially thought of something almost exactly along the same lines, and then of defining an Ord instance for the datatype. But in an old discussion @carter actually mentioned that it is not always true that each new family of microarchitectures is a superset of the older families; they might not always support the previous microarchitectures. I recall him mentioning that, while perhaps not the Intel or AMD machines, certain custom x86 architectures have that property.

Also, making MicroArch a sum type would imply checking or working with one flag at a time. However, I thought of a record type because there will be situations where multiple flags are switched on, like MicroArch{ sse4 = True, avx = True, sse1 = False, ..... }, and the register allocation would have fancier logic taking this into account. Although this might be me overthinking. :)

I put up a test PR to see if I can make one. It looks like the base is different from that of D4813.


Abhiroop updated this revision to Diff 17098.Jun 26 2018, 5:02 PM
  • Rewrite the pretty printer to use the Format information and generalize the instructions
  • Add support for packing, unpacking, broadcasting doubles
  • Add support for arithmetic operations on doubles
  • Add tests for the above and minor changes in the linter and other places to accommodate doubles
Abhiroop retitled this revision from [WIP] Add support for broadcastFloatX4# and unpackFloatX4# operations to [WIP] Add support for SIMD operations in the NCG.Jun 26 2018, 6:09 PM
Abhiroop edited the summary of this revision. (Show Details)
Abhiroop added inline comments.Jun 26 2018, 6:38 PM
588 ↗(On Diff #17098)

This part currently works, but it is not fully correct. This function maps an STG register to the respective CmmType. It initially stated cmmVec 4 (cmmBits W32), which was incorrect because the register can possibly hold 16 Int8s or 8 Int16s, etc.

And even now it states that an XMM register can hold only two 64-bit doubles. While this change makes the code run, we need to somehow modify the GlobalReg type to capture more information, or at least pass more info (like length and width) to the function globalRegType, so that a correct CmmType covering all the cases can be generated.


"What's this argument for?"

@carter, when you mentioned that this should be part of a primop constructor, do note that in this operation I am using CmmLit just to represent an offset value. So are you suggesting that the offset value should be part of the primop constructor, actually making it MO_VF_Extract Length Width Offset? Don't you think that is leaking details of the assembly into the constructors of Cmm, the very thing we were discussing that we want to avoid?

Maybe I didn't get what you meant, but I think your comment was just about the shuffle operations, where we should have something along the lines of MO_V_Shuffle Length Width **Lit**, where the third field of the constructor is the compile-time literal.

"This seems wrong (I don't think we have anything preventing literals from being in a register or something ...)"

If you are asking about the utility of CmmLit, which is a constructor of the CmmExpr datatype, I am not sure what purpose it serves. It existed before and is used in a number of places in the code generator. Also, nothing prevents the literals from being in registers. Maybe someone with a better idea of the code generator can answer what purpose CmmLit serves.

Abhiroop updated this revision to Diff 17115.Jun 28 2018, 4:35 PM
  • Disable the liveness analysis for vector operations when running with -O2
  • Add tests for packing and unpacking with -O2
Abhiroop updated this revision to Diff 17134.Jun 29 2018, 10:24 AM
  • Re-enable the liveness analysis of vector operations
  • Change the constructors and other handlers of packing operations
  • Temporarily disable inlining for vector operations
Abhiroop added inline comments.Jun 29 2018, 10:42 AM
454 ↗(On Diff #17134)

@carter @AndreasK

I modified the arguments of MO_VF_Insert from something like %MO_VF_Insert_4_W32(1.5 :: W32, 0 :: W32); to something like %MO_VF_Insert_4_W32(_c1LB::Fx4V128, 9.2 :: W32, 0 :: W32);, where the first argument refers to the Cmm register where the packing is taking place.

While this allows me to remove my liveness hack, it gives rise to another issue: the Cmm liveness analysis has a noLiveOnEntry check which doesn't allow us to enter a block with a live register like _c1LB::Fx4V128. The noLiveOnEntry check is switched on only for -O2. So for -O2 to work I have to define something like

_c1LB::Fx4V128 = <0.0 :: W32, 0.0 :: W32, 0.0 :: W32,
                  0.0 :: W32>;   // CmmAssign
_c1LB::Fx4V128 = %MO_VF_Insert_4_W32(_c1LB::Fx4V128, 9.2 :: W32,
                                     0 :: W32);   // CmmAssign
_c1LB::Fx4V128 = %MO_VF_Insert_4_W32(_c1LB::Fx4V128, 8.15 :: W32,
                                     80 :: W32);   // CmmAssign
_c1LB::Fx4V128 = %MO_VF_Insert_4_W32(_c1LB::Fx4V128, 7.0 :: W32,
                                     160 :: W32);   // CmmAssign
_c1LB::Fx4V128 = %MO_VF_Insert_4_W32(_c1LB::Fx4V128, 6.4 :: W32,
                                     240 :: W32);   // CmmAssign

The first line means that I end up XORing the register in assembly anyway. And it raises another complication with -O2: it tries to inline, and the above Cmm becomes:

_c1LB::Fx4V128 = %MO_VF_Insert_4_W32(<0.0 :: W32, 0.0 :: W32, 0.0 :: W32,
                                      0.0 :: W32>, 9.2 :: W32,
                                     0 :: W32);   // CmmAssign
_c1LB::Fx4V128 = %MO_VF_Insert_4_W32(_c1LB::Fx4V128, 8.15 :: W32,
                                     80 :: W32);   // CmmAssign

_c1LB::Fx4V128 = %MO_VF_Insert_4_W32(%MO_VF_Insert_4_W32(_c1LB::Fx4V128, 7.0 :: W32,
                                     160 :: W32), 6.4 :: W32,
                                     240 :: W32);   // CmmAssign

It not only inlines the literal <0.0 :: W32, 0.0 :: W32, 0.0 :: W32, 0.0 :: W32> but also starts inlining the constructors, which means the pattern matches in the code generator start failing.

So the above change disables inlining temporarily for vector operations. Let me know if you have any better suggestions.

AndreasK added inline comments.Jun 30 2018, 8:11 AM
454 ↗(On Diff #17134)

which doesn't allow us to enter a block with a live register

You can avoid this by replacing one insert with a mov instruction, like we talked about on IRC.

which means the pattern matches in the code generator start failing.

Is this because you expect the arguments to the MachOp to already be registers? You should be able to handle that case using getSomeReg. I don't think we should special-case this so much in the Cmm stage; rather, just evaluate the expression into a register during X86 codegen when it's not already in one.

Abhiroop updated this revision to Diff 17237.Jul 8 2018, 3:28 PM
  • Enrich STG vector registers with length, width, and register type information
  • Add support for the SSE family of vector instructions
carter added a comment.Jul 9 2018, 2:02 PM

Just did a first pass on the latest changes, though I only made my way through the first 50% :)


You can definitely define a helper function instead of duplicating the code here.

426 ↗(On Diff #17237)

The same issue applies to YMM and ZMM;

after all, XMM registers are just the lower 128 bits of those!

535 ↗(On Diff #17237)



a TODO we can address some other time, because there are C intrinsics per platform

Abhiroop marked 7 inline comments as done.Jul 9 2018, 8:11 PM
Abhiroop added inline comments.
426 ↗(On Diff #17237)

As I was experimenting with this representation, I wanted to keep the surface area of the experiment as small as possible. But yes, I will eventually extend this to YMM and ZMM registers.

535 ↗(On Diff #17237)

Yes, I didn't understand what this comparison means!

Abhiroop updated this revision to Diff 17250.Jul 9 2018, 8:25 PM
  • Remove the inlining hack and allow -O2 to fully work for vectors
Abhiroop updated this revision to Diff 17313.Jul 14 2018, 12:01 AM
  • Modify all pack/broadcast variants to accept any CmmExpr argument and rewrite codegen for that
  • Modify the StgCmmPrim definition of broadcast
  • Modify CmmLint condition for broadcast
bgamari updated this revision to Diff 18170.Oct 1 2018, 9:46 PM

Rebase and address comments:

  • testsuite: Add tests for vector load/store operations
  • testsuite: Clean all codeGen/should_run/all.T
  • Refactor INSERT
  • Add bangs
  • Rename VectorArithInstns
  • Add type signatures and make names more consistent
  • Add a note documenting our treatment of SIMD registers
  • Refactor treatment of XMM registers
  • Remove redundant Maybes
427 ↗(On Diff #17313)

I agree with @AndreasK: It's quite unclear why this is split across two Maybes.

I'm also generally skeptical of this notion of attaching further information to GlobalRegs. This is rather inconsistent with the current story of attaching formats to Instrs.

482 ↗(On Diff #17313)

The Ord instance ignores the two Maybes whereas the Eq does not. This is really surprising and generally a sign that we have some design issues here.


It wouldn't hurt to put bangs here.


Let's make these names more descriptive.


These names are a bit terse and could perhaps use type annotations.


Here too.


Perhaps let's improve this panic a bit: "The offset passed to insert operations must be statically known."


We should probably rename this to include an _avx suffix.


_avx suffix here.


It sounds like we could do this without spilling the value out to memory. However, let's just add a TODO to avoid holding up the patch.


Let's factor out this pattern instead of duplicating the expression.

bgamari updated this revision to Diff 18176.Oct 2 2018, 10:46 AM
  • Another small cleanup
bgamari added inline comments.Oct 2 2018, 11:03 AM
427 ↗(On Diff #17313)

@AndreasK, @carter, and I discussed this at ICFP. The design is certainly a bit of a compromise but I've largely come to peace with this design; ultimately I don't see a better way. I have, however, cleaned up the story a bit and have tried to put the scheme on a more principled footing.

Essentially, we view SIMD registers like we view floating point registers: Every possible format the register might hold (e.g. length/width/number type in the case of a SIMD register; float/double in the case of an FP register) is considered to be a distinct register. That is to say that the machine's %xmm1 register has several names in GHC's NCG: XmmReg 1 1 W128 Integer, XmmReg 1 1 W128 Float, XmmReg 1 2 W64 Integer, XmmReg 1 2 W64 Float, etc. During register allocation these are all considered to be the same register, as described in Note [Overlapping global registers]. I have described this in greater detail in Note [SIMD registers].

The reason that this is necessary is that the NCG generally assumes that each GlobalReg has a single, known CmmType (observable via cmmRegType).
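A toy sketch of that scheme (constructor and field names are illustrative, not GHC's actual types): each (length, width, number-type) view of a physical register is a distinct GlobalReg-style value, so each has a single known CmmType, while the register allocator collapses all views of the same %xmmN, as in Note [Overlapping global registers].

```haskell
data NumType = IntNT | FloatNT deriving (Eq, Show)

-- One "register" per physical XMM register *and* per format it is viewed at.
data XmmReg = XmmReg
  { xmmNo    :: Int      -- physical register number, e.g. 1 for %xmm1
  , xmmLen   :: Int      -- number of lanes
  , xmmWidth :: Int      -- lane width in bits
  , xmmTy    :: NumType  -- integer or floating-point lanes
  } deriving (Eq, Show)

-- The register allocator treats all views of %xmmN as the same register.
samePhysicalReg :: XmmReg -> XmmReg -> Bool
samePhysicalReg a b = xmmNo a == xmmNo b
```

So XmmReg 1 4 32 FloatNT and XmmReg 1 2 64 IntNT are different registers to cmmRegType but the same register to the allocator, which is exactly the compromise described above.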


I think a simple MicroArch product would already be a great improvement over the status quo.

Harbormaster edited this Differential Revision.Oct 2 2018, 8:04 PM
Abhiroop updated this revision to Diff 18886.Nov 26 2018, 4:27 AM
  • Add Xmm changes to LLVM backend

Some unwanted changes related to CircleCI have landed on this diff; I'm not sure how. I will remove them later. Please ignore the first two files (config.yml and prepare-system.sh).