[WIP] Add support for SIMD operations in the NCG
Needs ReviewPublic

Authored by Abhiroop on Jun 7 2018, 3:42 PM.



This adds support for constructing vector types from Float#, Double#, etc. and performing arithmetic operations on them.


It's nice how much duplication disappeared with that change. Good job :)


ScalarFormat and the non-vector constructors of Format describe the same thing. Maybe you can come up with a way to combine them.


Update this to support vectors as well.


Call this from cmmTypeFormat.

Also given the names of floatFormat and intFormat it should probably be called vecFormat.


Can we move this functionality into getRegisterReg?


I think you will need to add a case handling vectors here.


If there is a simd "negate vector" instruction, this would probably be the place to use it.

Abhiroop updated this revision to Diff 16835.Jun 10 2018, 2:10 PM
  • Add support for vector addition, multiplication and subtraction
  • Add tests for the arithmetic operations
  • Fix vector width logic
  • Fix minor compiler warnings and remove unnecessary changes
Abhiroop updated this revision to Diff 16873.Jun 12 2018, 4:46 PM
  • Add support for float division
  • Add support for packFloatX4#
  • Introduce a new MachOp MO_VF_Broadcast to differentiate between pack and broadcast
  • Add tests for divide and pack
Abhiroop added inline comments.Jun 12 2018, 4:55 PM
AndreasK added inline comments.Jun 15 2018, 5:13 AM

So I did some reading.

Assuming the destination register is an xmm register:

  • If the offset is zero then we indeed should be able to just use VMOV as all scalar operations only use the lowest scalar value in the register.

See also the Intel Manual:

Intel® 64 and IA-32 Architectures Software Developer’s Manual: 11.4.1 Packed and Scalar Double-Precision Floating-Point Instructions

  • If the index is NOT zero

We can shuffle the register so that we move the given index to the lowest position.
Some instructions to look at would be PSHUFLW, PSHUFD, MOVHLPS and so on.

That way we can use just a single VEXTRACTPS for general registers, and don't have to touch the memory for xmm destinations.
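The shuffle-then-extract idea can be modelled in plain Haskell to pin down the lane semantics (`shuffle`, `extractLane` and the list representation are illustrative only, not NCG code): in a PSHUFD/SHUFPS-style shuffle each destination lane picks a source lane via two bits of the immediate, so an immediate equal to the index moves that lane into position 0, after which a single low-lane extract suffices.

```haskell
import Data.Bits (shiftR, (.&.))

-- Model a 4-lane vector as a list of lanes (lane 0 = lowest).
type Vec4 = [Float]

-- PSHUFD-style shuffle: the immediate selects, for each destination lane,
-- which source lane to copy; two bits per lane, lane 0 in the low bits.
shuffle :: Int -> Vec4 -> Vec4
shuffle imm src =
  [ src !! ((imm `shiftR` (2 * dst)) .&. 0x3) | dst <- [0 .. 3] ]

-- Extract lane i by shuffling it into the lowest position first,
-- mirroring "shuffle to the lowest position, then one extract".
extractLane :: Int -> Vec4 -> Float
extractLane i v = head (shuffle i v)  -- imm = i puts source lane i at dst 0
```

With imm = 0xE4 (lane selectors 3,2,1,0) the shuffle is the identity, matching the usual PSHUFD identity encoding.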

AndreasK requested changes to this revision.Jun 15 2018, 5:13 AM
This revision now requires changes to proceed.Jun 15 2018, 5:13 AM
carter added inline comments.Mon, Jun 18, 1:17 PM

how is register size used? it seems questionable to me that we're providing individual types rather than a set? or perhaps i don't understand this well.

XMM is also ALWAYS used by scalar floating point code, so this is DEFINITELY wrong, because single precision "Float" is 32-bit, and double precision "Double" is 64-bit ... what happens if xmm has one of those?


scalar floating point operations will probably ALWAYS be on XMM registers, because for every YMMn / ZMMn, that machine will have a valid XMMn register

but by default we do not assume we have 32 xmm registers, but that IS true on new enough cpus ... afaict :)


are these for Floats/Doubles or Int32/Word32/Int64/word64?

Abhiroop added inline comments.Mon, Jun 18, 2:42 PM

This is just a helper datatype, used at line 960 of this file to avoid rewriting nearly the same function four times for the four arithmetic instructions. It will also be used for other datatypes like Int32, Word32, etc.

Abhiroop added inline comments.Mon, Jun 18, 2:56 PM

how is register size used?

I mentioned the use case in the comments above. For Cmm like this

_B1::Fx4V128 = XMM1; //CmmAssign

Before my change, the above would break in the Cmm linter. To understand why this would break before, we need to see the notion of weak equality defined for the datatype CmmType: https://github.com/ghc/ghc/blob/master/compiler/cmm/CmmType.hs#L84

because single precision "Float" is 32bit, and double precision "Double" is 64 bit ... what happens if xmm has one of those?

At the level the linter is working, it is not checking the width. It just takes the width (W32 for Float or W64 for Double) and generates a FloatCat https://github.com/ghc/ghc/blob/master/compiler/cmm/CmmType.hs#L106

The linter works using the weak equality notion mentioned in the previous paragraph; it doesn't actually check the width.

carter added inline comments.Mon, Jun 18, 5:51 PM

wait ... WHY are you setting the register?

what do you mean by XMM1?

Abhiroop added inline comments.Mon, Jun 18, 9:27 PM

I am not sure what you are asking but I have specifically not added anything extra.

For the simple code below, which lifts an unlifted type:

data FloatX4 = FX4# FloatX4#

instance Show FloatX4 where
  show (FX4# f) = case (unpackFloatX4# f) of
    (# a, b, c, d #) -> show ((F# a), (F# b), (F# c), (F# d))

main :: IO ()
main = print (FX4# (plusFloatX4# (broadcastFloatX4# 1.3#) (broadcastFloatX4# 2.2#)))

The output Cmm looks like this:

[Main.FX4#_entry() //  [XMM1]
         { info_tbls: [(c1NK,
                        label: Main.FX4#_info
                        rep: HeapRep static { Fun {arity: 1 fun_type: ArgSpec 9} }
                        srt: Nothing)]
           stack_info: arg_space: 8 updfr_space: Just 8
       c1NK: // global
           _B1::Fx4V128 = XMM1;   // CmmAssign
           goto c1NM;   // CmmBranch
       c1NM: // global
           Hp = Hp + 24;   // CmmAssign
           if (Hp > HpLim) (likely: False) goto c1NO; else goto c1NN;   // CmmCondBranch
       c1NO: // global
           HpAlloc = 24;   // CmmAssign
           goto c1NL;   // CmmBranch
       c1NL: // global
           XMM1 = _B1::Fx4V128;   // CmmAssign
           R1 = Main.FX4#_closure;   // CmmAssign
           call (stg_gc_fun)(XMM1, R1) args: 8, res: 0, upd: 8;   // CmmCall
       c1NN: // global
           // allocHeapClosure
           I64[Hp - 16] = Main.FX4#_con_info;   // CmmStore
           Fx4V128[Hp - 8] = _B1::Fx4V128;   // CmmStore
           _c1NJ::P64 = Hp - 15;   // CmmAssign
           R1 = _c1NJ::P64;   // CmmAssign
           call (P64[Sp])(R1) args: 8, res: 0, upd: 8;   // CmmCall
 section ""data" . Main.FX4#_closure" {
         const Main.FX4#_info;

As you can see, the generated Cmm _B1::Fx4V128 = XMM1; // CmmAssign would not pass the Cmm linter before. This change specifically fixes that.

Abhiroop updated this revision to Diff 17011.Tue, Jun 19, 8:40 AM
  • Modify unpack to use shuffle operation and add tests for it

why do we have a normal CMM expression changing how we do code gen, this seems deeply wrong to me.


Ok, you *may* wanna also see about inspecting the CMM before register allocation as well as after.

it sounds like you're only looking at the post register allocation CMM, both should be looked at i think


wait wait wait ...

why do we have "(CmmLit lit)"

whats this argument for? This seems wrong (i dont think we have anything preventing literals from being in a register or something ... ), and even once we add / work out support for compile time constants, they won't / shouldn't be normal CMMExpressions in the CMM layer, they should be fields in the primop constructor!

Good progress.

Make it a habit to mark comments you consider dealt with as done. That way it's clear from everyone involved what we should still look at!


This conflicts with the comment above!




Now we know. But every new person looking at it will wonder the same thing.

Add a short comment to explain this. I would also suggest moving it closer to where it is used. @carter what do you think?




Update the error message here.
Add should now fail because it requires two operands, not because it requires LLVM, I assume.


Assuming the pattern match is exhaustive why not make this a function that just fills in the instruction?


VPSHUFD requires AVX. Imo it is overly restrictive to require avx to unpack floats.

The AVX version is probably faster (but verify that!). If so we should generate different code depending on what the cpu supports.

Also, why is it VMOVUPS but VPSHUFD?
Shouldn't that be dependent on the Format, or at least match between the two?

Abhiroop marked 19 inline comments as done.Tue, Jun 19, 10:48 AM
Abhiroop marked an inline comment as done.
Abhiroop marked an inline comment as done.Tue, Jun 19, 10:50 AM

not sure about some bits, but heres some more thoughts


I thought the ABI was changed for ghc 8.4 or 8.6?! hrmmm, this looks stale and needs more thought


negate should be easy ... don't we just do a floating point bitwise XOR with a mask that's 1 on the sign bit and zero everywhere else?

except ... one issue we're gonna need to think about is how all these operations behave for NaN! (i suppose i can help with that after we get everything else working)
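The sign-bit trick can be checked on scalars in plain Haskell (negateViaXor is an illustrative name; the SIMD version would be an XORPS/XORPD against a broadcast mask, and the cast functions assume base >= 4.10): flipping only the sign bit negates the value, and since no arithmetic is performed it behaves uniformly even for NaN inputs, unlike computing 0 - x.

```haskell
import Data.Bits (xor)
import GHC.Float (castFloatToWord32, castWord32ToFloat)

-- Negate a Float by flipping only the sign bit: the scalar analogue of
-- XORPS with a mask that is 1 on each lane's sign bit and 0 elsewhere.
negateViaXor :: Float -> Float
negateViaXor f = castWord32ToFloat (castFloatToWord32 f `xor` 0x80000000)
```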


we are gonna have both SSE and AVX encodings of most things; the current work is simplifying in the following ways

  1. only handling 4xFloat# vectors
  2. first getting the AVX encoding working. (i think ideal end state is to have which instruction subset toggled by march style flags)

later we should figure out how to have per function definition "march / ISA flag" pragmas, so end users can toggle per definition (though per module wouldn't be that bad I guess?)

Abhiroop added inline comments.Tue, Jun 19, 9:45 PM

@AndreasK The PS in VMOVUPS refers to single-precision float and the D in VPSHUFD refers to double word (or 32 bits). So currently it works for 32-bit Floats.

I want this to be dependent on the Format type, but vector instructions seem to have a lot of variations, like VBROADCASTSS, where I have yet to figure out what the SS stands for. For now I am hardcoding and not using the Format info in the pretty printer either, but I will eventually need to think about doing that.

Abhiroop added inline comments.Tue, Jun 19, 9:50 PM

I added this change many months back when I started to work with AVX2 (256-bit) instructions and didn't have a full understanding of the problems at hand. I can remove this.

AndreasK added inline comments.Wed, Jun 20, 5:43 AM

VPSHUFD refers to double word (or 32 bits).

I guess that makes sense as it can also be used for ints.

VBROADCASTSS which I am yet to figure out what the SS stands for.

As far as I know for SIMD instructions the PS postfix usually indicates packed single-precision while SS stands for scalar single-precision.

It's pretty clear that this holds up for VBROADCAST too if you look at the Intel instruction reference.

  • SS -> Broadcast the low single-precision floating-point element in the source operand to four locations in xmm1.
  • SD -> ... double precision ...
  • F128 -> Broadcast 128 bits of floating-point data ...

As soon as it isn't a scalar being broadcast the first S is dropped.
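Given that naming scheme, the mnemonic suffix could be computed from the operand shape instead of hardcoded, roughly along these lines (FpFormat and suffix are hypothetical stand-ins for the NCG's Format information, and the F128 broadcast variant is ignored here):

```haskell
-- Hypothetical element-format type; the real patch would key this off
-- the NCG's Format.
data FpFormat = FF32 | FF64  -- single- vs double-precision lanes

-- Suffix = scalar/packed letter followed by the element-width letter,
-- e.g. VADD ++ "PS", VBROADCAST ++ "SS".
suffix :: Bool -> FpFormat -> String
suffix scalar fmt = prefix ++ width
  where
    prefix = if scalar then "S" else "P"
    width  = case fmt of { FF32 -> "S"; FF64 -> "D" }
```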

Abhiroop added inline comments.Wed, Jun 20, 8:24 AM

I am not sure I follow what you meant here.

Do you mean that instead of using a separate datatype with VADD, VSUB etc constructors, we straightaway have constructors called VADDPS, VSUBPS and modify the Instr type to accept a type like VectorArithInstns?

AndreasK added inline comments.Wed, Jun 20, 9:37 AM

I was thinking along the lines of that:

code dst = case op of
  -- opcode src2 src1 dst <==> dst = src1 `opcode` src2
  VADD -> arithInstr VADDPS
  VSUB -> arithInstr VSUBPS
  where
    arithInstr instr = unitOL (instr format (OpReg reg2) reg1 dst)
Abhiroop added inline comments.Wed, Jun 20, 10:33 AM

I have been avoiding this change because of the getRegisterReg function:

getRegisterReg :: Platform -> Bool -> CmmReg -> Reg

The second argument is a Bool checking the SSE2 flag. Most of my functions work with the AVX flag, in which case I would have to add another parameter and modify getRegisterReg almost everywhere, because it is frequently used.

I could obviously work around it by sending True in the second argument and adding some hacks around this, but I actually want to extend the getVecRegisterReg function in the future.

Keeping a separate getVecRegisterReg function (which currently has a parameter for just the AVX flag) allows me to extend just this function later to include all the other families like SSE, SSE2, AVX2, etc.

I was thinking of creating a separate type:

data MicroArch = MicroArch { sse   :: Bool
                           , sse2  :: Bool
                           , avx   :: Bool
                           .... }

and modifying the signature of getVecRegisterReg to:

getVecRegisterReg :: Platform -> MicroArch -> CmmReg -> Reg

Abhiroop updated this revision to Diff 17033.Wed, Jun 20, 11:27 AM
  • Add vector negate operation and tests
  • Fix other code review comments
Abhiroop marked 15 inline comments as done.Wed, Jun 20, 11:37 AM
Abhiroop updated this revision to Diff 17034.Wed, Jun 20, 11:48 AM
  • Add some explanatory comments
Abhiroop marked 7 inline comments as done.Wed, Jun 20, 11:50 AM
AndreasK added inline comments.Wed, Jun 20, 3:59 PM

and modify getRegisterReg almost everywhere, because it is frequently used.

There are only about 30 uses which concern us.
I get that it's annoying, but most of these call it with True/False, so it should be trivial to update. We might have spent more time discussing this by now than it would take to change it.

I was thinking of creating a separate type:

That sounds like a great idea! Do you see issues stopping us from just making getRegisterReg take a MicroArch argument instead?
Seems like a much cleaner solution to me.

Maybe we can even get away with just MicroArch = SSE2 | SSE4 | SSE4_2 | ...?
I think there is no architecture which supports newer SIMD extensions but drops old ones. So we can define Ord and just do
if arch >= sse2 then cleanSolution else fpStackHack, and the like.
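A minimal sketch of that sum-type idea with derived Ord (constructor names and selectFpCode are illustrative, and the premise that each level implies all earlier ones is an assumption about the targeted CPUs):

```haskell
-- Extension levels in increasing order; the derived Ord instance makes
-- "arch >= SSE2" mean "supports at least SSE2".
data MicroArch = SSE | SSE2 | SSE4 | SSE4_2 | AVX | AVX2
  deriving (Eq, Ord, Show)

-- Pick a code path based on the minimum required extension.
selectFpCode :: MicroArch -> String
selectFpCode arch
  | arch >= SSE2 = "cleanSolution"  -- SSE2 scalar FP registers
  | otherwise    = "fpStackHack"    -- legacy x87 FP stack code
```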

We often have the pattern for sse of:

use_sse2 <- sse2Enabled
getRegisterReg ... use_sse2

Then we could do somthing like below instead:

sse_level <- sseSupport :: m MicroArch
getRegisterReg pf sse_level reg

But I also understand the desire/need to get this working before you think about clean code.
I suggest to leave it as is for now and once it's time to extend getVecRegisterReg replace it with an extended version of getRegisterReg instead.

I've been playing around with code on this branch. If I want to merge code into it, how might I go about it?

Also, I've discovered my CPU doesn't actually have AVX512, but it has AVX2, so I'm looking to be able to write code against AVX2 flags.

Abhiroop added inline comments.Thu, Jun 21, 8:39 AM

Maybe we can even get away with just MicroArch = SSE2 | SSE4 | SSE4_2 | ...?

I had initially thought of something along almost exactly the same lines, and then defining an Ord instance for the datatype. But in an old discussion with @carter he actually mentioned that it is not always true that each new family of microarchitectures is a superset of the older families; they might not always support the previous micro-architectures. I recall him mentioning that maybe not the Intel or AMD machines, but certain custom x86 architectures have that property.

Also, making MicroArch a sum type would imply checking or working with one flag at a time. However, I thought of a record type because there will be situations where multiple flags might be switched on, like MicroArch{ sse4 = True, avx = True, sse1 = False, .....}, and the register allocation would have fancier logic taking this into account. Although this might be me overthinking. :)

I put a test PR up to see if I can make a PR. Looks like the base is different from that in D4813:


Abhiroop updated this revision to Diff 17098.Tue, Jun 26, 5:02 PM
  • Rewrite the pretty printer to use the Format information and generalize the instructions
  • Add support for packing, unpacking, broadcasting doubles
  • Add support for arithmetic operations on doubles
  • Add tests for the above, and minor changes in the linter and other places to accommodate doubles
Abhiroop retitled this revision from [WIP] Add support for broadcastFloatX4# and unpackFloatX4# operations to [WIP] Add support for SIMD operations in the NCG.Tue, Jun 26, 6:09 PM
Abhiroop edited the summary of this revision. (Show Details)
Abhiroop added inline comments.Tue, Jun 26, 6:38 PM

This part currently works, but it is not fully correct. This function maps an STG register to the respective CmmType. It initially stated cmmVec 4 (cmmBits W32), which was incorrect because the register can possibly hold 16 Int8s or 8 Int16s, etc.

And even now it states that an XMM register can hold only two 64-bit doubles. While this change makes the code run, we need to somehow modify the GlobalReg type to capture some more information, or at least pass some more info (like length and width) to the function globalRegType, so that a correct CmmType can be generated which covers all the cases.


whats this argument for?

@carter when you mentioned that this should be part of a prim-op constructor, do note that in this operation I am using CmmLit just to represent an offset value. So are you suggesting that the offset value should be part of the prim-op constructor, actually making it MO_VF_Extract Length Width Offset? Don't you think that actually leaks details of the assembly onto the constructors of Cmm, the very thing we were discussing that we want to avoid?

Maybe I didn't get what you meant, but I think your comment was just about the shuffle operations, where we should have something along the lines of MO_V_Shuffle Length Width Lit, where the third field of the constructor is the compile-time literal.

This seems wrong (i dont think we have anything preventing literals from being in a register or something ... ),

If you are asking about the utility of CmmLit, which is a constructor of the CmmExpr datatype, I am not sure what purpose it serves. It existed before and is used in a number of places in the code generator. Also, nothing prevents the literals from being in registers. Maybe someone with a better idea of the code generator can answer what purpose CmmLit serves.

Abhiroop updated this revision to Diff 17115.Thu, Jun 28, 4:35 PM
  • Disable the liveness analysis for vector operations when running with -O2
  • Add tests for packing and unpacking with -O2
Abhiroop updated this revision to Diff 17134.Fri, Jun 29, 10:24 AM
  • Re-enable the liveness analysis of vector operations
  • Change the constructors and other handlers of packing operations
  • Temporarily disable inlining for vector operations
Abhiroop added inline comments.Fri, Jun 29, 10:42 AM
454 ↗(On Diff #17134)

@carter @AndreasK

I modified the arguments of MO_VF_Insert from something like %MO_VF_Insert_4_W32(1.5 :: W32, 0 :: W32); to something like %MO_VF_Insert_4_W32(_c1LB::Fx4V128, 9.2 :: W32, 0 :: W32); where the first argument refers to the Cmm register where the packing is taking place.

While this allows me to remove my liveness hack, it gives rise to another issue: the Cmm liveness analysis has a check, noLiveOnEntry, which doesn't allow us to enter a block with a live register like _c1LB::Fx4V128. noLiveOnEntry is switched on only for -O2. So for -O2 to work I have to define something like

_c1LB::Fx4V128 = <0.0 :: W32, 0.0 :: W32, 0.0 :: W32,
                  0.0 :: W32>;   // CmmAssign
_c1LB::Fx4V128 = %MO_VF_Insert_4_W32(_c1LB::Fx4V128, 9.2 :: W32,
                                     0 :: W32);   // CmmAssign
_c1LB::Fx4V128 = %MO_VF_Insert_4_W32(_c1LB::Fx4V128, 8.15 :: W32,
                                     80 :: W32);   // CmmAssign
_c1LB::Fx4V128 = %MO_VF_Insert_4_W32(_c1LB::Fx4V128, 7.0 :: W32,
                                     160 :: W32);   // CmmAssign
_c1LB::Fx4V128 = %MO_VF_Insert_4_W32(_c1LB::Fx4V128, 6.4 :: W32,
                                     240 :: W32);   // CmmAssign

The first line means that I end up XORing the register in assembly anyway. And it raises another complication in -O2: it tries to inline, and the above Cmm becomes:

_c1LB::Fx4V128 = %MO_VF_Insert_4_W32(<0.0 :: W32, 0.0 :: W32, 0.0 :: W32,
                  0.0 :: W32>, 9.2 :: W32,
                                     0 :: W32);   // CmmAssign
_c1LB::Fx4V128 = %MO_VF_Insert_4_W32(_c1LB::Fx4V128, 8.15 :: W32,
                                     80 :: W32);   // CmmAssign

_c1LB::Fx4V128 = %MO_VF_Insert_4_W32( %MO_VF_Insert_4_W32(_c1LB::Fx4V128, 7.0 :: W32,
                                     160 :: W32);, 6.4 :: W32,
                                     240 :: W32);   // CmmAssign

It not only inlines <0.0 :: W32, 0.0 :: W32, 0.0 :: W32, 0.0 :: W32> but also starts inlining the constructors, which causes the pattern matches in the code generator to start failing.

So the above change disables inlining temporarily for vector operations. Let me know if you have any better suggestions.

AndreasK added inline comments.Sat, Jun 30, 8:11 AM
454 ↗(On Diff #17134)

which doesn't allow us to enter a block with a live register

You can avoid this by replacing one insert with a mov instruction like we talked about on IRC.

results that the pattern matches in the code generators starts failing.

Is this because you expect the arguments to the MachOP to already be in a register? You should be able to handle that case using getSomeReg. I don't think we should special-case this so much in the Cmm stage; rather, just evaluate the expression into a register during X86 codegen when it's not already in one.

Abhiroop updated this revision to Diff 17237.Sun, Jul 8, 3:28 PM
  • Enrich STG vector registers with length, width and register type information
  • Add support for the SSE family of vector instructions
carter added a comment.Mon, Jul 9, 2:02 PM

just did a first pass on the latest changes, though only made my way through the first 50% :)


you can definitely define a helper function instead of duplicating the code here twice


same issue applies to YMM and ZMM,

after all, XMM registers are just the lower 128 bits of those!




a TODO we can address some other time, because there are C intrinsics per platform

Abhiroop marked 7 inline comments as done.Mon, Jul 9, 8:11 PM
Abhiroop added inline comments.

As I was experimenting with this representation, I wanted to keep the surface area of the experiment as small as possible. But yes, I will eventually extend this to YMM and ZMM registers.


Yes, I didn't understand what this comparison means!

Abhiroop updated this revision to Diff 17250.Mon, Jul 9, 8:25 PM
  • Remove the inlining hack and allow -O2 to fully work for vectors
Abhiroop updated this revision to Diff 17313.Sat, Jul 14, 12:01 AM
  • Modify all pack/broadcast variants to accept any CmmExpr argument and rewrite codegen for that
  • Modify the StgCmmPrim definition of broadcast
  • Modify CmmLint condition for broadcast