1. Modify the StgCmmPrim bridge to remove the wasted register zeroing 2. Add some vector register utils like making temporary virtual registers 3. Add the width hardcoded version of braodcast 4. Add the entire pretty printing logic for the 3...

Authored by Abhiroop on Jun 3 2018, 2:15 PM.




Abhiroop created this revision.Jun 3 2018, 2:15 PM
Abhiroop removed subscribers: rwbarton, thomie, carter.
Abhiroop updated this revision to Diff 16686.Jun 3 2018, 2:52 PM
  • Add a vecFormat to width helper function, fix some arcanist linter comments and remove undefined instances
Abhiroop edited reviewers, added: carter; removed: simonmar.Jun 3 2018, 2:53 PM
Abhiroop added inline comments.Jun 3 2018, 3:03 PM

@carter This is the issue in the STGtoCmm bridge that I was mentioning. For an individual broadcast operation it emits n "MO_VF_INSERT" Cmm instructions which is not necessary.


The second issue in the STG to Cmm bridge. According to me this assignment was unnecessary.

The Cmm for this originally looked like this

c1JZ: // global
           _c1JS::Fx4V128 = <0.0 :: W32, 0.0 :: W32, 0.0 :: W32,
                             0.0 :: W32>;   // CmmAssign
           _c1JT::Fx4V128 = %MO_VF_Insert_4_W32(_c1JS::Fx4V128, 1.5 :: W32,
                                                0 :: W32);   // CmmAssign
           _c1JU::Fx4V128 = %MO_VF_Insert_4_W32(_c1JT::Fx4V128, 1.5 :: W32,
                                                1 :: W32);   // CmmAssign
           _c1JV::Fx4V128 = %MO_VF_Insert_4_W32(_c1JU::Fx4V128, 1.5 :: W32,
                                                2 :: W32);   // CmmAssign
           _c1JW::Fx4V128 = %MO_VF_Insert_4_W32(_c1JV::Fx4V128, 1.5 :: W32,
                                                3 :: W32);   // CmmAssign
           _c1JR::Fx4V128 = _c1JW::Fx4V128;   // CmmAssign
           XMM1 = _c1JR::Fx4V128;   // CmmAssign
           call (P64[Sp])(XMM1) args: 8, res: 0, upd: 8;   // CmmCall

The _c1JS register is just thrown away and I think it is not contributing anything to the Cmm due to which I commented out this part.

carter added a comment.Jun 3 2018, 5:21 PM

@Abhiroop when the intel docs say the args are xmm1 xmm2 and xmm3, they dont mean those specific registers, they mean "x y z are the args, and theyy're all xmm registers",
I think this confusion came up a few times when we spoke? and I didn't understand at the time what was causing it. :)

this is just a first pass at a bunch of stuff, ben and andreask might be more structured/ immediately actionable points


why did you change this to 1? That seems wrong, but perhaps i'm not seeing something


every registerized platform is native code gen or llvm, so this error isn't quite helpful.

we're gonna want to say something a tad more specific for why we do / dont support a given instruction


could you explain this more?


this branch seems redundant at the moment :)


what do you mean by temporary register here?


this looks wrong. There should be a register result, a register arg, and a register or memory location arg (which i guess might be an operand expression?)

VEX.NDS.128.0F.WIG 58 /r
VADDPS xmm1,xmm2, xmm3/m128

Add packed single-precision floating-point values from
xmm3/m128 to xmm2 and store result in xmm1.


why is the source input different here? (vs eg vmovups or the like)

i guess this is a broader question

what is the difference between mkRU [x] [y]

mkrRU (use_R x []) [y]


these seem correct


you mean XMM ?


we will want to support length 2 and perhaps length 1 too!


hold up.... our scalars can be Float, or Double on the floating point side... (which are different)

not sure on the int side. native hardware supports both signed and unsigned ints/words of various sizes


we are never going to see 80 bit floats in simd, ever. Not sure if thats relevent


it does seem odd that we dont currently support encoding float literals correctly , but thats another task i suppose

it'd be around doing the hex / power of two encoding (which doesn't really have rounding issues)


doesn't add take two input registers plus a result register?

this seems like it should be the same as the machinery for VPXOR (though less generic in what vectors you accept)


so I assume this is LOGICAL XOR

VEX.NDS.128.66.0F.WIG EF /r
VPXOR xmm1, xmm2, xmm3/m128
Bitwise XOR of xmm3/m128 and xmm2.

this looks correct.

I *think* this works for both float/double/word32/word64 etc/,,, though i could be wrong


this seems correct... i think

Abhiroop added inline comments.Jun 4 2018, 7:06 AM

Yes you are correct, this is one of the things I have to fix.


The only use case of this constructor is passing the VecFormat instead of Format.


Yes will be refactoring this part of the code as well


Yes you are correct this is wrong. This should look similar to the VPXOR case.




Yeah I think this is because x86 assembly does not allow addressing float literals like "$1.5". Floats have to be hex encoded.

Abhiroop edited reviewers, added: AndreasK; removed: simonmar.Jun 4 2018, 7:19 AM
carter requested changes to this revision.Jun 4 2018, 8:46 AM

Theres no tests! (even though we can't run them yet, lets write some tests!)


shorter widths, like 64 or 32 bits... ARE valid :)


i think this was written for llvm of llvm, and thats not the right desugarring depending on the target backend / platform


i guess i'll have to read the register code myself :)
i'm just not sure if we need a distinguished vector type? but that seems fine for now


why did you put 80 here? this seems blatantly wrong

when you're not sure what do to, make it an error, not a wrong hting :)

panic/error/etc and also make it clear from comments / error message th the code in question isn't supposed to be final

*never put wrong code* in :) throw an error instead !

This revision now requires changes to proceed.Jun 4 2018, 8:46 AM
Abhiroop added inline comments.Jun 4 2018, 11:56 AM

@carter Good catch was meaning to remove this, I missed this one


the mkRU function only accepts arguments of the type "Reg". the The type of src is "Operand" so it has to be manually converted to a "Reg" type using "use_R"

I think it would make sense to make VecFormat an Constructor of Format instead of it's own type.

This would remove a lot of the duplication of functions. While it's fine for now to keep it so, it will only be more work to change this later.


What is the reason why this needs to be a new data type instead of extending Format?

If you make them distinct you have to replicate many code paths.
This already shows by the doThing/doThingVec duplication.

There might be good reasons to do so. But if so document them. If not change this!


ScalarFormat to me implies that it describes the format of an scalar fully. Maybe something like VectorType would be better?


Duplication because of the VecFormat/Format distinction.


Duplication because of the VecFormat/Format distinction.


Duplication because of the VecFormat/Format distinction.


Take a look at the LDATA pseudo instructions for this. The NCG will convert these to static data you can reference.

carter added inline comments.Jun 4 2018, 2:37 PM

additionally, we should NEVER be just saying Int. Thats ambiguous, it needs to always have the size. Int32 Int64 etc. Plus Word too!

Float vs Double is unambiguous, but also certainly differnt from just being Float :)

i guess this gets into having native support for Int32# and Int64# along with analogues Word sized unboxed types

carter added a comment.Jun 4 2018, 2:40 PM

i think it might be worth poking at https://phabricator.haskell.org/D4475 and getting it in shape for perhaps ghc 8.8 as a prerequisite for this set of work to have the right internal data model

I'll try to hack on it after mid june perhaps


I guess this exact thing might motivate us iterating on https://phabricator.haskell.org/D4475 so we can do the correct api here

Could we have a little context for this diff please? A descriptive summary would be really great. What problem are you solving? Do you have a test plan and results?

@simonmar This diff is basically my first diff in extending the native code generator to add support for SIMD instructions, in response to the trac ticket : https://ghc.haskell.org/trac/ghc/ticket/7741 I am doing this project as part of Google Summer of Code (GSoC).

This diff was my first attempt in using phabricator and arcanist as a result I did not add any detailed summary. Sorry about that. The intention of this diff was solely for @carter, my GSoC mentor, to give me feedback on whatever I have done till now and make me familiar with using Phabricator. I didn't want to notify you and tried removing your handle from the reviewers list, however I think it is configured to automatically notify you.

This piece code is very much a work in progress and contains a number of hardcoded results. The basic approach is as listed in this comment: https://ghc.haskell.org/trac/ghc/ticket/7741#comment:13

In this diff I am trying to add the support of broadcast operations for floats, which is the function:

broadcastFloatX4# :: Float# -> FloatX4#

I do intend to add a number of tests and integrate the feedback that I received from @carter and @AndreasK on this diff and raise a new diff with a number of other operations this week. So for now, please ignore this diff.

@Abhiroop thanks for the clarification!

Assuming all changes in here are also in the other diff you could abandon this one to make it clear.

Abhiroop abandoned this revision.Jun 11 2018, 8:39 AM