Add cmpeq pack int instructions
Needs RevisionPublic

Authored by newhoggy on Jun 15 2018, 12:06 AM.

Details

Summary

Add cmpeq pack int primop (native code generation)

newhoggy created this revision.Jun 15 2018, 12:06 AM

Work In Progress.

Will need to test on both native and llvm.

newhoggy updated this revision to Diff 16929.Jun 15 2018, 6:53 AM
  • Fix test and emulated primops. Add llvm support.

cmpeq* instructions are working both natively and with llvm.

I will follow up with the same for cmpgt* instructions as well.

newhoggy updated this revision to Diff 16932.Jun 15 2018, 8:51 AM
  • Add cmpgt packed int instructions
newhoggy updated this revision to Diff 16951.Jun 15 2018, 8:59 PM
  • Add -mmmx flag
newhoggy updated this revision to Diff 16952.Jun 15 2018, 9:01 PM
  • Add -mmmx flag
newhoggy updated this revision to Diff 16953.Jun 15 2018, 9:44 PM
  • Add -mmmx flag

I just added the -mmmx compiler flag. Only problem is it seems to be on permanently. I'm expecting that if I don't specify it that it should fall back to the functions in cmpeqpi.c and cmpgtpi.c.

newhoggy updated this revision to Diff 16956.Jun 16 2018, 10:22 AM
  • Fix instruction mnemonic name

It turns out that GHC will append a q to the name of my instruction, which in the case of the instructions I want to support is incorrect:

/var/folders/8d/3xbnllbx76gcbk3wwy086vlm0000gn/T/ghc97266_0/ghc_2.s:464:2: error:
     error: invalid instruction mnemonic 'pcmpeqbq'
            pcmpeqbq %rbx,%rax,%rax
            ^~~~~~~~
    |
464 |         pcmpeqbq %rbx,%rax,%rax
    |  ^

This is likely due to the fact that the instruction I'm trying to implement a primop for is a SIMD instruction and that these are treated specially.

I will have to look at some example SIMD instructions and figure out what needs to change.

I'd like to rebase on D4813 to get access to some of that code. What's the best way to go about this?

@newhoggy Not very sure how you would proceed to rebase the work on D4813 but I can point you to the original repo where this work is happening: https://github.com/Abhiroop/ghc-1/tree/wip/simd-ncg-support

Do note that the last time I rebased ghc mainline onto my branch was nearly a month back so if you are using some very new month old features, they won't be available at my branch.

newhoggy updated this revision to Diff 17014.Jun 19 2018, 12:48 PM

Add cmpeq pack int instructions

newhoggy added a comment.EditedJun 19 2018, 12:50 PM

Maybe I can managed to get this to work without rebasing on D4813, but I will need to be able to move the data to and from an MMX register.

If I don't, I get this error:

$ cat CmpEq.hs
{-# LANGUAGE MagicHash #-}

module Main where

import Data.Bits
import GHC.Int
import GHC.Prim
import GHC.Word
import Data.Int
import Data.Word
import System.IO

cmpeq_pi8  (W64# a#) (W64# b#) = W64# (cmpeq_pi8_w64#  a# b#)
cmpeq_pi16 (W64# a#) (W64# b#) = W64# (cmpeq_pi16_w64# a# b#)
cmpeq_pi32 (W64# a#) (W64# b#) = W64# (cmpeq_pi32_w64# a# b#)

main :: IO ()
main = do
  putStrLn $ "Data: " <> show (cmpeq_pi8 0 0)
  putStrLn $ "Data: " <> show (cmpeq_pi8 0xffffffffffffffff 0x0000000000000000)
$ ./bin/ghc -mmmx CmpEq.hs
[1 of 1] Compiling Main             ( CmpEq.hs, CmpEq.o )

/var/folders/8d/3xbnllbx76gcbk3wwy086vlm0000gn/T/ghc90451_0/ghc_2.s:54:10: error:
     error: invalid operand for instruction
            pcmpeqb %bl,%al,%al
                    ^~~
   |
54 |         pcmpeqb %bl,%al,%al
   |          ^
...

Comparing with what gcc does, the following shows I need to access two MMX registers:

$ cat cmpeq.c
#include <mmintrin.h>
#include <unistd.h>

int main() {
  __m64 a = (__m64)0x1L;
  __m64 b = (__m64)0x2L;
  __m64 x = _mm_cmpeq_pi8 (a, b);
  return (int)(uint64_t)x;
}
$ gcc -S cmpeq.c
$ cat cmpeq.s | grep -A 5 -B 5 cmp
	movq	-24(%rbp), %rcx
	movq	%rax, -32(%rbp)
	movq	%rcx, -40(%rbp)
	movq	-32(%rbp), %mm0
	movq	-40(%rbp), %mm1
	pcmpeqb	%mm1, %mm0
	movq	%mm0, -8(%rbp)
	movq	-8(%rbp), %rax
	movq	%rax, -96(%rbp)
	movq	-96(%rbp), %rax
	movq	%rax, -72(%rbp)

How might I achieve that?

umm, perhaps this should be folded in the work abhiroop is doing?

if this is using XMM registers, ghc needs a bit of reworking for supporting integer values in SIMD registers...

to clarify: ghc currently has no understanding of int/word can be in xmm register.

to move them between GPR and XMM register afaik you need to write/read them to memory and back. Or something to that effect

The register allocator currently doesnt know any of this, so it might need to be a fat primop rather than simple one if you want to get it working sooner than later .. (and the summer simd work will then subsume that i gues?)

Abhiroop added inline comments.Jun 19 2018, 5:01 PM
compiler/nativeGen/X86/Instr.hs
353

Not sure if I am missing something but from the docs I can see the PCMPEQ family has one source operand and one destination operand. https://www.felixcloutier.com/x86/PCMPEQB:PCMPEQW:PCMPEQD.html

You seem to be using two source operands which looks to me as one of the causes of the error.

Apart from that, the doc states PCMPEQB mm, mm/m64, as you must have read mm refers to MMX registers. In Codegen.hs you are using the getNewRegNat function to request for a register which provides you a virtual register which is eventually mapped to a real register by the register allocator. However I don't know if currently GHC has support for allocation of MMX registers. Maybe @carter or someone else can comment on how GHC handles MMX registers currently.

If you compare your output with GCC's output :

Currently : pcmpeqb %bl,%al,%al
GCC: pcmpeqb %mm1, %mm0

As you can see apart from the error in the number of operands, GCC is actually using the MMX register mm1 and mm0 whereas you are using the general registers.

Perhaps instead of working with the -mmx flag and family you can try using the SSE2 based PCMPEQB xmm1, xmm2/m128.

Like @carter said currently GHC has no understanding about how integers can reside in XMM registers. You would need to port an instruction like MOVDQU xmm2/m128, xmm1 which can read from memory to an XMM register and then use the PCMPEQB instruction.

This can definitely be folded into the work that I am doing currently.

Thanks for your advice.

I'd like to park this differential for the moment and move on to _mm512_shuffle_epi8 (or vpshufb), which is a closer match to the AVX work that @Abhiroop is doing on D4813 and come back to this when I better understand the register situation.

https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=765,767,802,3894,802,767,4757,4757&text=_mm512_shuffle_epi8

Thanks for your advice.

I'd like to park this differential for the moment and move on to _mm512_shuffle_epi8 (or vpshufb), which is a closer match to the AVX work that @Abhiroop is doing on D4813 and come back to this when I better understand the register situation.

https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=765,767,802,3894,802,767,4757,4757&text=_mm512_shuffle_epi8

https://ghc.haskell.org/trac/ghc/ticket/15250

Thanks for your advice.

I'd like to park this differential for the moment and move on to _mm512_shuffle_epi8 (or vpshufb), which is a closer match to the AVX work that @Abhiroop is doing on D4813 and come back to this when I better understand the register situation.

https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=765,767,802,3894,802,767,4757,4757&text=_mm512_shuffle_epi8

https://ghc.haskell.org/trac/ghc/ticket/15250

hrmmm, i suspect the AVX512 stuff may not be terrible performant / some / many intel CPUS in consumer / dev machine hands wont support them, have you tested using it via C FFI for your inner loops first?

bgamari requested changes to this revision.Oct 28 2018, 6:48 PM

What is the status of this, @newhoggy?

This revision now requires changes to proceed.Oct 28 2018, 6:48 PM

I'm not currently working on this.

However, I'd be interested in resuming work if the @Abhiroop's refactoring around this area is complete to the point where I can implement integer SIMD primops without difficulty.

Last time I left, I believe there were some issues around something like constant folding that wasn't working properly for SIMD integers.

@newhoggy The implementation of the SIMD integers are dependent on the landing of the Int8#, Int16# etc patches. Once those land I believe we can make some progress in implementing SIMD integer support.

Thanks for the update @Abhiroop!