Add support for faster copies via Intel Extended REP-MOVSB

Authored by austin on Jun 5 2014, 9:24 AM.



On Ivy Bridge and newer processors (i.e. post 2nd gen Core i7), the x86 instruction rep movsb has become shockingly fast, almost as fast as a hand-written and tuned AVX memcpy primitive. These changes add support for -march and -mcpu flags to GHC (mimicking the semantics of GCC's flags). When -mcpu or -march is set to Ivy Bridge or later, rep movsb is implicitly used for all copies.

Note carefully that ERMSB support is controlled by -mcpu and not -march. -mcpu technically only affects things like instruction scheduling and selection, while -march controls which instructions are available (as they might not exist on older platforms). rep movsb isn't a new instruction; it's just faster now. So we select based on the CPU tuning selection, not based on whether we can use any Ivy Bridge instructions in general.

In particular, applications can use this with full backwards compatibility, although copies may be slower on non-Ivy Bridge machines.
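As a rough illustration of the selection rule described above, here is a minimal Haskell sketch. Every name in it (X86CPU, Flags, tuningTarget, useRepMovsb) is hypothetical and is not taken from the patch. It also assumes the GCC convention that -march implies the tuning target when -mcpu is not given.

```haskell
import Data.Maybe (fromMaybe)

-- Hypothetical model only: these types are NOT the ones the patch adds to GHC.
-- Constructors are listed in release order, so the derived Ord instance
-- can be read as "at least as new as".
data X86CPU = Generic | Nehalem | SandyBridge | IvyBridge | Haswell
  deriving (Eq, Ord, Show)

data Flags = Flags
  { march :: Maybe X86CPU  -- which instructions may be emitted at all
  , mcpu  :: Maybe X86CPU  -- which CPU to tune instruction selection for
  }

-- Assumption (GCC convention): an explicit -mcpu wins; otherwise the
-- -march value doubles as the tuning target; otherwise tune generically.
tuningTarget :: Flags -> X86CPU
tuningTarget flags =
  case mcpu flags of
    Just cpu -> cpu
    Nothing  -> fromMaybe Generic (march flags)

-- rep movsb exists on every x86 CPU, so using it is purely a tuning
-- decision: it never depends on which instructions -march makes available.
useRepMovsb :: Flags -> Bool
useRepMovsb flags = tuningTarget flags >= IvyBridge
```

In this model, -march=haswell together with -mcpu=sandybridge would not select rep movsb, because the tuning target decides, while the resulting code would still run on any x86 machine either way.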

This doesn't add support for memset yet (rep stosb), but that can come later.


  • Add support for memset via rep stosb
  • Rework the names to be more generic (cc @carter)
  • Add some should_gen_asm tests.
  • Benchmarks. unordered-containers is copy sensitive so it makes a good candidate.

Test Plan

Build the compiler, compile things, and they might go faster.

Diff Detail

rGHC Glasgow Haskell Compiler
austin updated this revision to Diff 7. Jun 5 2014, 9:24 AM
austin retitled this revision to Add support for faster copies via Intel Extended REP-MOVSB.
austin updated this object.
austin edited the test plan for this revision.
austin added a reviewer: hvr.
hvr edited edge metadata. Jun 5 2014, 9:38 AM

I've given it a quick glance and have only 2 questions so far...


Off-topic since not introduced by the patch at hand, but can't we avoid the wildcard fall-through, and have GHC warn us about unhandled cases at compile time?


Why the \t?

austin added inline comments. Jun 5 2014, 9:47 AM

I'd prefer doing that in general, since we ./validate with -Werror anyway, but I think that's just "the way it's been" in the NCG for a long time. Maybe worth fixing on its own.
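The trade-off in this exchange can be sketched with a toy example. The Instr type and both functions below are hypothetical stand-ins, not code from the NCG. The point is only that a wildcard clause silently absorbs any constructor added later, while an exhaustive match lets GHC's incomplete-pattern warning (part of -Wall) flag the missing case, which becomes a build failure under -Werror.

```haskell
-- Hypothetical instruction type, standing in for the NCG's real one.
data Instr = Mov | Add | RepMovsb
  deriving (Eq, Show)

-- Wildcard style: if a new constructor is added to Instr, it falls
-- through to the default clause and the compiler stays silent.
isCopyWild :: Instr -> Bool
isCopyWild RepMovsb = True
isCopyWild _        = False

-- Exhaustive style: every constructor is spelled out, so adding a new
-- one makes this match incomplete and GHC can warn about it.
isCopyExplicit :: Instr -> Bool
isCopyExplicit RepMovsb = True
isCopyExplicit Mov      = False
isCopyExplicit Add      = False
```

Both functions agree today; they only diverge in how they behave when Instr grows.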


It's to ensure the instruction lines up with others when pretty printed; pprSizeOp_ & co. will do this for you when printing, but ptext by default will not. Otherwise you get something like:

    ... stuff ...
    mov edi, ...
    mov esi, ...
rep movsb
    ... more stuff ...

which looks ugly.
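A small sketch of the alignment issue, using plain Strings rather than GHC's actual SDoc printer. The helper withTab is hypothetical and only imitates what pprSizeOp_ and similar helpers are described as doing above; the "..." operands are placeholders copied from the listing.

```haskell
-- Hypothetical mini-renderer; GHC's real printer works on SDoc values,
-- not on plain Strings.

-- Imitates the helper printers: prefix the instruction with a tab.
withTab :: String -> String
withTab instr = '\t' : instr

-- Hand-written text must carry its own "\t", or it starts in column 0.
aligned :: [String]
aligned =
  [ withTab "mov edi, ..."
  , withTab "mov esi, ..."
  , "\trep movsb"          -- explicit tab: lines up with the others
  ]

misaligned :: [String]
misaligned =
  [ withTab "mov edi, ..."
  , withTab "mov esi, ..."
  , "rep movsb"            -- no tab: sticks out to the left
  ]
```

Printing each list with putStr (unlines ...) shows the difference directly.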

hvr accepted this revision. Jun 5 2014, 9:48 AM
hvr edited edge metadata.
This revision is now accepted and ready to land. Jun 5 2014, 9:48 AM
carter added a subscriber: carter. Jun 5 2014, 11:17 PM

I like this! (the march tooling and this workflow)


Is calling them IntelCPU / IntelFeature quite accurate? There are feature combos that, e.g., only happen on AMD chips.

austin updated this object. Jun 6 2014, 1:17 AM
austin edited edge metadata.
austin changed the visibility from "All Users" to "Public (No Login Required)". Jun 6 2014, 1:44 AM
austin added inline comments. Jun 6 2014, 1:49 AM

This could probably be changed to something like 'X86...' in such a case, but at the time I didn't particularly care, considering this feature is Intel-specific anyway.

austin added a comment. Jun 6 2014, 1:52 AM

What about AMD?

At the moment I don't own AMD hardware, nor am I aware of any equivalent to ERMSB on any current or future AMD processors.

But in general extending the flags to support AMD CPU descriptions would be good. It's just not something I've done yet - the GCC docs should have a relatively complete list, though.

Right now this is really just for the fast copy support. We should also investigate passing the same options to the C compiler (as well as LLVM), but that's another changeset anyway. I can do the cleanup @carter mentioned at least.

austin updated this object. Jun 6 2014, 1:55 AM
austin updated this object. Jun 6 2014, 3:11 AM
austin updated this revision to Diff 8. Jun 6 2014, 4:23 AM
  • ghc: s/Intel/x86/ to be more agnostic to AMD platforms
austin updated this object. Jun 6 2014, 4:24 AM
austin added a reviewer: tibbe. Jun 9 2014, 2:13 AM

Add Johan to review.

tibbe edited edge metadata. Jun 9 2014, 2:36 AM

The only issue I can see is computing register usage correctly. You're reading some fixed registers to compute the addresses.


Optional: several unitOLs can be replaced by a single toOL block.
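To make the suggestion concrete, here is a toy sketch. The OrdList below is a minimal difference-list reimplementation written for this example, not GHC's own module; only the function names unitOL, toOL, appOL and fromOL mirror the real API. The instruction strings are placeholders.

```haskell
-- Toy stand-in for GHC's OrdList, implemented as a difference list.
newtype OrdList a = OrdList ([a] -> [a])

unitOL :: a -> OrdList a
unitOL x = OrdList (x :)

toOL :: [a] -> OrdList a
toOL xs = OrdList (xs ++)

appOL :: OrdList a -> OrdList a -> OrdList a
appOL (OrdList f) (OrdList g) = OrdList (f . g)

fromOL :: OrdList a -> [a]
fromOL (OrdList f) = f []

-- Before: one unitOL per instruction, chained with appOL.
copyVerbose :: OrdList String
copyVerbose =
  unitOL "mov edi, ..." `appOL` unitOL "mov esi, ..." `appOL` unitOL "rep movsb"

-- After: a single toOL block holding the same instructions in order.
copyCompact :: OrdList String
copyCompact = toOL ["mov edi, ...", "mov esi, ...", "rep movsb"]
```

Both values flatten to the same instruction sequence, so the change is purely about readability.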


This looks wrong: you're definitely reading at least some (fixed) registers to compute the memory addresses.
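A toy model of the concern, assuming the flagged code reported no register usage for the new instruction (the inline context is not shown here, so that is a guess). The types below are hypothetical and only loosely modelled on how a register allocator tracks reads and writes. What is certain is the instruction's own behaviour: rep movsb takes its source address from esi, its destination address from edi and its byte count from ecx, and it updates all three.

```haskell
-- Hypothetical types for illustration; not GHC's real definitions.
data Reg = EAX | ECX | ESI | EDI
  deriving (Eq, Show)

data Instr = RepMovsb | Nop

-- First list: registers read; second list: registers written.
data RegUsage = RU [Reg] [Reg]
  deriving (Eq, Show)

regUsage :: Instr -> RegUsage
-- rep movsb reads esi (source), edi (destination) and ecx (count),
-- and modifies all three as it copies.
regUsage RepMovsb = RU [ESI, EDI, ECX] [ESI, EDI, ECX]
regUsage Nop      = RU [] []
```

If the usage were left empty, the allocator would be free to assume those registers stay live and unchanged across the copy, which is the kind of bug the comment is pointing at.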

tibbe requested changes to this revision. Jun 9 2014, 3:21 AM
tibbe edited edge metadata.
This revision now requires changes to proceed. Jun 9 2014, 3:21 AM
simonmar edited edge metadata. Jun 9 2014, 3:36 AM

Looking good. Don't forget about documentation for -mcpu and -march.


The point of getAnyReg is that you don't need these MOVs, you just apply code_dst to edi (wrapped in something, I forget exactly what) and it computes that operand into the appropriate register, adding a MOV only if necessary.


Add a helper to do this in Platform.hs


The \t should be unnecessary

austin abandoned this revision. Aug 19 2014, 12:16 AM

Superseded by D165 and D166.