On Ivy Bridge and newer processors (i.e. 3rd generation Core and later), the x86 instruction rep movsb has become shockingly fast, almost as fast as a hand-written and tuned AVX memcpy primitive. These changes add support for -march and -mcpu flags to GHC (mimicking the semantics of GCC's flags), and when -mcpu or -march is set to Ivy Bridge or later, these instructions are implicitly used for all copies.
Note carefully that ERMSB (Enhanced REP MOVSB/STOSB) support is controlled by -mcpu and not -march. -mcpu technically only affects things like instruction scheduling and selection, while -march controls which instructions are available (as they might not exist on older platforms). rep movsb isn't a new instruction; it's just faster now. So we select based on the CPU tuning selection, not based on whether we can use any Ivy Bridge instructions in general.
In particular, applications can use this with full backwards compatibility, although copies may be slower on pre-Ivy Bridge machines.
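For readers unfamiliar with the instruction, here is a minimal sketch in C of what a rep movsb copy amounts to. It is an illustration only, not the code GHC's backend emits: copy_rep_movsb is a hypothetical name, the inline assembly assumes GCC or Clang on x86, and other targets fall back to memcpy.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of GHC): copy n bytes from src to dst
 * using `rep movsb` where available. Regions must not overlap. */
static void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
    void *ret = dst;
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    /* rep movsb copies (R|E)CX bytes from [(R|E)SI] to [(R|E)DI].
     * The instruction modifies all three registers, so each one is a
     * read-write operand. The ABI guarantees the direction flag is
     * clear on function entry, so the copy runs forwards. */
    __asm__ volatile ("rep movsb"
                      : "+D"(dst), "+S"(src), "+c"(n)
                      :
                      : "memory");
#else
    /* Assumption: on non-x86 targets we simply defer to memcpy. */
    memcpy(dst, src, n);
#endif
    return ret;
}
```

The instruction has been part of x86 from the beginning, which is why a binary using it runs on any x86 machine; only the throughput differs between microarchitectures.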
This doesn't add support for memset yet (rep stosb), but that can come later.
- Add support for memset via rep stosb
- Rework the names to be more generic (cc @carter)
- Add some should_gen_asm tests.
- Benchmarks. unordered-containers is copy-sensitive, so it makes a good candidate.