WIP: Implement cmov support.
Changes PlannedPublic

Authored by AndreasK on Jun 11 2018, 1:24 PM.

Details

Reviewers
jmct
bgamari
simonmar
Trac Issues
#15258
Summary

Find branches which can be eliminated by using cmov and do so.
We do this by looking for control flow splits where both branches
are only doing assignments. If these can reasonably
be translated into one block using cmovs then do so.

For this we introduce a new Cmm Node CmmCondAssign.
CmmCondAssign contains:

  • The condition expression
  • A list of registers and the expressions they should be assigned for each branch.
  • We maintain the invarant that when doing the assignments in sequence that there can be no true dependency.

This is not finished yet. Among other things:

  • This was only test on x64, and probably also only works there. So I still have to add checks for other platforms and implement codeGen for llvm.
  • There is still tracing code in here.
  • This was written with compiler performance at most being an afterthought. So I expect compiler performance to suffer somewhat.
  • We might limit cmov transformation to fewer cases than neccesary.
  • In particular we only allow non floating point comparisons as the condition. (To avoid the unordered number nastiness)
  • We don't limit the length of the branchs which we probably should. At some point calculating more expression will be more expensive than the occasional pipeline stall.
  • We don't take advantage of the fact that cmov can take a memory operand.

Diff Detail

Repository
rGHC Glasgow Haskell Compiler
Branch
cmov
Lint
Lint WarningsExcuse: ...
SeverityLocationCodeMessage
Warningcompiler\cmm\CmmCmov.hs:145TXT3Line Too Long
Warningcompiler\cmm\CmmCmov.hs:206TXT3Line Too Long
Warningcompiler\cmm\PprCmm.hs:226TXT3Line Too Long
Unit
No Unit Test Coverage
Build Status
Buildable 21207
Build 47693: [GHC] Linux/amd64: Continuous Integration
Build 47692: [GHC] OSX/amd64: Continuous Integration
Build 47691: [GHC] Windows/amd64: Continuous Integration
Build 47690: arc lint + arc unit
AndreasK created this revision.Jun 11 2018, 1:24 PM
AndreasK added a comment.EditedJun 12 2018, 4:33 AM

There is a bit of noise in the results but here are the nofib measurements.
Individual benchmarks listed are these where cmov was generated and
it made a difference.

 fannkuch-redux           0.0%      0.0%     -0.6%     -0.6%      0.0%
          fasta          -0.0%      0.0%     -2.4%     -2.5%      0.0%
         lambda          +0.0%      0.0%     -2.5%     -2.5%      0.0%
           mate          -0.0%      0.0%     +2.4%     +2.4%      0.0%
    constraints           0.0%      0.0%     -0.9%     -0.9%      0.0%
    exact-reals           0.0%      0.0%     -0.0%     -0.1%      0.0%
--------------------------------------------------------------------------------
            Min          -0.0%     -0.0%     -5.5%     -5.5%    -15.5%
            Max          +0.0%     +0.0%     +2.4%     +2.4%     +2.0%
 Geometric Mean          -0.0%     -0.0%     -0.4%     -0.4%     -0.1%

Both the -15% mem usage and the -5.5% result seem to be the result from noise as I couldn't reproduce them.

For the listed individual benchmarks the general trend remains when rerun even if the numbers change slightly.

AndreasK updated this revision to Diff 16847.Jun 12 2018, 4:37 AM
  • Refactor CmmCondAssign type.

Given that the numbers of expressions and target registers will always match
[(reg,expr)] is a far better fit.

AndreasK updated this revision to Diff 16852.Jun 12 2018, 7:20 AM
  • Some refactoring.
  • Also tackle branching code where one branch is empty.
  • Allocations are still far too high so there is some need for optimization.
  • Currently only deals with integer conditions
  • Assignments which read from memory should be directly translated to cmovs. Currently they also require an intermediate register.
AndreasK planned changes to this revision.Jun 12 2018, 7:23 AM
AndreasK added a comment.EditedJun 14 2018, 7:26 AM

More things to consider:

  • CMOV doesn't support 8 bit operands.
  • CMOV with a memory address does NOT perform a conditional load. But a load followed from a conditional move.
    • Some simple cases could be better encoded using things like setcc.