Invert FP conditions to eliminate the explicit NaN check.
ClosedPublic

Authored by AndreasK on Jul 19 2018, 7:11 PM.

Details

Summary

Optimisation: we don't have to test the parity flag if we
know the test has already excluded the unordered case: eg >
and >= test for a zero carry flag, which can only occur for
ordered operands.

By reversing comparisons we can avoid testing the parity
for < and <= as well. This works since:

  • If any of the arguments is an NaN CF gets set. Resulting in a false result.
  • Since this allows us to rule out NaN we can exchange the arguments and invert the direction of the arrows.
Test Plan

ci/nofib

Diff Detail

Repository
rGHC Glasgow Haskell Compiler
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
AndreasK created this revision.Jul 19 2018, 7:11 PM

At least imaginary/paraffins is still wrong with this patch.

But I will look into this in the next days.

AndreasK updated this revision to Diff 17406.Jul 20 2018, 3:24 AM
x < y == y > x instead of x < y == y >= x.

Don't write patches late at night ...

AndreasK retitled this revision from WIP: While the idea is sound there is still a bug in here. [skip ci] Invert FP conditions to reduce conditional instruction count. to WIP: While the idea is sound there is still a bug in here. [skip ci]Invert FP conditions to reduce conditional instruction count..Jul 20 2018, 9:25 AM
AndreasK edited the summary of this revision. (Show Details)
AndreasK updated this revision to Diff 17421.Jul 20 2018, 12:15 PM
  • Fix a leftover GTT/GE mixup.
AndreasK retitled this revision from WIP: While the idea is sound there is still a bug in here. [skip ci]Invert FP conditions to reduce conditional instruction count. to Invert FP conditions to eliminate the explicit NaN check..Jul 21 2018, 2:51 AM
AndreasK edited the summary of this revision. (Show Details)
AndreasK updated this revision to Diff 17422.Jul 21 2018, 4:43 AM
  • ASSERT that comparisons have been swapped as expected.
jmct added a subscriber: jmct.EditedJul 30 2018, 8:29 AM

This looks good to me. Ready for merge from my perspective.

AndreasK updated this revision to Diff 17612.Aug 9 2018, 8:01 AM
  • Remove space after assert

I'm not qualified to review this, but it looks like a step forward.

It needs a regression test though!

Thanks!

compiler/nativeGen/X86/CodeGen.hs
1370–1373

Perhaps say

-- Invert comparison condition and swap operands
-- See Note [SSE Parity Checks]
2955

Could you give an example or two to illustrate?

And refer to the Trac ticket

AndreasK updated this revision to Diff 17696.Aug 18 2018, 2:29 PM
  • Update note
  • Add regression test
AndreasK marked 2 inline comments as done.Aug 18 2018, 2:30 PM

Thanks for the feedback!

alpmestan accepted this revision.Aug 22 2018, 10:44 AM
alpmestan added a subscriber: alpmestan.

Out of curiosity, if this is a _performance_ optimisation, do we have any numbers we can leave here for posterity?

This revision is now accepted and ready to land.Aug 22 2018, 10:44 AM
AndreasK added a comment.EditedAug 22 2018, 11:35 AM

Out of curiosity, if this is a _performance_ optimisation, do we have any numbers we can leave here for posterity?

No numbers. I thought about constructing a benchmark but it's rather pointless.

The expected impact is rather small.

  • Usually it will allow the cpu to push through one more uOP (not instruction!) per iteration in loops.
  • If your lucky with instruction sheduling it will save a full cycle per iteration.
  • If your really lucky the change frees up resources in the branch predictor/cache avoiding a cache miss or branch missprediction.
  • If your unlucky then the old code was some sort of local optimum and performance will get slightly worse. Either adding a <= 1 cycle per iteration or more in extrem cases where you get unlucky with alignment/cache.
  • Or you are bound by FP execution ports anyway and it doesn't make a difference at all.

For an example I had a case where I optimized away a jump in an 8 instruction loop which made performance worse.
Turned out removing the jump meant the instructions were no longer perfectly distributed over the CPU's execution ports leading to worse
utilization.

While we can construct synthetic benchmarks to show the above with a lot of effort, it's neither worth it nor would it be representative of the expected impact in real code.

That was my suspicion, thanks for clarifying!

This revision was automatically updated to reflect the committed changes.