Handle the likely:True case in CmmContFlowOpt
ClosedPublic

Authored by AndreasK on Jan 14 2018, 1:56 PM.

Details

Summary

It's better to fall through to the likely case than to jump to it.

We optimize for this in CmmContFlowOpt when likely:False.
This commit extends the logic there to handle cases with likely:True
as well.

Test Plan

ci

Diff Detail

Repository
rGHC Glasgow Haskell Compiler
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
AndreasK created this revision.Jan 14 2018, 1:56 PM

Can we add some representative nofib changes to the commit message?

IIRC the only place where likeliness is currently used is in the heap/stack-checks (c.f. https://github.com/ghc/ghc/blob/master/compiler/codeGen/StgCmmHeap.hs#L628). Though these assume the condition won't evaluate to true (== Just False) so these won't trigger here.

IIRC the only place where likeliness is currently used is in the heap/stack-checks (c.f. https://github.com/ghc/ghc/blob/master/compiler/codeGen/StgCmmHeap.hs#L628). Though these assume the condition won't evaluate to true (== Just False) so these won't trigger here.

I believe @AndreasK has been looking into adding likelihoods in additional places.

IIRC the only place where likeliness is currently used is in the heap/stack-checks (c.f. https://github.com/ghc/ghc/blob/master/compiler/codeGen/StgCmmHeap.hs#L628). Though these assume the condition won't evaluate to true (== Just False) so these won't trigger here.

Compiling this:

{-# NOINLINE f_test #-}
f_test :: Int ->Int
f_test a = case a of
    1  -> 11001
    2  -> 11002
    _  -> -1


main :: IO ()
main = do
    print . sum . map (f_test) $ (concat . replicate 9000000) [1..45::Int]

Generates two likely: True branches for me

...
c4pb: // global
           _s4nA::I64 = R3;   // CmmAssign
           _s4nz::P64 = R2;   // CmmAssign
           if ((Sp + -24) >= SpLim) (likely: True) goto c4p1; else goto c4pc;   // CmmCondBranch
...

Very possible these just happen because of Cmm transformations but either way in the example code above it has a significant performance impact on my machine.

Can we add some representative nofib changes to the commit message?

I will try to run it in the next few days.

@AndreasK Interesting! I never found a a way to exercise this transformation. It is part of CmmContFlowOpt (see https://github.com/ghc/ghc/blob/master/compiler/cmm/CmmContFlowOpt.hs#L291).

Btw. the conditions for the optimization to kick in seem overly restrictive to me:

, numPreds f > 1
, hasOnePredecessor t

Why would the false branch need more than one predecessor while the true branch is fine with exactly one?

AndreasK added a comment.EditedJan 17 2018, 2:37 AM

Can we add some representative nofib changes to the commit message?

Edit: Turns out I had uncommited changes in the working dir. I will make another run ...

Full nofib run with NoFibRuns=15

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
             CS          +0.0%      0.0%     0.106     0.101      0.0%
            CSD          +0.0%      0.0%     +0.4%     +0.9%      0.0%
             FS          +0.0%      0.0%     -5.6%     -4.8%      0.0%
              S          +0.0%      0.0%     -0.8%     +0.2%      0.0%
             VS          +0.0%      0.0%     -0.9%     -0.8%      0.0%
            VSD          +0.0%      0.0%     0.005     0.006      0.0%
            VSM          +0.0%      0.0%     +1.1%     +1.8%      0.0%
           anna          +0.0%      0.0%     0.048     0.049      0.0%
           ansi          +0.0%      0.0%     0.000     0.000      0.0%
           atom          +0.0%      0.0%     0.119     0.117      0.0%
         awards          +0.0%      0.0%     0.004     0.000      0.0%
         banner          +0.0%      0.0%     0.003     0.001      0.0%
     bernouilli          +0.0%      0.0%     0.076     0.076      0.0%
   binary-trees          +0.0%      0.0%     +1.2%     -0.2%      0.0%
          boyer          +0.0%      0.0%     0.017     0.018      0.0%
         boyer2          +0.0%      0.0%     0.006     0.003      0.0%
           bspt          +0.0%      0.0%     0.003     0.004      0.0%
      cacheprof          +0.0%     -0.1%     0.155     0.181     +0.6%
       calendar          +0.0%      0.0%     0.001     0.000      0.0%
       cichelli          +0.0%      0.0%     0.032     0.032      0.0%
        circsim          +0.0%      0.0%     +0.8%     +0.0%      0.0%
       clausify          +0.0%      0.0%     0.013     0.014      0.0%
  comp_lab_zift          +0.0%      0.0%     0.079     0.078      0.0%
       compress          +0.0%      0.0%     0.059     0.076      0.0%
      compress2          +0.0%      0.0%     0.055     0.079      0.0%
    constraints          +0.0%      0.0%     +1.1%     +0.5%      0.0%
   cryptarithm1          +0.0%      0.0%     -3.3%     -2.2%      0.0%
   cryptarithm2          +0.0%      0.0%     0.002     0.003      0.0%
            cse          +0.0%      0.0%     0.001     0.001      0.0%
   digits-of-e1          +0.0%      0.0%     +1.5%     +2.4%      0.0%
   digits-of-e2          +0.0%      0.0%     +0.3%     +1.1%      0.0%
          eliza          +0.0%      0.0%     0.000     0.001      0.0%
          event          +0.0%      0.0%     0.059     0.062      0.0%
    exact-reals          +0.0%      0.0%     +0.4%     +0.2%      0.0%
         exp3_8          +0.0%      0.0%     0.112     0.113      0.0%
         expert          +0.0%     -0.1%     0.002     0.001      0.0%
 fannkuch-redux          +0.0%      0.0%     +0.4%     +0.3%      0.0%
          fasta          +0.0%      0.0%     +1.7%     +1.6%      0.0%
            fem          +0.0%      0.0%     0.009     0.010      0.0%
            fft          +0.0%      0.0%     0.019     0.015      0.0%
           fft2          +0.0%      0.0%     0.026     0.025      0.0%
       fibheaps          +0.0%      0.0%     0.011     0.011      0.0%
           fish          +0.0%      0.0%     0.000     0.007      0.0%
          fluid          +0.0%      0.0%     0.004     0.004      0.0%
         fulsom          +0.0%      0.0%     0.124     0.126      0.0%
         gamteb          +0.0%      0.0%     0.016     0.018      0.0%
            gcd          +0.0%      0.0%     0.022     0.018      0.0%
    gen_regexps          +0.0%      0.0%     0.001     0.000      0.0%
         genfft          +0.0%      0.0%     0.013     0.013      0.0%
             gg          +0.0%      0.0%     0.012     0.005      0.0%
           grep          +0.0%      0.0%     0.001     0.000      0.0%
         hidden          +0.0%      0.0%     0.160     0.166      0.0%
            hpg          +0.0%      0.0%     0.079     -0.6%      0.0%
            ida          +0.0%      0.0%     0.035     0.037      0.0%
          infer          +0.0%      0.0%     0.021     0.023      0.0%
        integer          +0.0%      0.0%     +0.6%     -0.0%      0.0%
      integrate          +0.0%      0.0%     0.031     0.045      0.0%
   k-nucleotide          +0.0%      0.0%     +1.8%     +1.4%      0.0%
          kahan          +0.0%      0.0%     0.146     0.145      0.0%
        knights          +0.0%      0.0%     0.001     0.002      0.0%
         lambda          +0.0%      0.0%     -1.8%     -1.2%      0.0%
     last-piece          +0.0%      0.0%     -0.8%     -0.1%      0.0%
           lcss          +0.0%      0.0%     0.171     0.184      0.0%
           life          +0.0%      0.0%     0.102     0.102      0.0%
           lift          +0.0%      0.0%     0.001     0.001      0.0%
         linear          +0.0%      0.0%     -1.5%     -1.3%      0.0%
      listcompr          +0.0%      0.0%     0.040     0.051      0.0%
       listcopy          +0.0%      0.0%     0.031     0.055      0.0%
       maillist          +0.0%      0.0%     0.055     +0.2%      0.0%
         mandel          +0.0%      0.0%     0.027     0.029      0.0%
        mandel2          +0.0%      0.0%     0.002     0.002      0.0%
           mate          +0.0%      0.0%     +1.4%     +1.5%      0.0%
        minimax          +0.0%      0.0%     0.003     0.001      0.0%
        mkhprog          +0.0%      0.0%     0.001     0.001      0.0%
     multiplier          +0.0%      0.0%     0.038     0.044      0.0%
         n-body          +0.0%      0.0%     -0.2%     +0.2%      0.0%
       nucleic2          +0.0%      0.0%     0.029     0.030      0.0%
           para          +0.0%      0.0%     0.114     0.145      0.0%
      paraffins          +0.0%      0.0%     0.047     0.052      0.0%
         parser          +0.0%      0.0%     0.016     0.014      0.0%
        parstof          +0.0%      0.0%     0.002     0.003      0.0%
            pic          +0.0%      0.0%     0.007     0.004      0.0%
       pidigits          +0.0%      0.0%     +0.7%     -0.4%      0.0%
          power          +0.0%      0.0%     0.162     0.160      0.0%
         pretty          +0.0%      0.0%     0.000     0.000      0.0%
         primes          +0.0%      0.0%     0.032     0.033      0.0%
      primetest          +0.0%      0.0%     0.046     0.046      0.0%
         prolog          +0.0%      0.0%     0.002     0.001      0.0%
         puzzle          +0.0%      0.0%     0.055     0.054      0.0%
         queens          +0.0%      0.0%     0.005     0.007      0.0%
        reptile          +0.0%      0.0%     0.005     0.006      0.0%
reverse-complem          +0.0%      0.0%     0.056     0.104      0.0%
        rewrite          +0.0%      0.0%     0.003     0.006      0.0%
           rfib          +0.0%      0.0%     0.009     0.005      0.0%
            rsa          +0.0%      0.0%     0.011     0.011      0.0%
            scc          +0.0%      0.0%     0.000     0.000      0.0%
          sched          +0.0%      0.0%     0.011     0.008      0.0%
            scs          +0.0%      0.0%     -4.3%     -0.1%      0.0%
         simple          +0.0%      0.0%     0.081     0.085      0.0%
          solid          +0.0%      0.0%     0.056     0.055      0.0%
        sorting          +0.0%      0.0%     0.002     0.001      0.0%
  spectral-norm          +0.0%      0.0%     -0.2%     +0.1%      0.0%
         sphere          +0.0%      0.0%     0.020     0.021      0.0%
         symalg          +0.0%      0.0%     0.011     0.006      0.0%
            tak          +0.0%      0.0%     0.007     0.004      0.0%
      transform          +0.0%      0.0%     0.141     0.146      0.0%
       treejoin          +0.0%      0.0%     0.058     0.066      0.0%
      typecheck          +0.0%      0.0%     0.111     0.111      0.0%
        veritas          +0.0%      0.0%     0.001     0.001      0.0%
           wang          +0.0%      0.0%     0.039     0.040      0.0%
      wave4main          +0.0%      0.0%     0.096     0.103      0.0%
   wheel-sieve1          +0.0%      0.0%     -1.4%     -1.1%      0.0%
   wheel-sieve2          +0.0%      0.0%     0.085     0.094      0.0%
           x2n1          +0.0%      0.0%     0.001     0.002      0.0%
--------------------------------------------------------------------------------
            Min          +0.0%     -0.1%     -5.6%     -4.8%      0.0%
            Max          +0.0%      0.0%     +1.8%     +2.4%     +0.6%
 Geometric Mean          +0.0%     -0.0%     -0.3%     -0.0%     +0.0%
AndreasK edited the summary of this revision. (Show Details)Jan 17 2018, 7:51 AM
bgamari accepted this revision.Jan 21 2018, 10:56 AM
bgamari added a reviewer: simonmar.
bgamari added a subscriber: simonmar.

Looks reasonable to me.

Any objections, @simonmar?

This revision is now accepted and ready to land.Jan 21 2018, 10:56 AM

Not sure about this. We already have code to do something similar here:

https://phabricator.haskell.org/diffusion/GHC/browse/master/compiler/cmm/CmmContFlowOpt.hs;24e56ebd010846683b236b6ef3678c2217640120$286-300

and indeed the version in ContFlowOpt is better because it takes into account the number of predecessors of the branch targets. This version could end up making things worse if the true branch doesn't turn into a fallthrough, which might be the case if it has multiple predecessors.

Not sure about this. We already have code to do something similar here:

https://phabricator.haskell.org/diffusion/GHC/browse/master/compiler/cmm/CmmContFlowOpt.hs;24e56ebd010846683b236b6ef3678c2217640120$286-300

and indeed the version in ContFlowOpt is better because it takes into account the number of predecessors of the branch targets.

If we have a single conditional if (cond) (likely: True) goto c1oP; else goto c1oQ the linked optimization can't kick in (because it's likelyTrue) no matter what.

This version could end up making things worse if the true branch doesn't turn into a fallthrough, which might be the case if it has multiple predecessors.

I have to admit I didn't consider this.

We should still at least invert the conditions if there are no other predecessor for a conditional with likely True.

By the looks of it the last optimization pass we run is the second control flow optimization pass, so there should be no risk of it being swapped again by other optimizations as well.

I've looked into it a bit more.

swapcond_last should improve code where there is a contention between a unconditional jump and a conditional.
But it also makes contention between two conditionals worse.

Has anyone actually meassured the impact it has on performance?

As the only place where we currently have annotations is Stack/Heap checks which are very unlikely to violate the
expected result so inverting these wouldn't be much different than removing the fall through from a unconditional
jump.

Example below

foo() {
e:
	if (R1 == 1) (likely: True) { goto l1; };
	if (R1 == 2) (likely: False) { goto l2; } else { goto l1;};
l1:
	R2 = 111;
	return();
l2:
	R2 = 222;
	return();
}

This currently falls through to l2. The least likely case to happen.

        cmpq $1,%rbx
        je _c8

        cmpq $2,%rbx
        jne _c8

        movl $222,%r14d
        jmp *(%rbp)
_c8:
        movl $111,%r14d
        jmp *(%rbp)

In the past swapcond_last only applied to conditions without likeliness annotations. I think applying it to Likely: False as well was a oversight.

Yes, I think swapcond_last is actually wrong, thanks for pointing it out. Looks like it was changed in D2688, cc @alexbiehl

I propose this condition:

| CmmCondBranch cond t f l <- shortcut_last
, hasOnePredecessor t -- inverting will make t a fallthrough
, likelyTrue l || (isNothing l && numPreds f > 1)
, Just cond' <- maybeInvertCmmExpr cond
= CmmCondBranch cond' f t (invertLikeliness l)

To paraphrase, we'll invert the condition if:

  • the inverted condition will be a fallthrough, and
  • the true branch is likely (so we make the likely branch the fallthrough) OR we have no likeliness information and the false branch might not be a fallthrough (so we make the code simpler by creating a fallthrough)

And to answer your question,

Has anyone actually meassured the impact it has on performance?

when I added this optimisation I think I checked its effect on code size but not performance, because I was mostly concerned with getting code out of the new code generator that was equivalent to the old code generator, and this was one case where we were worse. It certainly wouldn't hurt to measure it.

Comparison with transform on False (baseline) and transform on Nothing

. Run with NoFibRuns=30.

        Program           Size    Allocs   Runtime   Elapsed  TotalMem
-----------------------------------------------------------------------
            Min           0.0%     -0.0%     -2.7%     -1.3%     -0.6%
            Max          +0.0%      0.0%     +1.8%     +1.9%     +0.8%
 Geometric Mean          +0.0%     -0.0%     -0.1%     +0.2%     +0.0%

However I've run these on windows and the RTS timings are a bit wonky there for the faster benchmarks so it's neither here nor there.

Yes, I think swapcond_last is actually wrong, thanks for pointing it out. Looks like it was changed in D2688, cc @alexbiehl

I propose this condition:

| CmmCondBranch cond t f l <- shortcut_last
, hasOnePredecessor t -- inverting will make t a fallthrough
, likelyTrue l || (isNothing l && numPreds f > 1)
, Just cond' <- maybeInvertCmmExpr cond
= CmmCondBranch cond' f t (invertLikeliness l)

To paraphrase, we'll invert the condition if:

  • the inverted condition will be a fallthrough, and
  • the true branch is likely (so we make the likely branch the fallthrough) OR we have no likeliness information and the false branch might not be a fallthrough (so we make the code simpler by creating a fallthrough)

Going by the nofib run I made using likelyFalse instead of isNothing would also be ok. But the difference seems small enough that it's pretty much a tossup and could be the other way around on a different CPU.

Other thoughts:

  • If we have >1 pred on both paths I think it would be better to have the likely path take the first jump.

That way we have:

cmp
jmp taken

instead of

cmp
jmp not taken
jmp taken

on the hot path. With some luck one of the other predecessors is a unconditional goto anyway so the fall through wouldn't disappear as well.

AndreasK updated this revision to Diff 15221.Jan 25 2018, 11:21 AM
  • Move inversion of conditional for likely:True to CmmContFlowOpt

I've removed the isNothing condition.

  • For Nothing we want to invert anyway.
  • For (Just False) it seems the code is about as fast or faster (see nofib above) if we invert (and the code gets smaller as well).

I haven't looked into the case where both have multiple predecessors but added it to the note. If GHC ends up making more use of the likely information in the Future it is probably worth revisiting.

AndreasK retitled this revision from Respect the Cmm likely flag in the native x86 codegen. to Handle the likely:True case in CmmContFlowOpt.Jan 25 2018, 12:16 PM
AndreasK edited the summary of this revision. (Show Details)
This revision was automatically updated to reflect the committed changes.
This comment was removed by AndreasK.