WIP: Add likelyhood to alternatives from stg onwards
Needs ReviewPublic

Authored by AndreasK on Jan 19 2018, 11:43 AM.

Details

Summary

Adds a Freq value to Stg/Cmm cases/switches/conditionals.

Currently only generates these values by checking alternatives for
bottom expressions.

They are passed along to the backend where they affect conditional generation
slightly.

As it stands runtime improvements seem to be less than expected. This might only be worth merging once we have more branch weights available.

Diff Detail

Repository
rGHC Glasgow Haskell Compiler
Branch
likelyStg
Lint
Lint WarningsExcuse: wip
SeverityLocationCodeMessage
Warningcompiler\cmm\CmmSwitch.hs:27TXT3Line Too Long
Warningcompiler\cmm\CmmSwitch.hs:559TXT3Line Too Long
Warningcompiler\cmm\CmmSwitch.hs:572TXT3Line Too Long
Warningcompiler\cmm\CmmSwitch.hs:576TXT3Line Too Long
Warningcompiler\coreSyn\CorePrep.hs:200TXT3Line Too Long
Warningcompiler\coreSyn\CorePrep.hs:220TXT3Line Too Long
Warningcompiler\simplCore\HotCode.hs:65TXT3Line Too Long
Warningcompiler\simplCore\HotCode.hs:126TXT3Line Too Long
Warningcompiler\simplCore\HotCode.hs:136TXT3Line Too Long
Warningcompiler\stgSyn\CoreToStg.hs:652TXT3Line Too Long
Unit
No Unit Test Coverage
Build Status
Buildable 20828
Build 46218: [GHC] Linux/amd64: Continuous Integration
Build 46217: [GHC] OSX/amd64: Continuous Integration
Build 46216: [GHC] Windows/amd64: Continuous Integration
Build 46215: arc lint + arc unit
AndreasK created this revision.Jan 19 2018, 11:43 AM

Thanks for trying this. If that 0.8% is real, it's pretty good.

I'm really quite reluctant to turn Core case alternatives into a 4-tuple. As you found, it's a pretty invasive change. STG is another matter; I'm fine with that.

Since we have exprIsBottom why not use it to identify infrequent branches in the core-to-STG pass? That way Core is unchanged and the patch is far far less invasive.

compiler/coreSyn/CoreSyn.hs
271

What is this comment doing?

AndreasK marked an inline comment as done.EditedJan 25 2018, 8:09 AM

Thanks for trying this. If that 0.8% is real, it's pretty good.

I'm really quite reluctant to turn Core case alternatives into a 4-tuple. As you found, it's a pretty invasive change. STG is another matter; I'm fine with that.

Since we have exprIsBottom why not use it to identify infrequent branches in the core-to-STG pass? That way Core is unchanged and the patch is far far less invasive.

Pushing the use of exprIsBottom to find unlikely branches back to the Core-to-STG pass sounds better.

One idea I have is using pragmas to inform the compiler about likely branches.

case checkInput of
  ParseError {-# CHANCE rare -} 	-> printSuggestions
  ValueError {-# CHANCE unlikely -}     -> printConstraints
  Valid      {-# CHANCE likely -} 	-> runComputation

So far I couldn't think of a better way to preserve this information from HsSyn to the backend without putting it into Core itself.

But it's probably fine to do that in another patch.

Edit: The (strange) comment and the TODOFs are a result of work in Progress.
I used the comment to store the runtime of compilation. TODOF's are places I want to take another look before I consider this finished, as much of what I did was making the types match.

Pushing the use of exprIsBottom to find unlikely branches back to the Core-to-STG pass sounds better.

One idea I have is using pragmas to inform the compiler about likely branches.

case checkInput of
  ParseError {-# CHANCE rare -} 	-> printSuggestions
  ValueError {-# CHANCE unlikely -}     -> printConstraints
  Valid      {-# CHANCE likely -} 	-> runComputation

Have you considered how these annotations would behave under simplification (for instance, case-of-case)? It's not entirely clear to me how this should work.

Pushing the use of exprIsBottom to find unlikely branches back to the Core-to-STG pass sounds better.

One idea I have is using pragmas to inform the compiler about likely branches.

case checkInput of
  ParseError {-# CHANCE rare -} 	-> printSuggestions
  ValueError {-# CHANCE unlikely -}     -> printConstraints
  Valid      {-# CHANCE likely -} 	-> runComputation

Have you considered how these annotations would behave under simplification (for instance, case-of-case)? It's not entirely clear to me how this should work.

Case of case seems to be the rather unproblematic as we simply push the whole case into the alternatives. So we can just preserve the chance values.

Harder questions are things like how to desugar nested patterns like Right (Just True) .
There are multiple ways to interpret this annotation. Is it an error? Do we attribute it to all three matches? The first one? Something more complicated?

This is by no means a trivial problem. But I think it's very possible.
And since it's only hints there is always the option to drop the information for situations where we can't make heads or tails of how to handle it.

But that's still a good deal of work and probably more on the timescale of GHC 8.8 if even that fast.

Pushing the use of exprIsBottom to find unlikely branches back to the Core-to-STG pass sounds better.

One idea I have is using pragmas to inform the compiler about likely branches.

case checkInput of
  ParseError {-# CHANCE rare -} 	-> printSuggestions
  ValueError {-# CHANCE unlikely -}     -> printConstraints
  Valid      {-# CHANCE likely -} 	-> runComputation

Have you considered how these annotations would behave under simplification (for instance, case-of-case)? It's not entirely clear to me how this should work.

Case of case seems to be the rather unproblematic as we simply push the whole case into the alternatives. So we can just preserve the chance values.

Harder questions are things like how to desugar nested patterns like Right (Just True) .
There are multiple ways to interpret this annotation. Is it an error? Do we attribute it to all three matches? The first one? Something more complicated?

This is by no means a trivial problem. But I think it's very possible.
And since it's only hints there is always the option to drop the information for situations where we can't make heads or tails of how to handle it.

But that's still a good deal of work and probably more on the timescale of GHC 8.8 if even that fast.

It feels to me like this should be a Core annotation, for which we have the Tick form in Core. You could add a Tickish for likelihood, and then extract the information later in CoreToSTG to reconstruct the likelihood of branches. If we treat these annotations like SourceNotes, they would have no impact on optimisation and we just do a best-effort attempt to keep them consistent through simplifications.

AndreasK updated this revision to Diff 15265.Jan 29 2018, 5:11 AM
  • Move likeliness out of core for now.
  • Remove more TODOF's.
AndreasK retitled this revision from WIP: Add likelyhood to alternatives throughout all passes.. to WIP: Add likelyhood to alternatives from stg onwards.Jan 29 2018, 5:12 AM
AndreasK edited the summary of this revision. (Show Details)
simonpj accepted this revision.Jan 29 2018, 10:56 AM

Generally fine, thanks, but a few more signposts would be really helpful.

compiler/basicTypes/BasicTypes.hs
1615

Make this a Note

1626

Can you point to the moving parts? That is:

  • STG case alternatives get a Freq field
  • Some function (which?) uses the Freq to determine what code to generate. In that function, wherever it is, it'd be good to explain the strategy about how it uses the frequencies.
  • Refer to the Trac ticket

Lots of other functions get no-op modifications to pass Freq along, but it's very helpful to list the key places where things change.

1639

Refer to the Note from here

compiler/basicTypes/MkId.hs
360–366

Changes in here are just layout, which I don't object to, but would be better in a separate patch

compiler/cmm/CmmSwitch.hs
109

This is orthogonal, I know, but whenever a data constructor gets more than three (or even two) arguments I think it's extremely helpful to start using named fields.

compiler/stgSyn/CoreToStg.hs
475

Crumbs! This is an utterly key moment -- the unique place where you attach frequencies -- but has no commentary whatsoever. At least Freq overview Note, and explain that (for now at least) we are using a very simple strategy, namely identifying bottoming expressions.

compiler/stgSyn/StgSyn.hs
550

Point to the Freq overview Note

This revision is now accepted and ready to land.Jan 29 2018, 10:56 AM

Generally fine, thanks, but a few more signposts would be really helpful.

Will do. I appreciate the code review and that's why I put the diff up early!

Changes in here are just layout, which I don't object to, but would be better in a separate patch.

On my machine, for whatever reason, arc often fails to push to staging when
there are lint warnings. So this was primarily a workaround for shortcomings
of arc.

It feels to me like this should be a Core annotation, for which we have the Tick form in Core. You could add a Tickish for likelihood, and then extract the information later in CoreToSTG to reconstruct the likelihood of branches. If we treat these annotations like SourceNotes, they would have no impact on optimisation and we just do a best-effort attempt to keep them consistent through simplifications.

That is a good point.


For now my plan for this diff is:

  • Rerun nofib to make sure this improves things
  • Clean up the code
  • Get this into master

For changes beyond that I will open a new diff if I get to work on them.

Bumping out of the review q

Bumping out of the review q

bgamari requested changes to this revision.Jan 31 2018, 11:31 PM

For now my plan for this diff is:

Bumping out of the merge queue while this happens.

This revision now requires changes to proceed.Jan 31 2018, 11:31 PM
AndreasK marked 6 inline comments as done.Feb 10 2018, 5:18 PM
AndreasK added inline comments.
compiler/basicTypes/MkId.hs
360–366

I will try to get arc to push without these.

For now I will leave it be until it's about to land or I have some more numbers on performance.

AndreasK updated this revision to Diff 15427.Feb 11 2018, 3:47 AM
  • Remove all TODOF's.
  • Incorporate feedback from Phab review.
  • Make Freq a newtype and rename it to BranchWeight.
  • Output weights as llvm metadata
AndreasK updated this revision to Diff 15429.Feb 11 2018, 4:01 AM
  • Make unlikely bottoms a flag
  • Rename Freq to BranchWeight, Add Note, Make Weight newtype.
  • Expanded comments some more
  • Lint/Wall fixes
  • Remove last todofs
  • Lint suggestions applied
  • Rebased on master
AndreasK updated this revision to Diff 15430.Feb 11 2018, 4:12 AM
  • Remove unneccesary llvm metadata constructors
AndreasK updated this revision to Diff 15528.Feb 21 2018, 3:02 PM

Use branche weights when constructing a plan from a SwitchPlan.

  • Add branch weight to switchplan, this allows us to check for the heavier brance.
  • FindSingleValue checks for likely branches.

Updating the diff for review purposes only currently. [skip-ci]

I've run nofib overnight and the results were a bit underwhelming.
As it stands it's most likely not worth merging.

While it can improve edge cases a lot, overall the impact seems the be almost negligible on nofib.
I'm not sure if that is because of the nature of nofib or if it's just generally not worth it yet.

One thing I noticed is that quiet often we generate good code already "by accident" since things
like pattern match failures are usually jumped to by multiple branches hence it naturally is made
a jump instead of a fall through during Control flow optimization.

Elapsed Time
--------------------------------------------------------------------------------------------------------------------------------
                               unmodified1      unmodified2      unmodified3      unliklyBots1     unliklyBots2     unliklyBots3
--------------------------------------------------------------------------------------------------------------------------------
        -1 s.d.                -----            -0.5%            -0.1%            -0.5%            -0.4%            -0.6%
        +1 s.d.                -----            +1.1%            +0.3%            +0.5%            +0.5%            +0.6%
        Average                -----            +0.3%            +0.1%            -0.0%            +0.0%            -0.0%

These were taken with NoFibRuns=55 and three runs each so that I get an idea of how much noise there is. (Which seems to be very little)

It's less than I had hoped for, but two patched runs where faster than all unmodified runs, with the slowest one still beating two.
So there seems to be a speedup but it's almost insignificant for the NoFib suite.

"van" were runs with unlikely bottoms turned off, while the "bot" runs had them turned on.

Program            van     van2    van3    bot     bot2    bot3
lambda           0,612  -0,40%   0,00%   0,80%   0,80%   0,90%
fasta            0,337   0,30%   0,80%   0,80%   0,40%   0,80%
exact-reals      1.401   0,40%   0,10%   0,70%   1,00%   1,00%
linear           1.187  -0,10%   0,20%   0,60%   0,40%   0,20%
last-piece       0,247   1,00%   0,00%   0,30%   0,10%   0,20%
VS                0,73   0,10%   0,20%   0,20%   0,30%   0,20%
digits-of-e1      0,29   0,00%   0,00%   0,10%   0,10%   0,20%
wheel-sieve1     0,224   1,30%   0,10%   0,10%   0,10%   0,10%
n-body           1.081   0,00%   0,00%   0,10%   0,00%   0,00%
S                0,534   0,00%   0,10%   0,10%   0,00%   0,00%
digits-of-e2     0,427   0,20%   0,10%   0,10%  -0,10%   0,00%
pidigits         0,269  -0,20%   0,20%   0,10%  -0,30%   0,50%
mate             2.043   0,00%  -0,10%   0,00%   0,20%   0,20%
integer           0,93   0,50%   0,20%   0,00%   0,10%   0,10%
FS                0,23   0,00%   0,00%   0,00%   0,10%   0,10%
CSD               0,27   0,20%   0,10%   0,00%   0,00%  -0,10%
k-nucleotide     4.004   0,00%   0,00%   0,00%  -0,10%   0,10%
constraints      1.291   1,60%  -0,20%   0,00%  -0,40%  -0,40%
circsim          0,428   3,80%  -0,20%  -0,10%   0,20%   0,20%
fannkuch-redux   2.522  -0,10%   0,00%  -0,10%  -0,10%   0,00%
spectral-norm    0,464   0,00%   0,00%  -0,10%  -0,10%   0,00%
VSM              0,551   0,00%   0,10%  -0,30%   0,00%   0,00%
binary-trees     0,357  -0,30%  -0,20%  -0,30%  -0,10%   0,00%
scs              0,219   0,00%  -0,20%  -0,30%  -0,30%  -0,10%
maillist         0,418   0,00%   0,30%  -0,30%  -0,40%  -2,00%
hpg              0,397  -0,20%   0,50%  -0,90%   0,20%  -0,30%
cryptarithm1      0,25   0,60%  -0,20%  -1,90%  -1,70%  -1,90%

I assume the loses are from code alignment changes.

In particular I looked at lambda which runs two independent functions after each other.
After removing either one the patched code runs faster.

This was run using Flavour=Quick and libraries built with unlikely bottoms.

For comparison the Program below shows a speedup of >10% between the flag on and off (which seems to originate in the toEnum code for the most part). 1)

-- main.exe  3 2 1 0 0 1 2 3 

module Main where

import System.Environment

data T = A | B | C | D | E deriving (Enum)

{-# NOINLINE func #-}
func :: T -> Int
func A = 111
func B = 222
func C = 322
func D = 422

main = do
  args <- getArgs
  let iargs = map (toEnum . read) args :: [T]
  let res = sum . map func $ concat $ replicate 5000000 iargs
  print res

The change might also be underrepresented in nofib.

The speed of GHC itself doesn't change much (measured by building nofib/spectral/simple) 1)

O0O2O2 with weights disabled
head894 ms3.88 s3.89 s
patched896 ms3.91 s3.86 s
rel diff100,2%100,8%99,2%

There is a small overlap in the 95% confidence interval

Given that we always pay to price carrying the extra information around in all STG alts and parts of Cmm at least that is decent.


1 ) Besides NoFib I (ab)use criterion to benchmark the executables using (https://github.com/AndreasPK/bench-exec).

It dynamically creates a benchmark that runs the given executable/args until it reaches the given time limit.
While it causes a bit of overhead it's easier than summing them up by hand and reliable showcases difference as low as 1-2ms.
For the compile benchmarks I used 70 seconds which translated to 15 Runs for the -O2 case.

AndreasK edited the summary of this revision. (Show Details)Feb 22 2018, 5:48 AM

I've run nofib overnight and the results were a bit underwhelming. As it stands it's most likely not worth merging.

Andreas, this is good work. I'm inclined to agree that it's not worth adding all this extra code if the perf payoff is very small. That's disappointing.

But if you decide to park this, please add your long comment, with perf numbers, to the ticket. And probably commit your work to a wip branch of the main GHC repository, so that someone else could easily pick it up at a future date. (Add the branch id to the ticket of course, so it's discoverable.)

Thanks!

Simon

I've run nofib overnight and the results were a bit underwhelming. As it stands it's most likely not worth merging.

Andreas, this is good work. I'm inclined to agree that it's not worth adding all this extra code if the perf payoff is very small. That's disappointing.

But if you decide to park this, please add your long comment, with perf numbers, to the ticket. And probably commit your work to a wip branch of the main GHC repository, so that someone else could easily pick it up at a future date. (Add the branch id to the ticket of course, so it's discoverable.)

Thanks!

Simon

Maybe I can put some work into this for GSoC. Otherwise how would I go about getting this into a wip branch? I don't have commit rights.

Otherwise how would I go about getting this into a wip branch? I don't have commit rights

I guess you'd need help from someone on ghc-devs who does; e.g. Ben Gamari

I've pushed this patch to wip/D4327 .

AndreasK updated this revision to Diff 15751.EditedMar 17 2018, 4:31 AM
  • Optionally balance the switchplan tree by branch weight.
  • Rebase

In a discussion with alexbiehl about code being placed inside of loops I realized we could take advantage of the occurrence checker to find simple loops and mark looping branches as likely.

This has the effect that for code like:

entry:
	...
	if c then goto L1 else goto L2;
L2:
	<cleanup>
	<leave loop>
L1:
	<loop body>
	goto entry:

We can mark L2 as unlikely as it breaks out of the loop.

The consequence of this is that we can reliably move L2 out of the loop body and fall through to L1 improving code locality.

I will look into this and see if it makes a difference.

How is this going, @AndreasK?

AndreasK planned changes to this revision.Apr 22 2018, 4:53 PM

How is this going, @AndreasK?

I did some work on this. But so far the results still don't justify the complexity. Although it's pretty clear which things should be tried before a real judgment can be made.

I did implement a rudimentary version of the loop/recursion detection idea above.
Overall performance was underwhelming. While most nofib programs got slightly faster (0-1%) very few got a lot (order of ~5%) slower.

What I did not do yet is propagte the likelyhood info into the asm passes which is probably required. Without this we get some awkward interactions.
For example we might have code of the sort:

if checkError() then {goto panic} else {goto foo};

Assuming another block also jumps to foo there are two ways to compile this block:

check:
    cmp R1, 0
    jnz panic
    jmp foo
    
check2:
    cmp R1, 0
    jz foo
    jmp panic

Performance of jumps to panic is essentially meaningless so we want variant two which saves an instruction in each loop.

Blocklayout however turns this into a pessimization.

We often get something like this:

check:
    cmp R1, 0
    jnz panic
    jmp foo
bar: #<other block jumping to foo>
    few ins here
    #<shortcut>jmp foo
foo:
    ins do things
panic:
    some
    more
    instructions

This seems reasonable. If we are lucky bar is small enough that when we jump to foo we won't even miss the cache.

But if we optimize the check block we suddenly get:

check2:
    cmp R1, 0
    jz foo
    #<shortcut> jmp panic #block layout assumes we will take this path
panic:
    some
    more
    instructions
bar: #<other block jumping to foo>
    few ins here
    #<shortcut>jmp foo
foo:
    ins do things

Clearly we don't want this because it pulls check2 and foo far apart. But block layout assumes the last jump is the likely one so tries to place panic right after check2.
If this is a loop there is a good chance that this causes a cache miss on each jump to foo.

I did play around with the block layout code but without actually having likelyhood info I couldn't make it work better than what we have know.
I do want to give lowering likelyhood info into the asm stage a try sooner or (probably) later since I would expect that to lead to much better code..

However for now I ran out of steam before I came up with a good way to tackle this.
Some of the issues that stopped me where:

  • Good block layout algorithms are not easy. There are descriptions for algorithms which work with full information about block entry counts. But Implementing one which works well on partial information is something I couldn't find anything on yet. Rolling my own is not out of the question but seems like something for a few weeks and not a few days.
  • There is no obvious best design for lowering likelyhood information to the asm level. Annotate the instructions? Add information about blocks in a sidechannel? It's also not static since things like the register allocator also generate new blocks on the fly. Shortcutting might remove blocks and so on.
  • Compared to Cmm the assembler passes are not well documented.
  • Besides block layout if we lower that information it should also used by the register allocators. Which I haven't even looked at yet.
  • Much of what would profit most from this (block layouts, register allocation) might be redundant if we switch to llvm. So I'm also a bit hesistant since it seems not clear where (and how fast) llvm as a backend is going.

Lots of useful info in your comment. But I'm worried that Phab is not a good place to keep this info if you are going to "park" it for a while.

My suggestion:

  • Push what you have to wip/T14672 in GHC's rept
  • Write a status report (copy/paste from your comment above) in the Trac ticket.

That'll robustly store your progress for you or someone else to pick up later

AndreasK updated this revision to Diff 16208.Fri, Apr 27, 3:39 AM

This is a WIP patch. The main things it does are:

  • Add branch weights to Switchplan.
  • Mark bottom exprs as unlikely (funlikely-bottoms)
  • Add branch weight to switchplan
  • Optionally build the switchplan tree by branch weight.
  • Mark recursive cases as likely.

It works fine but as it stands doesn't improve performance in meaningful ways.
See Trac #14672 for a discussion about the issues left to resolve.

I'd be happy to see this land as a wip/ branch in the main GHC repo, so that anyone who picks up the ticket has a well-defined starting point.

@simonpj as an alternative to creating a wip/ branch, people can pick it up by saying arc patch D4327 in their GHC repo. It saves pushing anything to the main repo, the result is exactly the same.

AndreasK updated this revision to Diff 16336.Mon, May 7, 2:56 PM

This my latest changeset for likelyhood in GHC.

  • Add BranchWeight to Switchplan.
  • Make unlikely bottoms a flag
  • Stop swapping branches in ImplementSwitchplan
  • Add branch weight to switchplan
  • FindSingleValue checks for likely branches.
  • Optionally build the switchplan tree by branch weight.
  • Mark recursive cases as likely.

It should compile with some harmless warnings.