DO NOT MERGE - Experimental code layout
Needs ReviewPublic

Authored by AndreasK on May 23 2018, 3:57 PM.

Details

Reviewers
bgamari
jmct
jrtc27
simonmar
Trac Issues
#15124
Summary

This patch implements a different experimental code layout.
It has only been finished/tested for amd64.

Performance is slightly better (but within the margin of error)
on my machine. However compile times are definitely worse so
until there is something else which can take advantage of this
it's not worth implementing this fully.

Things this patch does:

  • Move code responsilbe for block layout in it's own Module.
  • Move the NcgImpl Class into NCGMonad.
  • Extract a control flow graph from the input cmm.
  • Update this cfg to keep it in sync with changes during asm codegen. I've only did the parts used by amd64 so this is definitely broken for x86 and likely broken on the other backends. Things that change the CFG are::
    • Blocks added by the linear register allocator.
    • Blocks added by the Cmm -> [Instr] translation.
    • Shortcutting.
  • Assign weights to the edges in the CFG which are then used for block layout.
Test Plan

ci

Diff Detail

Repository
rGHC Glasgow Haskell Compiler
Branch
layoutOpt
Lint
Lint WarningsExcuse: better layout
SeverityLocationCodeMessage
Warningcompiler\cmm\CmmNode.hs:575TXT3Line Too Long
Warningcompiler\nativeGen\AsmCodeGen.hs:253TXT3Line Too Long
Warningcompiler\nativeGen\AsmCodeGen.hs:595TXT3Line Too Long
Warningcompiler\nativeGen\AsmCodeGen.hs:776TXT3Line Too Long
Warningcompiler\nativeGen\CFG.hs:400TXT3Line Too Long
Warningcompiler\nativeGen\CFG.hs:509TXT3Line Too Long
Warningcompiler\nativeGen\NCGMonad.hs:74TXT3Line Too Long
Warningcompiler\nativeGen\NCGMonad.hs:79TXT3Line Too Long
Warningcompiler\nativeGen\NCGMonad.hs:80TXT3Line Too Long
Warningcompiler\nativeGen\NCGMonad.hs:82TXT3Line Too Long
Warningcompiler\nativeGen\NCGMonad.hs:83TXT3Line Too Long
Warningcompiler\nativeGen\NCGMonad.hs:85TXT3Line Too Long
Warningcompiler\nativeGen\NCGMonad.hs:91TXT3Line Too Long
Warningcompiler\nativeGen\RegAlloc\Liveness.hs:668TXT3Line Too Long
Warningcompiler\nativeGen\X86\CodeGen.hs:1974TXT3Line Too Long
Warningcompiler\nativeGen\X86\CodeGen.hs:2152TXT3Line Too Long
Warningcompiler\nativeGen\X86\CodeGen.hs:3273TXT3Line Too Long
Warningcompiler\nativeGen\X86\Cond.hs:73TXT3Line Too Long
Warningcompiler\nativeGen\X86\Cond.hs:75TXT3Line Too Long
Warningcompiler\nativeGen\X86\Cond.hs:76TXT3Line Too Long
Warningcompiler\utils\Digraph.hs:486TXT3Line Too Long
Warningcompiler\utils\Digraph.hs:494TXT3Line Too Long
Unit
No Unit Test Coverage
Build Status
Buildable 22282
Build 51673: [GHC] Linux/amd64: Continuous Integration
Build 51672: [GHC] OSX/amd64: Continuous Integration
Build 51671: [GHC] Windows/amd64: Continuous Integration
Build 51670: arc lint + arc unit
AndreasK created this revision.May 23 2018, 3:57 PM
AndreasK planned changes to this revision.May 23 2018, 6:24 PM

As it stands performance is about the same but compile times are slightly worse.

There are still some knobs to turn here and it should also be more amendable to take advantage of control flow information from eg profiling.
Hopefully I will be able to base some other work on it later.

AndreasK updated this revision to Diff 16519.May 24 2018, 2:49 AM

Some adjustments to the weights.

I've run nofib this night and results actually look quite good after a bit of tweaking!

IgnoreCal is this patch. Likely04 is this patch on top of D4327.

---------------------------------------------------------------------------------------------------------------------------
        Program           Size      Size    Allocs    Allocs   Runtime   Runtime   Elapsed   Elapsed  TotalMem  TotalMem
                     IgnoreCal  Likely04 IgnoreCal  Likely04 IgnoreCal  Likely04 IgnoreCal  Likely04 IgnoreCal  Likely04
---------------------------------------------------------------------------------------------------------------------------
            Min          -0.1%     -0.1%     -0.0%      0.0%     -3.1%     -5.3%     -3.0%     -6.2%     -1.7%     -0.7%
            Max          +0.0%     +0.0%      0.0%     +0.0%     +1.6%     +7.3%     +1.6%     +7.0%     +0.5%     +0.8%
 Geometric Mean          -0.0%     +0.0%     -0.0%     -0.0%     -0.3%     -1.1%     -0.5%     -1.2%     -0.0%     +0.0%

However I want to get this tested on at least 1-2 other CPUs before I put more work into getting this ready to land.

Full nofib log:

AndreasK updated this revision to Diff 16569.May 29 2018, 6:20 AM
  • Only look at reachable graphs when constructing the CFG.
  • Import cleanup.

cmmToCmm finds some dead blocks and removes conditional jumps to them.
However they are still members of the map used by CmmGraph.

By using revPostOrder we avoid adding these dead blocks to the CFG.

AndreasK updated this revision to Diff 16591.May 30 2018, 4:50 PM
  • Performance improvements
AndreasK updated this revision to Diff 16626.Jun 1 2018, 5:16 AM
  • Trim edges in chain linking passes.
AndreasK updated this revision to Diff 17038.Jun 21 2018, 1:00 PM
  • Rebase.

I've run nofib on a Xeon E3-1220 which jmct has provided me for a bench build and the results seem to hold up.

  • loglayout is just this patch
  • loglikely is D4327
  • likeAndLa is this + D4327
NoFib Results

--------------------------------------------------------------------------------
        Program           Size      Size      Size    Allocs    Allocs    Allocs   Runtime   Runtime   Runtime   Elapsed   Elapsed   Elapsed  TotalMem  TotalMem  TotalMem
                     loglayout loglikely likeAndLa loglayout loglikely likeAndLa loglayout loglikely likeAndLa loglayout loglikely likeAndLa loglayout loglikely likeAndLa
--------------------------------------------------------------------------------
             CS          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.195     0.178     0.191     0.195     0.178     0.192      0.0%      0.0%      0.0%
            CSD          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     +1.6%     -5.5%     -4.3%     +1.6%     -5.4%     -4.3%      0.0%      0.0%      0.0%
             FS          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     +4.3%     -3.9%     -5.6%     +4.3%     -3.9%     -5.6%      0.0%      0.0%      0.0%
              S          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     -2.2%     -2.2%     -1.5%     -2.2%     -2.3%     -1.6%      0.0%      0.0%      0.0%
             VS          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     -2.7%     -1.8%     -1.8%     -2.7%     -1.8%     -1.8%      0.0%      0.0%      0.0%
            VSD          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.006     0.009     0.009     0.006     0.009     0.009      0.0%      0.0%      0.0%
            VSM          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%    -22.5%    -28.6%    -28.0%    -22.6%    -28.6%    -28.0%      0.0%      0.0%      0.0%
           anna          -0.2%     +0.0%     -0.1%      0.0%      0.0%      0.0%     0.067     0.067     0.068     0.067     0.067     0.068      0.0%      0.0%      0.0%
           ansi          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.000     0.000     0.000     0.000     0.000     0.000      0.0%      0.0%      0.0%
           atom          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.187     0.191     0.191     0.187     0.191     0.190      0.0%      0.0%      0.0%
         awards          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.000     0.000     0.000     0.000     0.000     0.000      0.0%      0.0%      0.0%
         banner          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.000     0.000     0.000     0.000     0.000     0.000      0.0%      0.0%      0.0%
     bernouilli          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.113     0.111     0.110     0.113     0.111     0.110      0.0%      0.0%      0.0%
   binary-trees          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     +1.1%     +4.9%     -1.5%     +1.1%     +4.9%     -1.5%      0.0%      0.0%      0.0%
          boyer          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.026     0.025     0.025     0.026     0.025     0.025      0.0%      0.0%      0.0%
         boyer2          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.005     0.005     0.005     0.005     0.005     0.005      0.0%      0.0%      0.0%
           bspt          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     0.005     0.005     0.005     0.005     0.005     0.005      0.0%      0.0%      0.0%
      cacheprof          -0.1%     +0.0%     -0.1%     -0.1%     -0.1%     -0.0%     -0.4%     -4.5%     -2.3%     -0.4%     -4.5%     -2.3%     -0.5%     -0.5%     +0.5%
       calendar          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.000     0.000     0.000     0.000     0.000     0.000      0.0%      0.0%      0.0%
       cichelli          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.054     0.050     0.050     0.054     0.050     0.050      0.0%      0.0%      0.0%
        circsim          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     +0.8%     +0.0%     +0.6%     +0.8%     -0.0%     +0.6%      0.0%      0.0%      0.0%
       clausify          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.024     0.022     0.021     0.024     0.022     0.021      0.0%      0.0%      0.0%
  comp_lab_zift          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     0.121     0.120     0.118     0.121     0.120     0.118      0.0%      0.0%      0.0%
       compress          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.087     0.085     0.084     0.087     0.085     0.084      0.0%      0.0%      0.0%
      compress2          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.091     0.091     0.090     0.091     0.091     0.090      0.0%      0.0%      0.0%
    constraints          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     -0.2%     -0.5%     -0.9%     -0.2%     -0.5%     -0.9%      0.0%      0.0%      0.0%
   cryptarithm1          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     -2.4%     -6.4%     -6.0%     -2.4%     -6.4%     -6.0%      0.0%      0.0%      0.0%
   cryptarithm2          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.005     0.005     0.005     0.005     0.005     0.005      0.0%      0.0%      0.0%
            cse          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.001     0.001     0.001     0.001     0.001     0.001      0.0%      0.0%      0.0%
   digits-of-e1          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     -1.6%     -7.0%     -7.8%     -1.6%     -7.1%     -7.8%      0.0%      0.0%      0.0%
   digits-of-e2          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     -0.9%     -8.3%     -7.8%     -0.9%     -8.3%     -7.8%      0.0%      0.0%      0.0%
          eliza          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.000     0.000     0.000     0.000     0.000     0.000      0.0%      0.0%      0.0%
          event          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.100     0.097     0.098     0.100     0.097     0.098      0.0%      0.0%      0.0%
    exact-reals          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     +0.3%     -1.6%     -2.6%     +0.3%     -1.6%     -2.6%      0.0%      0.0%      0.0%
         exp3_8          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.146     0.146     0.146     0.146     0.146     0.146      0.0%      0.0%      0.0%
         expert          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.000     0.000     0.000     0.000     0.000     0.000      0.0%      0.0%      0.0%
 fannkuch-redux          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     -1.8%     -3.1%     -3.8%     -1.8%     -3.1%     -3.8%      0.0%      0.0%      0.0%
          fasta          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     -1.3%     -1.6%     -1.4%     -1.3%     -1.6%     -1.4%      0.0%      0.0%      0.0%
            fem          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     0.013     0.012     0.013     0.013     0.012     0.013      0.0%      0.0%      0.0%
            fft          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     0.022     0.022     0.022     0.022     0.022     0.022      0.0%      0.0%      0.0%
           fft2          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     0.032     0.032     0.032     0.032     0.032     0.032      0.0%      0.0%      0.0%
       fibheaps          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.015     0.015     0.015     0.015     0.015     0.015      0.0%      0.0%      0.0%
           fish          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.006     0.006     0.006     0.006     0.006     0.006      0.0%      0.0%      0.0%
          fluid          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     0.005     0.004     0.004     0.005     0.004     0.004      0.0%      0.0%      0.0%
         fulsom          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     0.180     0.180     0.180     0.180     0.180     0.180      0.0%      0.0%      0.0%
         gamteb          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.025     0.025     0.025     0.025     0.025     0.025      0.0%      0.0%      0.0%
            gcd          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.027     0.026     0.026     0.027     0.026     0.026      0.0%      0.0%      0.0%
    gen_regexps          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.000     0.000     0.000     0.000     0.000     0.000      0.0%      0.0%      0.0%
         genfft          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.020     0.021     0.020     0.020     0.021     0.020      0.0%      0.0%      0.0%
             gg          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     0.007     0.007     0.007     0.007     0.007     0.007      0.0%      0.0%      0.0%
           grep          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     0.000     0.000     0.000     0.000     0.000     0.000      0.0%      0.0%      0.0%
         hidden          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     +4.8%     +0.0%     +0.5%     +5.0%     -0.1%     +0.7%      0.0%      0.0%      0.0%
            hpg          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.115     0.118     0.115     0.115     0.118     0.115      0.0%      0.0%      0.0%
            ida          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.057     0.055     0.056     0.057     0.055     0.056      0.0%      0.0%      0.0%
          infer          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     0.033     0.033     0.033     0.033     0.033     0.033      0.0%      0.0%      0.0%
        integer          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     -0.2%     -3.4%     -0.1%     -0.2%     -3.3%     -0.1%      0.0%      0.0%      0.0%
      integrate          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.093     0.087     0.083     0.093     0.087     0.083      0.0%      0.0%      0.0%
   k-nucleotide          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     +1.3%     -1.1%     -2.3%     +1.3%     -1.1%     -2.3%      0.0%      0.0%      0.0%
          kahan          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     +1.2%     +0.2%     +0.6%     +1.2%     +0.2%     +0.6%      0.0%      0.0%      0.0%
        knights          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.003     0.002     0.002     0.003     0.002     0.002      0.0%      0.0%      0.0%
         lambda          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     -2.7%     -5.4%     -4.3%     -2.7%     -5.4%     -4.3%      0.0%      0.0%      0.0%
     last-piece          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     +0.6%     -0.2%     -1.0%     +0.6%     -0.3%     -1.0%      0.0%      0.0%      0.0%
           lcss          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     -0.6%     -0.8%     -1.0%     -0.6%     -0.9%     -1.0%      0.0%      0.0%      0.0%
           life          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.155     0.155     0.154     0.155     0.155     0.154      0.0%      0.0%      0.0%
           lift          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.001     0.001     0.001     0.001     0.001     0.001      0.0%      0.0%      0.0%
         linear          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     -0.3%     -1.5%     -0.9%     -0.3%     -1.5%     -0.9%      0.0%      0.0%      0.0%
      listcompr          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.064     0.062     0.062     0.064     0.062     0.062      0.0%      0.0%      0.0%
       listcopy          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.067     0.066     0.066     0.067     0.066     0.066      0.0%      0.0%      0.0%
       maillist          -0.1%     +0.0%     -0.0%      0.0%     +0.0%     +0.0%     0.042     0.042     0.042     0.043     0.043     0.043    +16.7%    +36.7%    +40.0%
         mandel          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     0.045     0.043     0.043     0.045     0.043     0.043      0.0%      0.0%      0.0%
        mandel2          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.002     0.002     0.002     0.002     0.002     0.002      0.0%      0.0%      0.0%
           mate          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     -1.2%     -6.4%     -4.1%     -1.2%     -6.4%     -4.1%      0.0%      0.0%      0.0%
        minimax          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.001     0.001     0.001     0.001     0.001     0.001      0.0%      0.0%      0.0%
        mkhprog          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.001     0.001     0.001     0.001     0.001     0.001      0.0%      0.0%      0.0%
     multiplier          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.060     0.062     0.063     0.060     0.062     0.063      0.0%      0.0%      0.0%
         n-body          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     -0.0%     +0.0%     -0.0%     -0.0%     +0.0%     -0.1%      0.0%      0.0%      0.0%
       nucleic2          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     0.048     0.052     0.049     0.048     0.052     0.049      0.0%      0.0%      0.0%
           para          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     0.188     0.191     0.186     0.188     0.191     0.186      0.0%      0.0%      0.0%
      paraffins          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.079     0.078     0.078     0.079     0.078     0.078      0.0%      0.0%      0.0%
         parser          -0.1%     -0.0%     -0.1%      0.0%      0.0%      0.0%     0.019     0.018     0.018     0.019     0.018     0.018      0.0%      0.0%      0.0%
        parstof          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     0.003     0.003     0.003     0.003     0.003     0.003      0.0%      0.0%      0.0%
            pic          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     0.004     0.004     0.004     0.004     0.004     0.004      0.0%      0.0%      0.0%
       pidigits          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     -0.1%     -0.2%     -0.3%     -0.3%     -0.1%     -0.3%      0.0%      0.0%      0.0%
          power          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     -0.3%     -2.7%     -1.4%     -0.2%     -2.7%     -1.4%      0.0%      0.0%      0.0%
         pretty          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.000     0.000     0.000     0.000     0.000     0.000      0.0%      0.0%      0.0%
         primes          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.045     0.044     0.045     0.045     0.044     0.045      0.0%      0.0%      0.0%
      primetest          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.074     0.072     0.071     0.074     0.072     0.071      0.0%      0.0%      0.0%
         prolog          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     0.001     0.001     0.001     0.001     0.001     0.001      0.0%      0.0%      0.0%
         puzzle          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.082     0.081     0.081     0.082     0.081     0.081      0.0%      0.0%      0.0%
         queens          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.008     0.008     0.008     0.008     0.008     0.008      0.0%      0.0%      0.0%
        reptile          -0.1%      0.0%     -0.1%      0.0%      0.0%      0.0%     0.007     0.007     0.007     0.007     0.007     0.007      0.0%      0.0%      0.0%
reverse-complem          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.077     0.074     0.073     0.077     0.074     0.073      0.0%      0.0%      0.0%
        rewrite          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.012     0.011     0.011     0.012     0.011     0.011      0.0%      0.0%      0.0%
           rfib          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.010     0.010     0.010     0.010     0.010     0.010      0.0%      0.0%      0.0%
            rsa          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.016     0.016     0.016     0.016     0.016     0.015      0.0%      0.0%      0.0%
            scc          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.000     0.000     0.000     0.000     0.000     0.000      0.0%      0.0%      0.0%
          sched          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.013     0.012     0.012     0.013     0.012     0.012      0.0%      0.0%      0.0%
            scs          -0.2%     -0.0%     -0.1%      0.0%      0.0%      0.0%     +0.3%     -3.5%     -2.5%     +0.2%     -3.5%     -2.6%      0.0%      0.0%      0.0%
         simple          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     0.122     0.126     0.123     0.122     0.126     0.123      0.0%      0.0%      0.0%
          solid          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.085     0.083     0.082     0.085     0.083     0.082      0.0%      0.0%      0.0%
        sorting          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.001     0.001     0.001     0.001     0.001     0.001      0.0%      0.0%      0.0%
  spectral-norm          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%      0.0%     -0.0%     -0.0%     +0.0%     -0.0%     -0.0%      0.0%      0.0%      0.0%
         sphere          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     0.031     0.031     0.031     0.031     0.031     0.031      0.0%      0.0%      0.0%
         symalg          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.006     0.006     0.006     0.006     0.006     0.006      0.0%      0.0%      0.0%
            tak          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.007     0.007     0.007     0.007     0.007     0.007      0.0%      0.0%      0.0%
      transform          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     +0.6%     -3.1%     -2.5%     +0.6%     -3.1%     -2.5%      0.0%      0.0%      0.0%
       treejoin          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.090     0.090     0.090     0.090     0.090     0.090      0.0%      0.0%      0.0%
      typecheck          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.155     0.154     0.154     0.155     0.154     0.155      0.0%      0.0%      0.0%
        veritas          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     0.001     0.001     0.001     0.001     0.001     0.001      0.0%      0.0%      0.0%
           wang          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.066     0.065     0.063     0.066     0.065     0.063      0.0%      0.0%      0.0%
      wave4main          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.180     0.178     0.179     0.180     0.178     0.179      0.0%      0.0%      0.0%
   wheel-sieve1          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     +1.1%     -0.7%     +0.7%     +1.1%     -0.7%     +0.7%      0.0%      0.0%      0.0%
   wheel-sieve2          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.133     0.133     0.134     0.133     0.133     0.134      0.0%      0.0%      0.0%
           x2n1          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.002     0.002     0.002     0.002     0.002     0.002      0.0%      0.0%      0.0%
--------------------------------------------------------------------------------
            Min          -0.2%     -0.0%     -0.1%     -0.1%     -0.1%     -0.0%    -22.5%    -28.6%    -28.0%    -22.6%    -28.6%    -28.0%     -0.5%     -0.5%      0.0%
            Max          -0.1%     +0.0%     -0.0%      0.0%     +0.0%     +0.0%     +4.8%     +4.9%     +0.7%     +5.0%     +4.9%     +0.7%    +16.7%    +36.7%    +40.0%
 Geometric Mean          -0.1%     +0.0%     -0.1%     -0.0%     -0.0%     -0.0%     -0.9%     -3.4%     -3.2%     -0.9%     -3.4%     -3.2%     +0.1%     +0.3%     +0.3%

The only benchmark where memory differed is maillist which I attribute to it's undeterministic GC behaviour.
That the likelyhood patch alone performs so well is a bit surprising given that this isn't the case on skylake.

I will rerun the benchmark to make sure these results are sane but it's good to see that it's still a win after rebasing and on a different CPU.

jmct added a comment.Jun 27 2018, 11:02 PM

Here are the results on the Ryzen 7 I am able to access. I ran it on a quiet headless machine with NoFibRuns=30. The results differ a bit. It could be noise though.

NoFib Results

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
             CS          -0.2%      0.0%     0.150     0.150      0.0%
            CSD          -0.3%      0.0%    -11.5%    -11.5%      0.0%
             FS          -0.2%      0.0%     -1.2%     -1.2%      0.0%
              S          -0.3%      0.0%     +0.8%     +0.8%      0.0%
             VS          -0.3%      0.0%     -0.4%     -0.3%      0.0%
            VSD          -0.3%      0.0%     0.008     0.008      0.0%
            VSM          -0.2%      0.0%    -12.9%    -12.9%      0.0%
           anna          -0.5%      0.0%     0.070     0.070      0.0%
           ansi          -0.4%      0.0%     0.000     0.000      0.0%
           atom          -0.4%      0.0%     0.133     0.133      0.0%
         awards          -0.3%      0.0%     0.000     0.000      0.0%
         banner          -0.2%      0.0%     0.000     0.000      0.0%
     bernouilli          -0.4%      0.0%     0.095     0.095      0.0%
   binary-trees          -0.4%      0.0%     +4.5%     +4.5%      0.0%
          boyer          -0.3%      0.0%     0.027     0.027      0.0%
         boyer2          -0.3%      0.0%     0.005     0.005      0.0%
           bspt          -0.3%      0.0%     0.005     0.005      0.0%
      cacheprof          -0.4%     +0.1%     -1.5%     -1.5%     -0.9%
       calendar          -0.3%      0.0%     0.000     0.000      0.0%
       cichelli          -0.3%      0.0%     0.055     0.055      0.0%
        circsim          -0.4%      0.0%     +5.0%     +5.0%      0.0%
       clausify          -0.3%      0.0%     0.023     0.023      0.0%
  comp_lab_zift          -0.4%      0.0%     0.107     0.107      0.0%
       compress          -0.3%      0.0%     0.074     0.074      0.0%
      compress2          -0.2%      0.0%     0.078     0.078      0.0%
    constraints          -0.3%      0.0%     -0.9%     -0.9%      0.0%
   cryptarithm1          -0.3%      0.0%     +1.7%     +1.7%      0.0%
   cryptarithm2          -0.3%      0.0%     0.006     0.006      0.0%
            cse          -0.3%      0.0%     0.001     0.001      0.0%
   digits-of-e1          -0.4%      0.0%     -0.2%     -0.2%      0.0%
   digits-of-e2          -0.4%      0.0%     +1.7%     +1.7%      0.0%
          eliza          -0.3%      0.0%     0.000     0.000      0.0%
          event          -0.3%      0.0%     0.091     0.091      0.0%
    exact-reals          -0.4%      0.0%     -0.1%     -0.2%      0.0%
         exp3_8          -0.3%      0.0%     0.097     0.097      0.0%
         expert          -0.3%      0.0%     0.000     0.000      0.0%
 fannkuch-redux          -0.3%      0.0%     -2.6%     -2.6%      0.0%
          fasta          -0.3%      0.0%     0.168     0.168      0.0%
            fem          -0.4%      0.0%     0.012     0.012      0.0%
            fft          -0.4%      0.0%     0.020     0.021      0.0%
           fft2          -0.4%      0.0%     0.027     0.027      0.0%
       fibheaps          -0.3%      0.0%     0.015     0.015      0.0%
           fish          -0.3%      0.0%     0.006     0.006      0.0%
          fluid          -0.5%      0.0%     0.005     0.005      0.0%
         fulsom          -0.4%      0.0%     0.152     0.152      0.0%
         gamteb          -0.4%      0.0%     0.025     0.025      0.0%
            gcd          -0.3%      0.0%     0.026     0.026      0.0%
    gen_regexps          -0.3%      0.0%     0.000     0.000      0.0%
         genfft          -0.3%      0.0%     0.021     0.021      0.0%
             gg          -0.4%      0.0%     0.007     0.007      0.0%
           grep          -0.3%      0.0%     0.000     0.000      0.0%
         hidden          -0.4%      0.0%     -0.1%     -0.1%      0.0%
            hpg          -0.4%      0.0%     0.058     0.058      0.0%
            ida          -0.3%      0.0%     0.058     0.058      0.0%
          infer          -0.5%      0.0%     0.034     0.034      0.0%
        integer          -0.3%      0.0%     -1.6%     -1.6%      0.0%
      integrate          -0.3%      0.0%     0.066     0.066      0.0%
   k-nucleotide          -0.4%     +0.0%     -8.7%     -8.7%      0.0%
          kahan          -0.3%      0.0%     0.175     0.175      0.0%
        knights          -0.3%      0.0%     0.003     0.003      0.0%
         lambda          -0.3%      0.0%     -1.5%     -1.5%      0.0%
     last-piece          -0.3%      0.0%     +3.3%     +3.3%      0.0%
           lcss          -0.3%      0.0%     -0.4%     -0.4%      0.0%
           life          -0.3%      0.0%     0.131     0.131      0.0%
           lift          -0.3%      0.0%     0.001     0.001      0.0%
         linear          -0.4%      0.0%     -1.2%     -1.2%      0.0%
      listcompr          -0.2%      0.0%     0.058     0.059      0.0%
       listcopy          -0.2%      0.0%     0.061     0.062      0.0%
       maillist          -0.3%     -0.0%     0.032     0.032    +10.5%
         mandel          -0.4%      0.0%     0.043     0.043      0.0%
        mandel2          -0.3%      0.0%     0.002     0.002      0.0%
           mate          -0.4%      0.0%     -2.9%     -2.9%      0.0%
        minimax          -0.3%      0.0%     0.002     0.002      0.0%
        mkhprog          -0.3%      0.0%     0.001     0.001      0.0%
     multiplier          -0.3%      0.0%     0.069     0.069      0.0%
         n-body          -0.4%      0.0%     +0.1%     +0.1%      0.0%
       nucleic2          -0.4%      0.0%     0.054     0.054      0.0%
           para          -0.3%      0.0%     0.172     0.172      0.0%
      paraffins          -0.3%      0.0%     0.067     0.068      0.0%
         parser          -0.4%      0.0%     0.018     0.018      0.0%
        parstof          -0.3%      0.0%     0.004     0.004      0.0%
            pic          -0.4%      0.0%     0.003     0.003      0.0%
       pidigits          -0.4%      0.0%     +0.7%     +0.8%      0.0%
          power          -0.4%      0.0%     0.190     0.191      0.0%
         pretty          -0.3%      0.0%     0.000     0.000      0.0%
         primes          -0.3%      0.0%     0.033     0.033      0.0%
      primetest          -0.4%      0.0%     0.060     0.060      0.0%
         prolog          -0.3%      0.0%     0.001     0.001      0.0%
         puzzle          -0.2%      0.0%     0.098     0.098      0.0%
         queens          -0.3%      0.0%     0.008     0.008      0.0%
        reptile          -0.4%      0.0%     0.007     0.007      0.0%
reverse-complem          -0.3%      0.0%     0.076     0.076      0.0%
        rewrite          -0.3%      0.0%     0.014     0.014      0.0%
           rfib          -0.4%      0.0%     0.011     0.011      0.0%
            rsa          -0.4%      0.0%     0.016     0.016      0.0%
            scc          -0.3%      0.0%     0.000     0.000      0.0%
          sched          -0.3%      0.0%     0.015     0.016      0.0%
            scs          -0.6%      0.0%     +0.4%     +0.3%      0.0%
         simple          -0.5%      0.0%     0.099     0.100      0.0%
          solid          -0.4%      0.0%     0.079     0.079      0.0%
        sorting          -0.3%      0.0%     0.001     0.001      0.0%
  spectral-norm          -0.4%      0.0%     -0.3%     -0.3%      0.0%
         sphere          -0.4%      0.0%     0.034     0.034      0.0%
         symalg          -0.4%      0.0%     0.005     0.005      0.0%
            tak          -0.3%      0.0%     0.008     0.008      0.0%
      transform          -0.4%      0.0%     0.192     0.192      0.0%
       treejoin          -0.3%      0.0%     0.077     0.077      0.0%
      typecheck          -0.3%      0.0%     0.156     0.156      0.0%
        veritas          -0.4%      0.0%     0.001     0.001      0.0%
           wang          -0.4%      0.0%     0.058     0.058      0.0%
      wave4main          -0.3%      0.0%     0.174     0.175      0.0%
   wheel-sieve1          -0.3%      0.0%     0.129     0.129      0.0%
   wheel-sieve2          -0.3%      0.0%     0.102     0.103      0.0%
           x2n1          -0.4%      0.0%     0.001     0.001      0.0%
--------------------------------------------------------------------------------
            Min          -0.6%     -0.0%    -12.9%    -12.9%     -0.9%
            Max          -0.2%     +0.1%     +5.0%     +5.0%    +10.5%
 Geometric Mean          -0.3%     +0.0%     -1.2%     -1.2%     +0.1%
In D4726#135105, @jmct wrote:

Here are the results on the Ryzen 7 I am able to access. I ran it on a quiet headless machine with NoFibRuns=30. The results differ a bit. It could be noise though.

NoFib Results

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
            Min          -0.6%     -0.0%    -12.9%    -12.9%     -0.9%
            Max          -0.2%     +0.1%     +5.0%     +5.0%    +10.5%
 Geometric Mean          -0.3%     +0.0%     -1.2%     -1.2%     +0.1%

So currently we measured:

  • -0.3% on Skylake
  • -0.9% on Xeon E3-1220
  • -1.2% for Ryzen.

I would say this seems workable!

I also have two more ideas I want to look into which might improve things further.

In particular:

  • We could run a basic loop detection algorithm over the CFG and bump the priority of backedges to make sure loop bodies are layout out sequentially.
  • Currently it's somewhat arbitrary which edge is first processed if two have the same priority. Maybe we could get some performance there as well.
jmct added a comment.EditedJun 30 2018, 8:29 PM

This is cool work! I have a few questions,

Currently, the weights in the control flow graph, lines 242-251 of compiler/nativeGen/CFG.hs, here:

compiler/nativeGen/CFG.hs
  --Penalize info tables since they prevent eliminating
  --the jump
  | mapMember dest info -> [((bid,dest),60)]
  | otherwise           -> [((bid,dest),100)]
CmmCondBranch _c t f l 
  -- Prefer false branch to keep in line with old
  -- layout algorithm.
  | l == Nothing -> [((bid,f),51),((bid,t),49)]
  | l == Just True -> [((bid,f),10),((bid,t),90)]
  | l == Just False -> [((bid,t),10),((bid,f),90)]

What's the justification of the different weights? I can see the intuition, but I don't have a good understanding of how changing those weights would affect the results of the algorithm. Direct jumps being 100 makes intuitive sense, but what about the others? For example, why 10/90 and not 20/80?

Another thought/concern:

The performance differences between the architectures is interesting. My first guess was that maybe it was 'more cache == better performance', and so I looked that the available cache on the Xeon and the Ryzen and got the following (I think this is mostly an L1 cache, thing, but I've included all the cache lines just in case):

Xeon
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              8192K
Ryzen
L1d cache:           32K
L1i cache:           64K
L2 cache:            512K
L3 cache:            8192K

So at first glance this would support my hypothesis, since the Ryzen has double the L1i cache of the Xeon. But what does the Skylake look like? I would like to have a better understanding of what is causing the differences in the performance numbers. If the new code layout is rarely a performance regression, but gets more beneficial as the L1i cache scales it'd be worth getting in ghc-mainline (IMO). Another experiment might be worth trying it on a machine with very little cache and see how that goes.

As for your two paths forward, I think the first one seems more immediately beneficial, especially since a lot of performance critical code ends up having an inner loop.

It would be nice for you to link to the wiki page you made in the note [Chain based CFG serialization]. Your blog post and the wiki page were really helpful in understanding what was going on, so thank you for that! Let's make sure future GHC devs can find them :D

In D4726#135331, @jmct wrote:

This is cool work! I have a few questions,

What's the justification of the different weights? I can see the intuition, but I don't have a good understanding of how changing those weights would affect the results of the algorithm. Direct jumps being 100 makes intuitive sense, but what about the others? For example, why 10/90 and not 20/80?

The numbers currently only matter relative to each other.
If we branch a->b, a->c what matters is if edgeWeight a b > edgeWeight a c.
If the difference is 1 or 100 currently doesn't factor into it.

why 10/90 and not 20/80

Currently when we have a branching block with likely annotation in 99% of all cases it will be a heap/stack check or loop.
Both of which would be better described as "once in a blue moon" than unlikely.
Prefering to keep with with multiples of 10 I chose 10 instead of 1 so arrived at 10/90.

The numbers currently don't represent a cost model, just a more is better ranking.
This currently works fine. We don't have real data on how likely branches really are anyway.
So that leaves us with stack/heap checks, info tables and loops were we can say something. And for these the relative ordering is good enough.

I did write it using weights instead of a strict ordering so that if we ever profile branches or similar
we can just rethink the weight assignment and the algorithm itself will still work.

But you are right this isn't documented and I will look to add more info there.

Another thought/concern:

The performance differences between the architectures is interesting. My first guess was that maybe it was 'more cache == better performance', and so I looked that the available cache on the Xeon and the Ryzen and got the following (I think this is mostly an L1 cache, thing, but I've included all the cache lines just in case):

Xeon
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              8192K
Ryzen
L1d cache:           32K
L1i cache:           64K
L2 cache:            512K
L3 cache:            8192K

So at first glance this would support my hypothesis, since the Ryzen has double the L1i cache of the Xeon. But what does the Skylake look like? I would like to have a better understanding of what is causing the differences in the performance numbers. If the new code layout is rarely a performance regression, but gets more beneficial as the L1i cache scales it'd be worth getting in ghc-mainline (IMO). Another experiment might be worth trying it on a machine with very little cache and see how that goes.

Skylake has the same cache sizes as the Xeon (And since the Xeon is Sandy-Bridge based it's pretty similar in general).
I've also run benchmarks on bens Devil's Canyon(~Haswell) and the results were about the same as for the skylake but with more noise so I didn't include them here.

I find it highly unlikely that improvements will scale with larger cache.
Only looking at cache this patch improves the hit rate. But so does a larger cache!
If anything I would expect the difference to get smaller as cache sizes grow.

As for the difference between the Intels maybe skylakes larger reorder buffer
reduces the impact of I-Cache misses. My skylake also uses faster DDR4 so that might also be a factor.
In the end the results are not THAT different.

I assume the Intel/AMD difference stems from differences in their microarchitecture.

  • Different pretch algorithms
  • The ryzen has not only a larger instruction cache but also a larger uOp cache.
  • Ryzen has a slightly larger delay when reading from L2.
  • And a lot more which don't seem immediately relevant.

Not sure if it's relevant but I also measured the skylake under windows.
While I don't expect this to make a big difference it could be a contributing factor.

Overall they are close enough that I won't worry about it too much.
For a runtime measurement taken by two people measuring over 4 systems, two Oses and 4 CPU's it's very consistent imo.

As for your two paths forward, I think the first one seems more immediately beneficial, especially since a lot of performance critical code ends up having an inner loop.

Agreed.

It would be nice for you to link to the wiki page you made in the note [Chain based CFG serialization]. Your blog post and the wiki page were really helpful in understanding what was going on, so thank you for that! Let's make sure future GHC devs can find them :D

Good idea.

AndreasK updated this revision to Diff 17282.Jul 12 2018, 10:20 AM
  • Trim edges in chain linking passes.
  • Explain branch weights, add backedge detection.
  • Refactor where we look for loops
  • Refactor some aspects of the layout algorithm and code.
  • Change CFG weights, refactor CFG creation slightly.
  • Ignore info tables
  • Use edge weights for old algorithm too
AndreasK updated this revision to Diff 17333.Jul 15 2018, 3:12 AM
  • Allow to dynamically specify cfg weights for benchmarking (temporary)

    This allows me to easily benchmark a wider array of configurations.
jmct added a comment.EditedMon, Jul 30, 10:00 AM

The one question I have remaining is whether using unsafeGlobalDynFlags. I've left a note on the line in question to that effect.

Otherwise, just a bit of comment and TODO cleanup (which I think you've already done on your local branch during our call), getting it past Harbormaster, and then it LGTM.

compiler/nativeGen/CFG.hs
380

Either the flags should be passed in, or (if passing them in turns out to be costly), make a note about why this decision was made.

AndreasK updated this revision to Diff 17518.Mon, Jul 30, 10:25 AM
  • Only print nodes not covered by edges explicitly
  • Small improvements from jmcts review
  • Assert include fixup
  • Remove unused bindings
AndreasK updated this revision to Diff 17659.Tue, Aug 14, 9:27 AM
  • Be more aggresive in removing dead code.
  • Add last2 to get last two elements from list efficiently
  • Invert cond branch jump order at the asm level in order to eliminate one.
  • Decouple conditional branch weights from cmm branch order..
AndreasK updated this revision to Diff 17672.Wed, Aug 15, 7:24 AM
  • Remove unused parameters/imports.