NCG: New code layout algorithm.
Needs Review · Public

Authored by AndreasK on May 23 2018, 3:57 PM.

Details

Reviewers
bgamari
jmct
jrtc27
simonmar
Trac Issues
#15124
Summary

This patch implements a different, experimental code layout algorithm.
It has only been finished and tested for amd64, and is therefore disabled on other platforms.

Performance varies slightly by CPU/machine, but in general seems to be better
by 1-2%.
While nofib shows only small differences of about +/- ~0.5% overall depending on
flags/machine, performance in other benchmarks improved significantly.

In an effort to avoid accidental regressions benchmarks were run on 3 platforms:

  • Sandy Bridge Xeon/Linux
  • Haswell/Linux
  • Skylake/Windows

I had to skip some benchmarks on some platforms, however, as getting them
to build was too much work.

  • Library benchmark results summarized:
    • containers: ~1.5% faster
    • aeson: ~2% faster
    • megaparsec: ~2-5% faster
    • xml library benchmarks: 0.2%-1.1% faster
    • vector-benchmarks: 1-4% faster
    • text: 5.5% faster

On average, GHC compile times over all of nofib also went down slightly, by about half a percent.

Things this patch does:

  • Move the code responsible for block layout into its own module.
  • Move the NcgImpl class into NCGMonad.
  • Extract a control flow graph from the input Cmm.
  • Update this CFG to keep it in sync with changes during asm codegen. This has been tested on x64 but should also work on x86. Other platforms still use the old code layout.
  • Assign weights to the edges in the CFG, which are then used for block layout.
  • Once we have the final code layout, try to eliminate some redundant jumps. In particular, turn a sequence of:

          jne .foo
          jmp .bar
        foo:

    into:

          je .bar
        foo:
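The jump-elimination step above can be sketched in Haskell. The instruction type here is a hypothetical stand-in for the NCG's real platform instructions, just to show the shape of the rewrite:

```haskell
type Label = String

data Cond = Je | Jne deriving (Eq, Show)

-- Hypothetical, simplified instruction stream; the real patch works on the
-- NCG's platform-specific instruction types, not this toy.
data Instr
  = JumpCond Cond Label   -- conditional jump
  | Jump Label            -- unconditional jump
  | BlockStart Label      -- label opening a basic block
  | Other String          -- anything else
  deriving (Eq, Show)

invert :: Cond -> Cond
invert Je  = Jne
invert Jne = Je

-- Once the final layout is fixed, a conditional jump to the immediately
-- following block, followed by an unconditional jump, can be replaced by a
-- single inverted conditional jump that falls through into the next block.
shortcutJumps :: [Instr] -> [Instr]
shortcutJumps (JumpCond c t : Jump u : BlockStart l : rest)
  | t == l = JumpCond (invert c) u : BlockStart l : shortcutJumps rest
shortcutJumps (i : rest) = i : shortcutJumps rest
shortcutJumps [] = []
```

So `jne .foo; jmp .bar; foo:` becomes `je .bar; foo:`, saving one jump on the fall-through path.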
Test Plan

ci

Diff Detail

Repository
rGHC Glasgow Haskell Compiler
Branch
layoutOpt
Lint
Lint Warnings
Excuse: looks better
Severity  Location                                      Code  Message
Warning   compiler\nativeGen\AsmCodeGen.hs:599          TXT3  Line Too Long
Warning   compiler\nativeGen\AsmCodeGen.hs:781          TXT3  Line Too Long
Warning   compiler\nativeGen\BlockLayout.hs:262         TXT3  Line Too Long
Warning   compiler\nativeGen\BlockLayout.hs:331         TXT3  Line Too Long
Warning   compiler\nativeGen\BlockLayout.hs:334         TXT3  Line Too Long
Warning   compiler\nativeGen\BlockLayout.hs:414         TXT3  Line Too Long
Warning   compiler\nativeGen\BlockLayout.hs:416         TXT3  Line Too Long
Warning   compiler\nativeGen\BlockLayout.hs:421         TXT3  Line Too Long
Warning   compiler\nativeGen\BlockLayout.hs:422         TXT3  Line Too Long
Warning   compiler\nativeGen\BlockLayout.hs:435         TXT3  Line Too Long
Warning   compiler\nativeGen\CFG.hs:357                 TXT3  Line Too Long
Warning   compiler\nativeGen\CFG.hs:405                 TXT3  Line Too Long
Warning   compiler\nativeGen\NCGMonad.hs:74             TXT3  Line Too Long
Warning   compiler\nativeGen\NCGMonad.hs:79             TXT3  Line Too Long
Warning   compiler\nativeGen\NCGMonad.hs:80             TXT3  Line Too Long
Warning   compiler\nativeGen\NCGMonad.hs:82             TXT3  Line Too Long
Warning   compiler\nativeGen\NCGMonad.hs:83             TXT3  Line Too Long
Warning   compiler\nativeGen\NCGMonad.hs:85             TXT3  Line Too Long
Warning   compiler\nativeGen\NCGMonad.hs:91             TXT3  Line Too Long
Warning   compiler\nativeGen\RegAlloc\Liveness.hs:668   TXT3  Line Too Long
Warning   compiler\nativeGen\X86\CodeGen.hs:1986        TXT3  Line Too Long
Warning   compiler\nativeGen\X86\CodeGen.hs:2164        TXT3  Line Too Long
Warning   compiler\nativeGen\X86\CodeGen.hs:3337        TXT3  Line Too Long
Unit
No Unit Test Coverage
Build Status
Buildable 22858
Build 53787: [GHC] Linux/amd64: Continuous Integration
Build 53786: [GHC] OSX/amd64: Continuous Integration
Build 53785: [GHC] Windows/amd64: Continuous Integration
Build 53784: arc lint + arc unit
AndreasK planned changes to this revision.May 23 2018, 6:24 PM

As it stands, performance is about the same, but compile times are slightly worse.

There are still some knobs to turn here, and the approach should be more amenable to taking advantage of control flow information from e.g. profiling.
Hopefully I will be able to base some other work on it later.

AndreasK updated this revision to Diff 16519.May 24 2018, 2:49 AM

Some adjustments to the weights.

I ran nofib overnight, and the results actually look quite good after a bit of tweaking!

IgnoreCal is this patch. Likely04 is this patch on top of D4327.

---------------------------------------------------------------------------------------------------------------------------
        Program           Size      Size    Allocs    Allocs   Runtime   Runtime   Elapsed   Elapsed  TotalMem  TotalMem
                     IgnoreCal  Likely04 IgnoreCal  Likely04 IgnoreCal  Likely04 IgnoreCal  Likely04 IgnoreCal  Likely04
---------------------------------------------------------------------------------------------------------------------------
            Min          -0.1%     -0.1%     -0.0%      0.0%     -3.1%     -5.3%     -3.0%     -6.2%     -1.7%     -0.7%
            Max          +0.0%     +0.0%      0.0%     +0.0%     +1.6%     +7.3%     +1.6%     +7.0%     +0.5%     +0.8%
 Geometric Mean          -0.0%     +0.0%     -0.0%     -0.0%     -0.3%     -1.1%     -0.5%     -1.2%     -0.0%     +0.0%

However I want to get this tested on at least 1-2 other CPUs before I put more work into getting this ready to land.

Full nofib log:

AndreasK updated this revision to Diff 16569.May 29 2018, 6:20 AM
  • Only look at reachable graphs when constructing the CFG.
  • Import cleanup.

cmmToCmm finds some dead blocks and removes conditional jumps to them.
However, the dead blocks themselves are still members of the map used by CmmGraph.

By using revPostOrder we avoid adding these dead blocks to the CFG.
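A minimal sketch of the idea, using a toy graph representation rather than GHC's real CmmGraph and revPostOrder:

```haskell
import qualified Data.Map.Strict as M
import qualified Data.Set as S

type Label = Int
type Graph = M.Map Label [Label]   -- block label -> successor labels

-- Reverse postorder DFS from the entry block. Blocks that cmmToCmm made
-- unreachable may still sit in the block map, but they are never visited
-- here, so they never end up in the result.
revPostOrder :: Graph -> Label -> [Label]
revPostOrder g entry = fst (go ([], S.empty) entry)
  where
    go st@(acc, seen) l
      | l `S.member` seen = st
      | otherwise =
          let (acc', seen') =
                foldl go (acc, S.insert l seen) (M.findWithDefault [] l g)
          in (l : acc', seen')   -- prepend after children: reverse postorder

-- Build a CFG containing only the blocks reachable from the entry.
reachableCfg :: Graph -> Label -> Graph
reachableCfg g entry =
  M.fromList [ (l, filter (`S.member` live) (M.findWithDefault [] l g))
             | l <- S.toList live ]
  where live = S.fromList (revPostOrder g entry)
```

With a block map like `M.fromList [(1,[2]),(2,[]),(99,[1])]` and entry `1`, the dead block `99` is simply never visited and so never appears in the constructed CFG.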

AndreasK updated this revision to Diff 16591.May 30 2018, 4:50 PM
  • Performance improvements
AndreasK updated this revision to Diff 16626.Jun 1 2018, 5:16 AM
  • Trim edges in chain linking passes.
AndreasK updated this revision to Diff 17038.Jun 21 2018, 1:00 PM
  • Rebase.

I've run nofib on a Xeon E3-1220, which jmct provided for a benchmark build, and the results seem to hold up.

  • loglayout is just this patch
  • loglikely is D4327
  • likeAndLa is this patch plus D4327
NoFib Results

--------------------------------------------------------------------------------
        Program           Size      Size      Size    Allocs    Allocs    Allocs   Runtime   Runtime   Runtime   Elapsed   Elapsed   Elapsed  TotalMem  TotalMem  TotalMem
                     loglayout loglikely likeAndLa loglayout loglikely likeAndLa loglayout loglikely likeAndLa loglayout loglikely likeAndLa loglayout loglikely likeAndLa
--------------------------------------------------------------------------------
             CS          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.195     0.178     0.191     0.195     0.178     0.192      0.0%      0.0%      0.0%
            CSD          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     +1.6%     -5.5%     -4.3%     +1.6%     -5.4%     -4.3%      0.0%      0.0%      0.0%
             FS          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     +4.3%     -3.9%     -5.6%     +4.3%     -3.9%     -5.6%      0.0%      0.0%      0.0%
              S          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     -2.2%     -2.2%     -1.5%     -2.2%     -2.3%     -1.6%      0.0%      0.0%      0.0%
             VS          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     -2.7%     -1.8%     -1.8%     -2.7%     -1.8%     -1.8%      0.0%      0.0%      0.0%
            VSD          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.006     0.009     0.009     0.006     0.009     0.009      0.0%      0.0%      0.0%
            VSM          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%    -22.5%    -28.6%    -28.0%    -22.6%    -28.6%    -28.0%      0.0%      0.0%      0.0%
           anna          -0.2%     +0.0%     -0.1%      0.0%      0.0%      0.0%     0.067     0.067     0.068     0.067     0.067     0.068      0.0%      0.0%      0.0%
           ansi          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.000     0.000     0.000     0.000     0.000     0.000      0.0%      0.0%      0.0%
           atom          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.187     0.191     0.191     0.187     0.191     0.190      0.0%      0.0%      0.0%
         awards          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.000     0.000     0.000     0.000     0.000     0.000      0.0%      0.0%      0.0%
         banner          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.000     0.000     0.000     0.000     0.000     0.000      0.0%      0.0%      0.0%
     bernouilli          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.113     0.111     0.110     0.113     0.111     0.110      0.0%      0.0%      0.0%
   binary-trees          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     +1.1%     +4.9%     -1.5%     +1.1%     +4.9%     -1.5%      0.0%      0.0%      0.0%
          boyer          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.026     0.025     0.025     0.026     0.025     0.025      0.0%      0.0%      0.0%
         boyer2          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.005     0.005     0.005     0.005     0.005     0.005      0.0%      0.0%      0.0%
           bspt          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     0.005     0.005     0.005     0.005     0.005     0.005      0.0%      0.0%      0.0%
      cacheprof          -0.1%     +0.0%     -0.1%     -0.1%     -0.1%     -0.0%     -0.4%     -4.5%     -2.3%     -0.4%     -4.5%     -2.3%     -0.5%     -0.5%     +0.5%
       calendar          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.000     0.000     0.000     0.000     0.000     0.000      0.0%      0.0%      0.0%
       cichelli          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.054     0.050     0.050     0.054     0.050     0.050      0.0%      0.0%      0.0%
        circsim          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     +0.8%     +0.0%     +0.6%     +0.8%     -0.0%     +0.6%      0.0%      0.0%      0.0%
       clausify          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.024     0.022     0.021     0.024     0.022     0.021      0.0%      0.0%      0.0%
  comp_lab_zift          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     0.121     0.120     0.118     0.121     0.120     0.118      0.0%      0.0%      0.0%
       compress          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.087     0.085     0.084     0.087     0.085     0.084      0.0%      0.0%      0.0%
      compress2          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.091     0.091     0.090     0.091     0.091     0.090      0.0%      0.0%      0.0%
    constraints          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     -0.2%     -0.5%     -0.9%     -0.2%     -0.5%     -0.9%      0.0%      0.0%      0.0%
   cryptarithm1          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     -2.4%     -6.4%     -6.0%     -2.4%     -6.4%     -6.0%      0.0%      0.0%      0.0%
   cryptarithm2          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.005     0.005     0.005     0.005     0.005     0.005      0.0%      0.0%      0.0%
            cse          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.001     0.001     0.001     0.001     0.001     0.001      0.0%      0.0%      0.0%
   digits-of-e1          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     -1.6%     -7.0%     -7.8%     -1.6%     -7.1%     -7.8%      0.0%      0.0%      0.0%
   digits-of-e2          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     -0.9%     -8.3%     -7.8%     -0.9%     -8.3%     -7.8%      0.0%      0.0%      0.0%
          eliza          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.000     0.000     0.000     0.000     0.000     0.000      0.0%      0.0%      0.0%
          event          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.100     0.097     0.098     0.100     0.097     0.098      0.0%      0.0%      0.0%
    exact-reals          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     +0.3%     -1.6%     -2.6%     +0.3%     -1.6%     -2.6%      0.0%      0.0%      0.0%
         exp3_8          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.146     0.146     0.146     0.146     0.146     0.146      0.0%      0.0%      0.0%
         expert          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.000     0.000     0.000     0.000     0.000     0.000      0.0%      0.0%      0.0%
 fannkuch-redux          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     -1.8%     -3.1%     -3.8%     -1.8%     -3.1%     -3.8%      0.0%      0.0%      0.0%
          fasta          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     -1.3%     -1.6%     -1.4%     -1.3%     -1.6%     -1.4%      0.0%      0.0%      0.0%
            fem          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     0.013     0.012     0.013     0.013     0.012     0.013      0.0%      0.0%      0.0%
            fft          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     0.022     0.022     0.022     0.022     0.022     0.022      0.0%      0.0%      0.0%
           fft2          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     0.032     0.032     0.032     0.032     0.032     0.032      0.0%      0.0%      0.0%
       fibheaps          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.015     0.015     0.015     0.015     0.015     0.015      0.0%      0.0%      0.0%
           fish          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.006     0.006     0.006     0.006     0.006     0.006      0.0%      0.0%      0.0%
          fluid          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     0.005     0.004     0.004     0.005     0.004     0.004      0.0%      0.0%      0.0%
         fulsom          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     0.180     0.180     0.180     0.180     0.180     0.180      0.0%      0.0%      0.0%
         gamteb          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.025     0.025     0.025     0.025     0.025     0.025      0.0%      0.0%      0.0%
            gcd          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.027     0.026     0.026     0.027     0.026     0.026      0.0%      0.0%      0.0%
    gen_regexps          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.000     0.000     0.000     0.000     0.000     0.000      0.0%      0.0%      0.0%
         genfft          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.020     0.021     0.020     0.020     0.021     0.020      0.0%      0.0%      0.0%
             gg          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     0.007     0.007     0.007     0.007     0.007     0.007      0.0%      0.0%      0.0%
           grep          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     0.000     0.000     0.000     0.000     0.000     0.000      0.0%      0.0%      0.0%
         hidden          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     +4.8%     +0.0%     +0.5%     +5.0%     -0.1%     +0.7%      0.0%      0.0%      0.0%
            hpg          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.115     0.118     0.115     0.115     0.118     0.115      0.0%      0.0%      0.0%
            ida          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.057     0.055     0.056     0.057     0.055     0.056      0.0%      0.0%      0.0%
          infer          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     0.033     0.033     0.033     0.033     0.033     0.033      0.0%      0.0%      0.0%
        integer          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     -0.2%     -3.4%     -0.1%     -0.2%     -3.3%     -0.1%      0.0%      0.0%      0.0%
      integrate          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.093     0.087     0.083     0.093     0.087     0.083      0.0%      0.0%      0.0%
   k-nucleotide          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     +1.3%     -1.1%     -2.3%     +1.3%     -1.1%     -2.3%      0.0%      0.0%      0.0%
          kahan          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     +1.2%     +0.2%     +0.6%     +1.2%     +0.2%     +0.6%      0.0%      0.0%      0.0%
        knights          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.003     0.002     0.002     0.003     0.002     0.002      0.0%      0.0%      0.0%
         lambda          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     -2.7%     -5.4%     -4.3%     -2.7%     -5.4%     -4.3%      0.0%      0.0%      0.0%
     last-piece          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     +0.6%     -0.2%     -1.0%     +0.6%     -0.3%     -1.0%      0.0%      0.0%      0.0%
           lcss          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     -0.6%     -0.8%     -1.0%     -0.6%     -0.9%     -1.0%      0.0%      0.0%      0.0%
           life          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.155     0.155     0.154     0.155     0.155     0.154      0.0%      0.0%      0.0%
           lift          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.001     0.001     0.001     0.001     0.001     0.001      0.0%      0.0%      0.0%
         linear          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     -0.3%     -1.5%     -0.9%     -0.3%     -1.5%     -0.9%      0.0%      0.0%      0.0%
      listcompr          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.064     0.062     0.062     0.064     0.062     0.062      0.0%      0.0%      0.0%
       listcopy          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.067     0.066     0.066     0.067     0.066     0.066      0.0%      0.0%      0.0%
       maillist          -0.1%     +0.0%     -0.0%      0.0%     +0.0%     +0.0%     0.042     0.042     0.042     0.043     0.043     0.043    +16.7%    +36.7%    +40.0%
         mandel          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     0.045     0.043     0.043     0.045     0.043     0.043      0.0%      0.0%      0.0%
        mandel2          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.002     0.002     0.002     0.002     0.002     0.002      0.0%      0.0%      0.0%
           mate          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     -1.2%     -6.4%     -4.1%     -1.2%     -6.4%     -4.1%      0.0%      0.0%      0.0%
        minimax          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.001     0.001     0.001     0.001     0.001     0.001      0.0%      0.0%      0.0%
        mkhprog          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.001     0.001     0.001     0.001     0.001     0.001      0.0%      0.0%      0.0%
     multiplier          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.060     0.062     0.063     0.060     0.062     0.063      0.0%      0.0%      0.0%
         n-body          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     -0.0%     +0.0%     -0.0%     -0.0%     +0.0%     -0.1%      0.0%      0.0%      0.0%
       nucleic2          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     0.048     0.052     0.049     0.048     0.052     0.049      0.0%      0.0%      0.0%
           para          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     0.188     0.191     0.186     0.188     0.191     0.186      0.0%      0.0%      0.0%
      paraffins          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.079     0.078     0.078     0.079     0.078     0.078      0.0%      0.0%      0.0%
         parser          -0.1%     -0.0%     -0.1%      0.0%      0.0%      0.0%     0.019     0.018     0.018     0.019     0.018     0.018      0.0%      0.0%      0.0%
        parstof          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     0.003     0.003     0.003     0.003     0.003     0.003      0.0%      0.0%      0.0%
            pic          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     0.004     0.004     0.004     0.004     0.004     0.004      0.0%      0.0%      0.0%
       pidigits          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     -0.1%     -0.2%     -0.3%     -0.3%     -0.1%     -0.3%      0.0%      0.0%      0.0%
          power          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     -0.3%     -2.7%     -1.4%     -0.2%     -2.7%     -1.4%      0.0%      0.0%      0.0%
         pretty          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.000     0.000     0.000     0.000     0.000     0.000      0.0%      0.0%      0.0%
         primes          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.045     0.044     0.045     0.045     0.044     0.045      0.0%      0.0%      0.0%
      primetest          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.074     0.072     0.071     0.074     0.072     0.071      0.0%      0.0%      0.0%
         prolog          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     0.001     0.001     0.001     0.001     0.001     0.001      0.0%      0.0%      0.0%
         puzzle          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.082     0.081     0.081     0.082     0.081     0.081      0.0%      0.0%      0.0%
         queens          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.008     0.008     0.008     0.008     0.008     0.008      0.0%      0.0%      0.0%
        reptile          -0.1%      0.0%     -0.1%      0.0%      0.0%      0.0%     0.007     0.007     0.007     0.007     0.007     0.007      0.0%      0.0%      0.0%
reverse-complem          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.077     0.074     0.073     0.077     0.074     0.073      0.0%      0.0%      0.0%
        rewrite          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.012     0.011     0.011     0.012     0.011     0.011      0.0%      0.0%      0.0%
           rfib          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.010     0.010     0.010     0.010     0.010     0.010      0.0%      0.0%      0.0%
            rsa          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.016     0.016     0.016     0.016     0.016     0.015      0.0%      0.0%      0.0%
            scc          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.000     0.000     0.000     0.000     0.000     0.000      0.0%      0.0%      0.0%
          sched          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.013     0.012     0.012     0.013     0.012     0.012      0.0%      0.0%      0.0%
            scs          -0.2%     -0.0%     -0.1%      0.0%      0.0%      0.0%     +0.3%     -3.5%     -2.5%     +0.2%     -3.5%     -2.6%      0.0%      0.0%      0.0%
         simple          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     0.122     0.126     0.123     0.122     0.126     0.123      0.0%      0.0%      0.0%
          solid          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.085     0.083     0.082     0.085     0.083     0.082      0.0%      0.0%      0.0%
        sorting          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.001     0.001     0.001     0.001     0.001     0.001      0.0%      0.0%      0.0%
  spectral-norm          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%      0.0%     -0.0%     -0.0%     +0.0%     -0.0%     -0.0%      0.0%      0.0%      0.0%
         sphere          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     0.031     0.031     0.031     0.031     0.031     0.031      0.0%      0.0%      0.0%
         symalg          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.006     0.006     0.006     0.006     0.006     0.006      0.0%      0.0%      0.0%
            tak          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.007     0.007     0.007     0.007     0.007     0.007      0.0%      0.0%      0.0%
      transform          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     +0.6%     -3.1%     -2.5%     +0.6%     -3.1%     -2.5%      0.0%      0.0%      0.0%
       treejoin          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.090     0.090     0.090     0.090     0.090     0.090      0.0%      0.0%      0.0%
      typecheck          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.155     0.154     0.154     0.155     0.154     0.155      0.0%      0.0%      0.0%
        veritas          -0.1%     +0.0%     -0.1%      0.0%      0.0%      0.0%     0.001     0.001     0.001     0.001     0.001     0.001      0.0%      0.0%      0.0%
           wang          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.066     0.065     0.063     0.066     0.065     0.063      0.0%      0.0%      0.0%
      wave4main          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.180     0.178     0.179     0.180     0.178     0.179      0.0%      0.0%      0.0%
   wheel-sieve1          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     +1.1%     -0.7%     +0.7%     +1.1%     -0.7%     +0.7%      0.0%      0.0%      0.0%
   wheel-sieve2          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.133     0.133     0.134     0.133     0.133     0.134      0.0%      0.0%      0.0%
           x2n1          -0.1%     +0.0%     -0.0%      0.0%      0.0%      0.0%     0.002     0.002     0.002     0.002     0.002     0.002      0.0%      0.0%      0.0%
--------------------------------------------------------------------------------
            Min          -0.2%     -0.0%     -0.1%     -0.1%     -0.1%     -0.0%    -22.5%    -28.6%    -28.0%    -22.6%    -28.6%    -28.0%     -0.5%     -0.5%      0.0%
            Max          -0.1%     +0.0%     -0.0%      0.0%     +0.0%     +0.0%     +4.8%     +4.9%     +0.7%     +5.0%     +4.9%     +0.7%    +16.7%    +36.7%    +40.0%
 Geometric Mean          -0.1%     +0.0%     -0.1%     -0.0%     -0.0%     -0.0%     -0.9%     -3.4%     -3.2%     -0.9%     -3.4%     -3.2%     +0.1%     +0.3%     +0.3%

The only benchmark where memory differed is maillist, which I attribute to its nondeterministic GC behaviour.
That the likelihood patch alone performs so well is a bit surprising, given that this isn't the case on Skylake.

I will rerun the benchmarks to make sure these results are sane, but it's good to see that this is still a win after rebasing and on a different CPU.

jmct added a comment.Jun 27 2018, 11:02 PM

Here are the results on the Ryzen 7 I am able to access. I ran it on a quiet headless machine with NoFibRuns=30. The results differ a bit, though that could be noise.

NoFib Results

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
             CS          -0.2%      0.0%     0.150     0.150      0.0%
            CSD          -0.3%      0.0%    -11.5%    -11.5%      0.0%
             FS          -0.2%      0.0%     -1.2%     -1.2%      0.0%
              S          -0.3%      0.0%     +0.8%     +0.8%      0.0%
             VS          -0.3%      0.0%     -0.4%     -0.3%      0.0%
            VSD          -0.3%      0.0%     0.008     0.008      0.0%
            VSM          -0.2%      0.0%    -12.9%    -12.9%      0.0%
           anna          -0.5%      0.0%     0.070     0.070      0.0%
           ansi          -0.4%      0.0%     0.000     0.000      0.0%
           atom          -0.4%      0.0%     0.133     0.133      0.0%
         awards          -0.3%      0.0%     0.000     0.000      0.0%
         banner          -0.2%      0.0%     0.000     0.000      0.0%
     bernouilli          -0.4%      0.0%     0.095     0.095      0.0%
   binary-trees          -0.4%      0.0%     +4.5%     +4.5%      0.0%
          boyer          -0.3%      0.0%     0.027     0.027      0.0%
         boyer2          -0.3%      0.0%     0.005     0.005      0.0%
           bspt          -0.3%      0.0%     0.005     0.005      0.0%
      cacheprof          -0.4%     +0.1%     -1.5%     -1.5%     -0.9%
       calendar          -0.3%      0.0%     0.000     0.000      0.0%
       cichelli          -0.3%      0.0%     0.055     0.055      0.0%
        circsim          -0.4%      0.0%     +5.0%     +5.0%      0.0%
       clausify          -0.3%      0.0%     0.023     0.023      0.0%
  comp_lab_zift          -0.4%      0.0%     0.107     0.107      0.0%
       compress          -0.3%      0.0%     0.074     0.074      0.0%
      compress2          -0.2%      0.0%     0.078     0.078      0.0%
    constraints          -0.3%      0.0%     -0.9%     -0.9%      0.0%
   cryptarithm1          -0.3%      0.0%     +1.7%     +1.7%      0.0%
   cryptarithm2          -0.3%      0.0%     0.006     0.006      0.0%
            cse          -0.3%      0.0%     0.001     0.001      0.0%
   digits-of-e1          -0.4%      0.0%     -0.2%     -0.2%      0.0%
   digits-of-e2          -0.4%      0.0%     +1.7%     +1.7%      0.0%
          eliza          -0.3%      0.0%     0.000     0.000      0.0%
          event          -0.3%      0.0%     0.091     0.091      0.0%
    exact-reals          -0.4%      0.0%     -0.1%     -0.2%      0.0%
         exp3_8          -0.3%      0.0%     0.097     0.097      0.0%
         expert          -0.3%      0.0%     0.000     0.000      0.0%
 fannkuch-redux          -0.3%      0.0%     -2.6%     -2.6%      0.0%
          fasta          -0.3%      0.0%     0.168     0.168      0.0%
            fem          -0.4%      0.0%     0.012     0.012      0.0%
            fft          -0.4%      0.0%     0.020     0.021      0.0%
           fft2          -0.4%      0.0%     0.027     0.027      0.0%
       fibheaps          -0.3%      0.0%     0.015     0.015      0.0%
           fish          -0.3%      0.0%     0.006     0.006      0.0%
          fluid          -0.5%      0.0%     0.005     0.005      0.0%
         fulsom          -0.4%      0.0%     0.152     0.152      0.0%
         gamteb          -0.4%      0.0%     0.025     0.025      0.0%
            gcd          -0.3%      0.0%     0.026     0.026      0.0%
    gen_regexps          -0.3%      0.0%     0.000     0.000      0.0%
         genfft          -0.3%      0.0%     0.021     0.021      0.0%
             gg          -0.4%      0.0%     0.007     0.007      0.0%
           grep          -0.3%      0.0%     0.000     0.000      0.0%
         hidden          -0.4%      0.0%     -0.1%     -0.1%      0.0%
            hpg          -0.4%      0.0%     0.058     0.058      0.0%
            ida          -0.3%      0.0%     0.058     0.058      0.0%
          infer          -0.5%      0.0%     0.034     0.034      0.0%
        integer          -0.3%      0.0%     -1.6%     -1.6%      0.0%
      integrate          -0.3%      0.0%     0.066     0.066      0.0%
   k-nucleotide          -0.4%     +0.0%     -8.7%     -8.7%      0.0%
          kahan          -0.3%      0.0%     0.175     0.175      0.0%
        knights          -0.3%      0.0%     0.003     0.003      0.0%
         lambda          -0.3%      0.0%     -1.5%     -1.5%      0.0%
     last-piece          -0.3%      0.0%     +3.3%     +3.3%      0.0%
           lcss          -0.3%      0.0%     -0.4%     -0.4%      0.0%
           life          -0.3%      0.0%     0.131     0.131      0.0%
           lift          -0.3%      0.0%     0.001     0.001      0.0%
         linear          -0.4%      0.0%     -1.2%     -1.2%      0.0%
      listcompr          -0.2%      0.0%     0.058     0.059      0.0%
       listcopy          -0.2%      0.0%     0.061     0.062      0.0%
       maillist          -0.3%     -0.0%     0.032     0.032    +10.5%
         mandel          -0.4%      0.0%     0.043     0.043      0.0%
        mandel2          -0.3%      0.0%     0.002     0.002      0.0%
           mate          -0.4%      0.0%     -2.9%     -2.9%      0.0%
        minimax          -0.3%      0.0%     0.002     0.002      0.0%
        mkhprog          -0.3%      0.0%     0.001     0.001      0.0%
     multiplier          -0.3%      0.0%     0.069     0.069      0.0%
         n-body          -0.4%      0.0%     +0.1%     +0.1%      0.0%
       nucleic2          -0.4%      0.0%     0.054     0.054      0.0%
           para          -0.3%      0.0%     0.172     0.172      0.0%
      paraffins          -0.3%      0.0%     0.067     0.068      0.0%
         parser          -0.4%      0.0%     0.018     0.018      0.0%
        parstof          -0.3%      0.0%     0.004     0.004      0.0%
            pic          -0.4%      0.0%     0.003     0.003      0.0%
       pidigits          -0.4%      0.0%     +0.7%     +0.8%      0.0%
          power          -0.4%      0.0%     0.190     0.191      0.0%
         pretty          -0.3%      0.0%     0.000     0.000      0.0%
         primes          -0.3%      0.0%     0.033     0.033      0.0%
      primetest          -0.4%      0.0%     0.060     0.060      0.0%
         prolog          -0.3%      0.0%     0.001     0.001      0.0%
         puzzle          -0.2%      0.0%     0.098     0.098      0.0%
         queens          -0.3%      0.0%     0.008     0.008      0.0%
        reptile          -0.4%      0.0%     0.007     0.007      0.0%
reverse-complem          -0.3%      0.0%     0.076     0.076      0.0%
        rewrite          -0.3%      0.0%     0.014     0.014      0.0%
           rfib          -0.4%      0.0%     0.011     0.011      0.0%
            rsa          -0.4%      0.0%     0.016     0.016      0.0%
            scc          -0.3%      0.0%     0.000     0.000      0.0%
          sched          -0.3%      0.0%     0.015     0.016      0.0%
            scs          -0.6%      0.0%     +0.4%     +0.3%      0.0%
         simple          -0.5%      0.0%     0.099     0.100      0.0%
          solid          -0.4%      0.0%     0.079     0.079      0.0%
        sorting          -0.3%      0.0%     0.001     0.001      0.0%
  spectral-norm          -0.4%      0.0%     -0.3%     -0.3%      0.0%
         sphere          -0.4%      0.0%     0.034     0.034      0.0%
         symalg          -0.4%      0.0%     0.005     0.005      0.0%
            tak          -0.3%      0.0%     0.008     0.008      0.0%
      transform          -0.4%      0.0%     0.192     0.192      0.0%
       treejoin          -0.3%      0.0%     0.077     0.077      0.0%
      typecheck          -0.3%      0.0%     0.156     0.156      0.0%
        veritas          -0.4%      0.0%     0.001     0.001      0.0%
           wang          -0.4%      0.0%     0.058     0.058      0.0%
      wave4main          -0.3%      0.0%     0.174     0.175      0.0%
   wheel-sieve1          -0.3%      0.0%     0.129     0.129      0.0%
   wheel-sieve2          -0.3%      0.0%     0.102     0.103      0.0%
           x2n1          -0.4%      0.0%     0.001     0.001      0.0%
--------------------------------------------------------------------------------
            Min          -0.6%     -0.0%    -12.9%    -12.9%     -0.9%
            Max          -0.2%     +0.1%     +5.0%     +5.0%    +10.5%
 Geometric Mean          -0.3%     +0.0%     -1.2%     -1.2%     +0.1%
In D4726#135105, @jmct wrote:

Here are the results on the Ryzen 7 I am able to access. I ran it on a quiet headless machine with NoFibRuns=30. The results differ a bit. It could be noise though.

NoFib Results

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
            Min          -0.6%     -0.0%    -12.9%    -12.9%     -0.9%
            Max          -0.2%     +0.1%     +5.0%     +5.0%    +10.5%
 Geometric Mean          -0.3%     +0.0%     -1.2%     -1.2%     +0.1%

So currently we measured:

  • -0.3% on Skylake
  • -0.9% on Xeon E3-1220
  • -1.2% on Ryzen.

I would say this seems workable!

I also have two more ideas I want to look into which might improve things further.

In particular:

  • We could run a basic loop detection algorithm over the CFG and bump the priority of backedges to make sure loop bodies are laid out sequentially.
  • Currently it's somewhat arbitrary which edge is processed first if two have the same priority. Maybe we could gain some performance there as well.
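To make the first idea concrete, here is a rough sketch of backedge detection via DFS. This is not code from the patch; `Node`, `Graph` and `backedges` are placeholder names, and `Data.Map`/`Data.Set` stand in for the `LabelMap`/`LabelSet` the real implementation would use over `BlockId`s:

```haskell
import qualified Data.Map.Strict as M
import qualified Data.Set as S

type Node  = Int
type Graph = M.Map Node [Node]   -- block id -> successor block ids

-- An edge (u, v) is a backedge when v is still on the DFS path at the
-- moment we look at u's successors; v is then a loop header whose
-- incoming edges could be given a higher weight.
backedges :: Graph -> Node -> [(Node, Node)]
backedges g root = snd (dfs S.empty root (S.empty, []))
  where
    dfs onPath n st@(visited, acc)
      | n `S.member` visited = st
      | otherwise =
          let onPath' = S.insert n onPath
              succs   = M.findWithDefault [] n g
              acc'    = [ (n, s) | s <- succs, s `S.member` onPath' ] ++ acc
          in foldl (\st' s -> dfs onPath' s st') (S.insert n visited, acc') succs
```

For a simple loop `1 -> 2 -> 1`, this reports `(2,1)` as the backedge and `1` as the loop header. Note this path-based scheme only finds natural loops reliably on reducible CFGs, which is fine for a first cut.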
jmct added a comment.EditedJun 30 2018, 8:29 PM

This is cool work! I have a few questions,

Currently, the weights in the control flow graph, lines 242-251 of compiler/nativeGen/CFG.hs, here:

compiler/nativeGen/CFG.hs
  --Penalize info tables since they prevent eliminating
  --the jump
  | mapMember dest info -> [((bid,dest),60)]
  | otherwise           -> [((bid,dest),100)]
CmmCondBranch _c t f l 
  -- Prefer false branch to keep in line with old
  -- layout algorithm.
  | l == Nothing -> [((bid,f),51),((bid,t),49)]
  | l == Just True -> [((bid,f),10),((bid,t),90)]
  | l == Just False -> [((bid,t),10),((bid,f),90)]

What's the justification for the different weights? I can see the intuition, but I don't have a good understanding of how changing those weights would affect the results of the algorithm. Direct jumps being 100 makes intuitive sense, but what about the others? For example, why 10/90 and not 20/80?

Another thought/concern:

The performance differences between the architectures are interesting. My first guess was that maybe it was 'more cache == better performance', so I looked at the available cache on the Xeon and the Ryzen and got the following (I think this is mostly an L1 cache thing, but I've included all the cache lines just in case):

Xeon
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              8192K
Ryzen
L1d cache:           32K
L1i cache:           64K
L2 cache:            512K
L3 cache:            8192K

So at first glance this would support my hypothesis, since the Ryzen has double the L1i cache of the Xeon. But what does the Skylake look like? I would like to have a better understanding of what is causing the differences in the performance numbers. If the new code layout is rarely a performance regression, but gets more beneficial as the L1i cache scales, it'd be worth getting into ghc-mainline (IMO). Another experiment worth trying would be running it on a machine with very little cache and seeing how that goes.

As for your two paths forward, I think the first one seems more immediately beneficial, especially since a lot of performance critical code ends up having an inner loop.

It would be nice for you to link to the wiki page you made in the note [Chain based CFG serialization]. Your blog post and the wiki page were really helpful in understanding what was going on, so thank you for that! Let's make sure future GHC devs can find them :D

In D4726#135331, @jmct wrote:

This is cool work! I have a few questions,

What's the justification for the different weights? I can see the intuition, but I don't have a good understanding of how changing those weights would affect the results of the algorithm. Direct jumps being 100 makes intuitive sense, but what about the others? For example, why 10/90 and not 20/80?

The numbers currently only matter relative to each other.
If we can branch a->b or a->c, what matters is whether edgeWeight a b > edgeWeight a c.
Whether the difference is 1 or 100 currently doesn't factor into it.

why 10/90 and not 20/80

Currently, when we have a branching block with a likely annotation, in 99% of all cases it will be a heap/stack check or a loop.
Both of these would be better described as "once in a blue moon" than merely unlikely.
Preferring to stick with multiples of 10, I chose 10 instead of 1 and so arrived at 10/90.

The numbers currently don't represent a cost model, just a more-is-better ranking.
This currently works fine. We don't have real data on how likely branches really are anyway.
So that leaves us with stack/heap checks, info tables and loops, where we can say something. And for these the relative ordering is good enough.
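To illustrate that only the relative ordering matters, here is a minimal sketch. `preferredSuccessor` is a hypothetical helper, not the patch's actual API; scaling every weight by the same factor leaves the chosen fallthrough successor unchanged:

```haskell
import Data.List (maximumBy)
import Data.Ord (comparing)

type BlockId    = Int
type EdgeWeight = Int

-- The successor with the greatest weight wins the fallthrough position;
-- only the relative order of the weights matters, not their magnitude.
preferredSuccessor :: BlockId -> [((BlockId, BlockId), EdgeWeight)] -> Maybe BlockId
preferredSuccessor bid edges =
  case [ (to, w) | ((from, to), w) <- edges, from == bid ] of
    []    -> Nothing
    cands -> Just (fst (maximumBy (comparing snd) cands))
```

So `preferredSuccessor 1 [((1,2),51),((1,3),49)]` and `preferredSuccessor 1 [((1,2),5100),((1,3),4900)]` pick the same block, which is why 10/90 versus 20/80 makes no difference today.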

I did write it using weights instead of a strict ordering so that if we ever profile branches or similar
we can just rethink the weight assignment and the algorithm itself will still work.

But you are right, this isn't documented, and I will look into adding more info there.

Another thought/concern:

The performance differences between the architectures are interesting. My first guess was that maybe it was 'more cache == better performance', so I looked at the available cache on the Xeon and the Ryzen and got the following (I think this is mostly an L1 cache thing, but I've included all the cache lines just in case):

Xeon
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              8192K
Ryzen
L1d cache:           32K
L1i cache:           64K
L2 cache:            512K
L3 cache:            8192K

So at first glance this would support my hypothesis, since the Ryzen has double the L1i cache of the Xeon. But what does the Skylake look like? I would like to have a better understanding of what is causing the differences in the performance numbers. If the new code layout is rarely a performance regression, but gets more beneficial as the L1i cache scales, it'd be worth getting into ghc-mainline (IMO). Another experiment worth trying would be running it on a machine with very little cache and seeing how that goes.

Skylake has the same cache sizes as the Xeon (and since the Xeon is Sandy Bridge based, it's pretty similar in general).
I've also run benchmarks on Ben's Devil's Canyon (~Haswell) and the results were about the same as for the Skylake, but with more noise, so I didn't include them here.

I find it highly unlikely that the improvements will scale with larger caches.
Looking only at the cache, this patch improves the hit rate. But so does a larger cache!
If anything I would expect the difference to get smaller as cache sizes grow.

As for the difference between the Intel chips, maybe Skylake's larger reorder buffer
reduces the impact of I-cache misses. My Skylake machine also uses faster DDR4, so that might be a factor as well.
In the end the results are not THAT different.

I assume the Intel/AMD difference stems from differences in their microarchitectures:

  • Different prefetch algorithms.
  • Ryzen has not only a larger instruction cache but also a larger uOp cache.
  • Ryzen has a slightly larger delay when reading from L2.
  • And many more differences which don't seem immediately relevant.

Not sure if it's relevant, but I also measured the Skylake under Windows.
While I don't expect this to make a big difference, it could be a contributing factor.

Overall they are close enough that I won't worry about it too much.
For a runtime measurement taken by two people over 4 systems, two OSes and 4 CPUs, it's very consistent IMO.

As for your two paths forward, I think the first one seems more immediately beneficial, especially since a lot of performance critical code ends up having an inner loop.

Agreed.

It would be nice for you to link to the wiki page you made in the note [Chain based CFG serialization]. Your blog post and the wiki page were really helpful in understanding what was going on, so thank you for that! Let's make sure future GHC devs can find them :D

Good idea.

AndreasK updated this revision to Diff 17282.Jul 12 2018, 10:20 AM
  • Trim edges in chain linking passes.
  • Explain branch weights, add backedge detection.
  • Refactor where we look for loops
  • Refactor some aspects of the layout algorithm and code.
  • Change CFG weights, refactor CFG creation slightly.
  • Ignore info tables
  • Use edge weights for old algorithm too
AndreasK updated this revision to Diff 17333.Jul 15 2018, 3:12 AM
  • Allow to dynamically specify cfg weights for benchmarking (temporary)

    This allows me to easily benchmark a wider array of configurations.
jmct added a comment.EditedJul 30 2018, 10:00 AM

The one question I have remaining is whether using unsafeGlobalDynFlags is the right call here. I've left a note on the line in question to that effect.

Otherwise, just a bit of comment and TODO cleanup (which I think you've already done on your local branch during our call), getting it past Harbormaster, and then it LGTM.

compiler/nativeGen/CFG.hs
380

Either the flags should be passed in, or (if passing them in turns out to be costly), make a note about why this decision was made.

AndreasK updated this revision to Diff 17518.Jul 30 2018, 10:25 AM
  • Only print nodes not covered by edges explicitly
  • Small improvements from jmcts review
  • Assert include fixup
  • Remove unused bindings
AndreasK updated this revision to Diff 17659.Aug 14 2018, 9:27 AM
  • Be more aggressive in removing dead code.
  • Add last2 to get the last two elements of a list efficiently.
  • Invert cond branch jump order at the asm level in order to eliminate one of the jumps.
  • Decouple conditional branch weights from Cmm branch order.
AndreasK updated this revision to Diff 17672.Aug 15 2018, 7:24 AM
  • Remove unused parameters/imports.
AndreasK updated this revision to Diff 17701.Aug 20 2018, 8:39 AM
  • Line widths and minor refactorings that go along with it.
  • Update flag description, make vanilla default on old layout.
  • Fix info table check
  • Disable new layout code on unsupported target archs
AndreasK updated this revision to Diff 17702.Aug 20 2018, 9:25 AM
  • Remove unused import
AndreasK updated this revision to Diff 17703.Aug 20 2018, 11:26 AM
  • Rename test
AndreasK updated this revision to Diff 17715.Aug 21 2018, 8:42 AM
  • Update test allocations
AndreasK updated this revision to Diff 17757.Aug 23 2018, 3:48 AM
  • Add copyright notice [skip ci]
AndreasK retitled this revision from DO NOT MERGE - Experimental code layout to NCG: New code layout algorithm..Aug 30 2018, 2:18 PM
AndreasK edited the summary of this revision. (Show Details)
AndreasK updated this revision to Diff 17873.Aug 30 2018, 2:37 PM
  • Update flag documentation
  • Remove unnecessary derived instances
AndreasK updated this revision to Diff 17879.Sep 1 2018, 4:07 PM
  • Rebase
  • Massive speed improvements for new code layout in edge cases.
AndreasK updated this revision to Diff 17882.Sep 2 2018, 5:53 AM
  • Update stats to account for OsX allocation results

If this adds a big compile time overhead, perhaps it should be enabled for -O2 only?

If this adds a big compile time overhead, perhaps it should be enabled for -O2 only?

After the latest compile time performance changes it should be fast enough for -O1.

For nofib, compile time increased by less than 1% after the latest changes:
+0.8% when using -O2 and +0.6% when using -O1.
So I think enabling it for -O1 is reasonable.

The only time I've noticed big differences was with -O0, where there was about a ~7% difference.
But for that file the difference shrank to <0.2% at -O1/-O2 as well.

AndreasK planned changes to this revision.Sep 14 2018, 2:09 PM

I found some slowdowns in the binary benchmarks with this patch, which can be traced back to how some Cmm files in the RTS are compiled.
I'm not sure the resulting code is actually worse, as that depends on the actual program, but it's definitely worse for the binary benchmarks.

Intentional or not, the old layout code seems mostly faster because it places if/else branch targets such that they line up better with the hot code paths.
Given that dependence on such behavior has led to regressions in the past, making these expectations explicit seems like a good idea either way.

A small part also happens when a loop is placed "rotated", such that the loop header ends up in the middle of a linear code sequence and it takes an extra jump to enter the loop.
The overhead is rather small, but for code that is expected to execute the loop body only once unless an error occurs (spin locks, waiting for black holes, ...) it can still make a noticeable
difference.
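The shape of a fix for the rotation issue can be sketched as follows. This is purely illustrative, with a hypothetical `rotateToHeader` over a plain list of block ids, not code from this patch:

```haskell
-- If the loop header sits mid-chain, rotate the chain so the header
-- leads it, and the loop is entered without the extra jump.
rotateToHeader :: Eq a => a -> [a] -> [a]
rotateToHeader header chain =
  case break (== header) chain of
    (pre, post@(_:_)) -> post ++ pre   -- header found: bring its suffix to the front
    _                 -> chain         -- header absent: leave the chain alone
```

For example, `rotateToHeader 3 [1,2,3,4,5]` yields `[3,4,5,1,2]`. Whether such a rotation is actually profitable depends on how often the loop is entered versus iterated, which is exactly the trade-off described above.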

The first part is pretty independent of this patch and I will likely open another diff if I decide to add the annotations.

But either way, until I can find the time to look at these I don't think it makes much sense to merge.

AndreasK updated this revision to Diff 18136.Thu, Sep 27, 5:23 PM
  • Rebasing
AndreasK updated this revision to Diff 18137.Thu, Sep 27, 6:15 PM
  • Rebase artefact

Partial review

compiler/cmm/CmmNode.hs
578

Strict pattern match?

580–581

Strict

588

instead of error, use mapLookup and expectJust
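For readers unfamiliar with the suggested pattern, a stand-in sketch: GHC's real `expectJust` lives in a utility module, and here `lookupEdgeInfo` plus the use of `Data.Map` in place of `LabelMap`'s `mapLookup` are illustrative assumptions:

```haskell
import qualified Data.Map.Strict as M

-- Same shape as GHC's expectJust: like fromJust, but the panic
-- message says what we expected to find.
expectJust :: String -> Maybe a -> a
expectJust _   (Just x) = x
expectJust msg Nothing  = error ("expectJust: " ++ msg)

-- The suggested pattern: total lookup plus a descriptive failure,
-- instead of a bare `error` branch in the caller.
lookupEdgeInfo :: Int -> M.Map Int Int -> Int
lookupEdgeInfo bid cfg =
  expectJust ("no CFG entry for block " ++ show bid) (M.lookup bid cfg)
```

The payoff is that a broken invariant panics with the block id involved rather than an anonymous pattern-match failure.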

compiler/main/DynFlags.hs
493–494

Why do we need two flags? I would have thought that a single flag to enable the new algorithm should be enough.

compiler/nativeGen/CFG.hs
75

LabelMap is more efficient than M.Map Label

137–140

I think you can do this more succinctly with a mapAlter
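The `mapAlter` suggestion looks roughly like this with `Data.Map.alter`, which has the same shape as `LabelMap`'s `mapAlter`; `addEdge` and the nested-map edge representation are assumptions for illustration, not the patch's actual types:

```haskell
import Data.Maybe (fromMaybe)
import qualified Data.Map.Strict as M

-- Insert (or overwrite) an edge in a single pass over the outer map,
-- creating the inner successor map if the source block has no entry yet.
addEdge :: Int -> Int -> Int
        -> M.Map Int (M.Map Int Int) -> M.Map Int (M.Map Int Int)
addEdge from to weight =
  M.alter (Just . M.insert to weight . fromMaybe M.empty) from
```

The `alter` version replaces the usual lookup-then-insert-or-adjust dance with one traversal of the map.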

142

You probably want this to return a LabelSet, otherwise it is likely to have duplicates. It looks like we usually want to immediately listToSet on the result anyway.

147–152

= M.member node m || any (M.member node) m

154

Specifically, I think this is "check that the CFG only mentions nodes drawn from the given list" (and perhaps the list should be a set)

170

Needs Haddock

Thanks for the review!

I think all of your suggestions are good.

Around ICFP some private things came up back home which will take most of my spare time for a bit.
I think I will find time to put the final touches on this by the end of October, though.