NCG: New code layout algorithm.

Authored by AndreasK on May 23 2018, 3:57 PM.



This patch implements a new code layout algorithm.
It has been tested for x86 and is disabled on other platforms.

Performance varies slightly be CPU/Machine but in general seems to be better
by around 2%.
Nofib shows only small differences of about +/- ~0.5% overall depending on
flags/machine performance in other benchmarks improved significantly.

Other benchmarks includes at least the benchmarks of: aeson, vector, megaparsec, attoparsec,
containers, text and xeno.

While the magnitude of gains differed three different CPUs where tested with
all getting faster although to differing degrees. I tested: Sandy Bridge(Xeon), Haswell,

  • Library benchmark results summarized:
    • containers: ~1.5% faster
    • aeson: ~2% faster
    • megaparsec: ~2-5% faster
    • xml library benchmarks: 0.2%-1.1% faster
    • vector-benchmarks: 1-4% faster
    • text: 5.5% faster

On average GHC compile times go down, as GHC compiled with the new layout
is faster than the overhead introduced by using the new layout algorithm,

Things this patch does:

  • Move code responsilbe for block layout in it's own module.
  • Move the NcgImpl Class into the NCGMonad module.
  • Extract a control flow graph from the input cmm.
  • Update this cfg to keep it in sync with changes during asm codegen. This has been tested on x64 but should work on x86. Other platforms still use the old codelayout.
  • Assign weights to the edges in the CFG based on type and limited static analysis which are then used for block layout.
  • Once we have the final code layout eliminate some redundant jumps.

    In particular turn a sequences of: jne .foo jmp .bar foo: into je bar foo: ..
Test Plan


Diff Detail

rGHC Glasgow Haskell Compiler
Automatic diff as part of commit; lint not applicable.
Automatic diff as part of commit; unit tests not applicable.
There are a very large number of changes, so older changes are hidden. Show Older Changes
jmct added a comment.EditedJul 30 2018, 10:00 AM

The one question I have remaining is whether using unsafeGlobalDynFlags. I've left a note on the line in question to that effect.

Otherwise, just a bit of comment and TODO cleanup (which I think you've already done on your local branch during our call), getting it past Harbormaster, and then it LGTM.


Either the flags should be passed in, or (if passing them in turns out to be costly), make a note about why this decision was made.

AndreasK updated this revision to Diff 17518.Jul 30 2018, 10:25 AM
  • Only print nodes not covered by edges explicitly
  • Small improvements from jmcts review
  • Assert include fixup
  • Remove unused bindings
AndreasK updated this revision to Diff 17659.Aug 14 2018, 9:27 AM
  • Be more aggresive in removing dead code.
  • Add last2 to get last two elements from list efficiently
  • Invert cond branch jump order at the asm level in order to eliminate one.
  • Decouple conditional branch weights from cmm branch order..
AndreasK updated this revision to Diff 17672.Aug 15 2018, 7:24 AM
  • Remove unused parameters/imports.
AndreasK updated this revision to Diff 17701.Aug 20 2018, 8:39 AM
  • Line widths and minor refactorings that go along with it.
  • Update flag description, make vanilla default on old layout.
  • Fix info table check
  • Disable new layout code on unsupported target archs
AndreasK updated this revision to Diff 17702.Aug 20 2018, 9:25 AM
  • Remove unused import
AndreasK updated this revision to Diff 17703.Aug 20 2018, 11:26 AM
  • Rename test
AndreasK updated this revision to Diff 17715.Aug 21 2018, 8:42 AM
  • Update test allocations
AndreasK updated this revision to Diff 17757.Aug 23 2018, 3:48 AM
  • Add copyright notice [skip ci]
AndreasK retitled this revision from DO NOT MERGE - Experimental code layout to NCG: New code layout algorithm..Aug 30 2018, 2:18 PM
AndreasK edited the summary of this revision. (Show Details)
AndreasK updated this revision to Diff 17873.Aug 30 2018, 2:37 PM
  • Update flag documentation
  • Remove unneccesary derived instances
AndreasK updated this revision to Diff 17879.Sep 1 2018, 4:07 PM
  • Rebase
  • Massive speed improvements for new code layout in edge cases.
AndreasK updated this revision to Diff 17882.Sep 2 2018, 5:53 AM
  • Update stats to account for OsX allocation results

If this adds a big compile time overhead, perhaps it should be enabled for -O2 only?

If this adds a big compile time overhead, perhaps it should be enabled for -O2 only?

After the latest compile time perfomance changes it should be fast enough for O1.

For nofib compile time increased by less than 1% after the latest changes.
+0.8% when using -O2 and +0.6% when using -O1.
So I think enabling it for -O1 is reasonable.

The only time I've noticed big differences was with -O0 where there was a about ~7% difference.
But for that file the difference shrank to <0.2% at -O1/2 as well.

AndreasK planned changes to this revision.Sep 14 2018, 2:09 PM

I found some slowdowns in the binary benchmarks with this patch which can be traced back to how some Cmm files are compiled in the RTS.
Not sure if the resulting code is actually worse as that depends on the actual program, but it's definitely worse for the binary benchmarks.

Intentional or not the old layout code seems mostly faster because it places if/else branch targets such that it lines up better with the hot code paths.
Given that dependence on such behavior had lead to regressions in the past making these explicit seems to be a good idea either way.

A small part also happens when a loop is placed "rotated" such that the loop header is placed in the middle of linear code sequence such that it takes an extra jump to enter the loop.
The overhead is rather small, but for code that is expected to only execute the loop body once unless an error occurs (spin locks, waiting for black holes, ...) it can still make a noticeable

The first part is pretty independent of this patch and I will likely open another diff if I decide to add the annotations.

But either way till I can find the time to look at these I don't think it makes much sense to merge.

AndreasK updated this revision to Diff 18136.Sep 27 2018, 5:23 PM
  • Rebasing
AndreasK updated this revision to Diff 18137.Sep 27 2018, 6:15 PM
  • Rebase artefact

Partial review


Strict pattern match?




instead of error, use mapLookup and expectJust


Why do we need two flags? I would have thought that a single flag to enable the new algorithm should be enough.


LabelMap is more efficient than M.Map Label


I think you can do this more succinctly with a mapAlter


You probably want this to return a LabelSet, otherwise it is likely to have duplicates. It looks like we usually want to immediately listToSet on the result anyway.


= M.member node m || any (M.member node) m


Specifically, I think this is "check that the CFG only mentions nodes drawn from the given list" (and perhasp the list should be a set)


Needs Haddock

Thanks for the review!

I think all of your suggestions are good.

Around ICFP some private things came up back home which will take most of my spare time for a bit.
I think I will find time to put the last touches on this by end of October though.

A bit more review


Seems slightly strange that we don't need a BlockId here. In fact a BlockId would unambiguously identify the source node, so it would give you more information.


I'm missing some context about what shortcutting actually does here. Needs some cross-referencing or perhaps move this function elsewhere.


Just addEdge from to (mkWeightInfo weight)


Use either BlockId or Label consistently (I think you're using BlockId elsewhere)


maybe [] M.toList $ M.lookup bid m


= edgeWeight $ getEdgeInfo from to m

(actually not clear whether it's worth defining this at all)

AndreasK updated this revision to Diff 18388.Oct 20 2018, 5:53 AM
AndreasK marked 17 inline comments as done.
  • Refactorings based on Phab feedback.

Thank you for the feedback!


With this patch we have three options.

  • Use the old approach as is.
  • Use the old approach but instead of the last jump target try to place the priority target next.
  • Use the new algorithm.

The old code will remain because it will still be used for non-x86 backends.
So might as well make it accessible. Also helps with overtuned code that implicitly depends on the old behavior.

For VERY large decision trees like functions there can also still be a smallish but measurable compile time increase in the new algorithm.
So I want to have the second option because it has slightly better runtime performance than the old approach while
not suffering from this edge case.

It's not noticeable unless one really tortures the compiler. But I'm sure someone does torture the compiler and will be glad to have it.

And the new layout algorithm options is self explanatory at this point I assume.

I don't think having the additional option(s) is expensive. But I guess we could reduce it to two:

  • Use the new algorithm.
  • Use the old algorithm. But still use CFG weights unless they are not available for this backend.

But there is always overoptimized code that depends on the old behaviour and what not.

Either way I would rather do this as a follow up patch if people think that's the way to go.


Good catch thanks.

Funny enough switching to LabelMap increased compile time allocations slightly. Something like 0.4% at -O0 so close to being noise.
I hope/think that's because of changes in inlining or something of the sort.


I don't really disagree here.

In fact a BlockId would unambiguously identify the source node, so it would give you more information.

I thought about that but I couldn't figure out a way where this seems really better.

  • It would in the trivial way only give us the source block not the node. Currently I only need info from the exit node.
  • If we want to look up the exit node directly we have to create and keep up to date a mapping from BlockId to exit CmmNode. This can get complex when the backend deletes, or inserts new basic blocks.

If it's very helpful for other optimizations we can always change this later.
In the end keeping some information between passes by pinning it to the edges seemed simpler for the block layout use case.


As far as I know mapAlter has a small overhead because of the maybes if we already know entries will exist and we will never remove the entry.

Using mapAdjust (which doesn't exist currently) would be an option but I tried to avoid using Map functions somewhat to make it easier to switch out the underlying data structure if we need to in the future.


I actually intended it to do the exact comparison. My mistake was to assume that setDifference is the symmetric version.


I will expand the comment with more context and point to the shortcut info in the Cmm shortcut info.


Good catch.


Good Catch.

Sometimes I ended up looking at code using Label somewhere else while writing a function and my brain just switches over to using Label and I only notice the next time I look at it.


In general I think such definitions can be helpful for clarity as they reduce irrelevant noise when reading code.

In this case though turns out I have refactored away all uses by now. So I deleted it.

AndreasK marked 9 inline comments as done.Oct 20 2018, 6:29 AM

As I want to get this into 8.8 a last call for reviews! I will look to land this in the next two days.

AndreasK updated this revision to Diff 18705.Nov 16 2018, 9:52 AM
  • Use foldMap to collect nodes of the cfg
  • Bikeshedding of flag names
  • More bikeshedding
RyanGlScott accepted this revision.Nov 16 2018, 9:59 AM
RyanGlScott added a subscriber: RyanGlScott.

These flag names look good to me. Great patch.

This revision is now accepted and ready to land.Nov 16 2018, 9:59 AM
AndreasK updated this revision to Diff 18714.Nov 17 2018, 2:54 AM
  • Rebase, add release note.
AndreasK edited the summary of this revision. (Show Details)Nov 17 2018, 3:01 AM
This revision was automatically updated to reflect the committed changes.