cmm/CmmLayoutStack: avoid generating unnecessary reloads
ClosedPublic

Authored by michalt on May 14 2017, 9:52 AM.

Details

Summary

cmm/CmmLayoutStack: avoid generating unnecessary reloads

This tries to be more precise when generating reloads of local
registers in proc points. Previously we'd reload all local registers
that were live. But we used liveness information that assumed local
registers survive native calls. For the purpose of reloading registers
this is an overapproximation and might lead to generating huge amounts
of unnecessary reloads (in case there's another proc point before the
register is used).
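
As a toy illustration of that overapproximation (simplified, made-up types; not GHC's actual CmmLive code): with a backward transfer that lets locals survive a native call, a register used only after a later call continuation still shows up as live at every earlier one, triggering a reload at each; killing all locals at the call avoids that.

```haskell
import qualified Data.Set as Set

-- Toy model: local registers are just names, and a straight-line block is a
-- list of nodes. (Illustrative only; GHC's real types are CmmNode etc.)
type Reg = String

data Node
  = Use [Reg]      -- a node that reads these registers
  | NativeCall     -- a native call; locals do NOT actually survive it
  deriving (Eq, Show)

-- Backward liveness transfer for one node. The Bool selects the old,
-- imprecise assumption that locals survive native calls.
transfer :: Bool -> Node -> Set.Set Reg -> Set.Set Reg
transfer _ (Use rs) live = Set.union live (Set.fromList rs)
transfer surviveCalls NativeCall live
  | surviveCalls = live       -- overapproximation: everything stays live
  | otherwise    = Set.empty  -- precise: nothing is live across the call

-- Liveness at block entry, given liveness at block exit.
blockLiveIn :: Bool -> [Node] -> Set.Set Reg -> Set.Set Reg
blockLiveIn surviveCalls nodes liveOut =
  foldr (transfer surviveCalls) liveOut nodes
```

With `[NativeCall, Use ["x"]]` and nothing live at exit, the old assumption reports `x` live before the call (so the call continuation would reload it), while the precise transfer reports nothing live there.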

This change takes the approach of moving the generation of reloads to
a second pass over the Cmm, which allows us to recompute the liveness
and to use the knowledge that local registers do *not* survive calls.
This leads to generating only useful reloads. For an extreme example
where this helps a lot, please see T3294. This should also fix Trac #7198.

Finally, this re-introduces the code to do Cmm rewriting in the
Dataflow module (with the difference that we now operate on a whole
block at a time).

Signed-off-by: Michal Terepeta <michal.terepeta@gmail.com>

Diff Detail

Repository
rGHC Glasgow Haskell Compiler
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
michalt created this revision. May 14 2017, 9:52 AM

This is a bit of an RFC patch - there are a few things that might need some discussions (general approach in CmmLayoutStack, interface for Cmm rewriting, etc.). So please let me know what you think about it!

simonmar edited edge metadata. May 15 2017, 2:50 AM

Nice observation about the unnecessary reloads, and the results look good. I presume with -O the unnecessary reloads will be eliminated by CmmSink, but eliminating them earlier saves compile time and helps with -O0.

One question about the approach: do we have to do a combined analysis/rewrite, or can we just do a liveness analysis (with emptying the live set at a proc-point) followed by inserting the reloads? It's not clear why this new machinery is needed.

compiler/cmm/CmmLayoutStack.hs
138

I'm guessing this is not just proc points but also call continuations; it would be good to clarify that.

144–173

Nice!

Nice observation about the unnecessary reloads, and the results look good. I presume with -O the unnecessary reloads will be eliminated by CmmSink, but eliminating them earlier saves compile time and helps with -O0.

Cool, thanks for having a look! Yes, the sinking pass would usually remove all those reloads. But the liveness analysis it does would be quite a bit slower due to the size of the Cmm.

One question about the approach: do we have to do a combined analysis/rewrite, or can we just do a liveness analysis (with emptying the live set at a proc-point) followed by inserting the reloads? It's not clear why this new machinery is needed.

Yes, this shouldn't be necessary now (I experimented with the rewriting in my earlier approach). I'll change it to use just the existing analysis pass.
Later we can consider using the rewriting approach to remove dead assignments *and* calculate liveness at the same time.
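
A sketch of that analysis-then-insert shape (hypothetical helper names and toy maps; GHC's real pass works on CmmGraph blocks):

```haskell
import qualified Data.Map as Map
import qualified Data.Set as Set

type Reg = String
type Label = Int

data Node = Reload Reg | Other String deriving (Eq, Show)

-- Given the proc points (including call continuations) and the registers
-- live at each block's entry, prepend one reload per live register.
insertReloads
  :: Set.Set Label                -- proc-point labels
  -> Map.Map Label (Set.Set Reg)  -- liveness at block entry
  -> Map.Map Label [Node]         -- blocks
  -> Map.Map Label [Node]
insertReloads procPoints liveIn = Map.mapWithKey addReloads
  where
    addReloads lbl body
      | lbl `Set.member` procPoints =
          [ Reload r
          | r <- Set.toList (Map.findWithDefault Set.empty lbl liveIn)
          ] ++ body
      | otherwise = body
```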

simonmar requested changes to this revision. May 18 2017, 7:32 AM

Ok, passing back to you for refactoring. I'll take another look when you update the diff.

This revision now requires changes to proceed. May 18 2017, 7:32 AM

Nice observation about the unnecessary reloads, and the results look good. I presume with -O the unnecessary reloads will be eliminated by CmmSink, but eliminating them earlier saves compile time and helps with -O0.

Cool, thanks for having a look! Yes, the sinking pass would usually remove all those reloads. But the liveness analysis it does would be quite a bit slower due to the size of the Cmm.

One question about the approach: do we have to do a combined analysis/rewrite, or can we just do a liveness analysis (with emptying the live set at a proc-point) followed by inserting the reloads? It's not clear why this new machinery is needed.

Yes, this shouldn't be necessary now (I experimented with the rewriting in my earlier approach). I'll change it to use just the existing analysis pass.
Later we can consider using the rewriting approach to remove dead assignments *and* calculate liveness at the same time.

Actually, this is not quite accurate. There is a good reason to interleave rewriting and analysis:

  • When adding reloads in a proc-point we need to know what is live in the body of that block and any non-proc-point blocks that are reachable from it.
  • But for propagating the dataflow facts (i.e., when analyzing the blocks preceding the current one), we want to say that nothing is live at the entry of a proc-point.

So we actually need to track two things. If we want to avoid interleaving the rewriting and analysis, we would need to create another DataflowLattice that has not only liveness information, but also something additional that would allow us to figure out what regs to reload (e.g., we could probably just store the set of registers to reload).
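
That extra lattice could look roughly like this (a sketch of the alternative with made-up names, not GHC's actual DataflowLattice instance):

```haskell
import qualified Data.Set as Set

type Reg = String

-- A fact carrying both pieces of information described above: ordinary
-- backward liveness, plus the registers a proc point at this entry would
-- have to reload. (Hypothetical type.)
data Fact = Fact
  { live    :: Set.Set Reg  -- registers live here
  , reloads :: Set.Set Reg  -- registers to reload if this is a proc point
  } deriving (Eq, Show)

-- At a proc-point entry: record the live set as the reload set, then tell
-- predecessors that nothing is live (locals don't survive the call).
atProcPointEntry :: Fact -> Fact
atProcPointEntry f = Fact { live = Set.empty, reloads = live f }

-- The join is union on both halves.
joinFacts :: Fact -> Fact -> Fact
joinFacts a b = Fact (Set.union (live a) (live b))
                     (Set.union (reloads a) (reloads b))
```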

The current approach doesn't need to track this additional info because it generates the reloads immediately and only then propagates the information that nothing is live at proc-point entry.

I'm slightly in favor of rewriting & analyzing at the same time.

What do you think?

Nice observation about the unnecessary reloads, and the results look good. I presume with -O the unnecessary reloads will be eliminated by CmmSink, but eliminating them earlier saves compile time and helps with -O0.

Cool, thanks for having a look! Yes, the sinking pass would usually remove all those reloads. But the liveness analysis it does would be quite a bit slower due to the size of the Cmm.

One question about the approach: do we have to do a combined analysis/rewrite, or can we just do a liveness analysis (with emptying the live set at a proc-point) followed by inserting the reloads? It's not clear why this new machinery is needed.

Yes, this shouldn't be necessary now (I experimented with the rewriting in my earlier approach). I'll change it to use just the existing analysis pass.
Later we can consider using the rewriting approach to remove dead assignments *and* calculate liveness at the same time.

Actually, this is not quite accurate. There is a good reason to interleave rewriting and analysis:

  • When adding reloads in a proc-point we need to know what is live in the body of that block and any non-proc-point blocks that are reachable from it.
  • But for propagating the dataflow facts (i.e., when analyzing the blocks preceding the current one), we want to say that nothing is live at the entry of a proc-point.

So we actually need to track two things. If we want to avoid interleaving the rewriting and analysis, we would need to create another DataflowLattice that has not only liveness information, but also something additional that would allow us to figure out what regs to reload (e.g., we could probably just store the set of registers to reload).

Right, this is sort of what I had in mind when I said "just do a liveness analysis (with emptying the live set at a proc-point)". But you're right that you also need to know what's live at the start of each block for reloading.

The current approach doesn't need to track this additional info because it generates the reloads immediately and only then propagates the information that nothing is live at proc-point entry.

I'm slightly in favor of rewriting & analyzing at the same time.

Sure, if that works out more cleanly then it's fine. I was slightly worried about the extra overhead of doing rewriting + analysis since that was always too expensive when I experimented with it before, but perhaps the new framework for doing block-level analysis and rewriting overcomes that.

michalt marked an inline comment as done. Jun 4 2017, 10:14 AM

[...]

The current approach doesn't need to track this additional info because it generates the reloads immediately and only then propagates the information that nothing is live at proc-point entry.

I'm slightly in favor of rewriting & analyzing at the same time.

Sure, if that works out more cleanly then it's fine. I was slightly worried about the extra overhead of doing rewriting + analysis since that was always too expensive when I experimented with it before, but perhaps the new framework for doing block-level analysis and rewriting overcomes that.

Right, even though the core algorithm for rewriting is not all that different compared to the Hoopl way of doing things, I'd still expect some perf win with the block-oriented approach due to a bunch of simplifications (we don't pass around BwdPass containing functions, joinBlocksOO is simpler, etc.) and more work is moved to the analysis definition.

But there is also a pretty large difference in this particular liveness/rewriting pass. The block-oriented interface allows us to avoid rewriting any existing nodes (I'm just using the gen_kill from CmmLive on the existing middle nodes). So if the analysis information gets updated, we pay for:

  1. redoing analysis (which we would have to do anyway even in an analysis only pass)
  2. recomputing/recreating the reloads (but only for proc-points)

So I'd expect the analysis-only approach to be similar in performance (recreating reloads vs passing around two sets of live registers).
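
In toy form, the interleaved version amounts to something like this (illustrative types only; the real code uses CmmGraph blocks and the gen_kill transfer from CmmLive):

```haskell
import qualified Data.Set as Set

type Reg = String

data Node = Reload Reg | Use [Reg] deriving (Eq, Show)

-- One backward step over a whole block: analyze the existing nodes with a
-- plain transfer function and only *prepend* reloads at proc points --
-- no node-by-node rewriting is needed.
rewriteBlock
  :: Bool                   -- is this block a proc point?
  -> [Node]                 -- block body
  -> Set.Set Reg            -- liveness at block exit
  -> ([Node], Set.Set Reg)  -- new body, fact propagated to predecessors
rewriteBlock isProcPoint body liveOut = (body', liveIn')
  where
    liveIn = foldr transfer liveOut body
    body'
      | isProcPoint = [Reload r | r <- Set.toList liveIn] ++ body
      | otherwise   = body
    -- propagate "nothing live" out of a proc point: locals don't survive calls
    liveIn'
      | isProcPoint = Set.empty
      | otherwise   = liveIn
    transfer (Reload r) l = Set.delete r l
    transfer (Use rs)   l = Set.union l (Set.fromList rs)
```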

In any case, I've run nofib (HEAD from mid-May, without vs. with the change):

  • Compile Times:
-1 s.d.                -----            -1.0%
+1 s.d.                -----            +1.8%
Average                -----            +0.4%
  • Compile Allocations
-1 s.d.                -----            +0.2%
+1 s.d.                -----            +0.6%
Average                -----            +0.4%

So unfortunately nofib regresses slightly, but IMHO it's worth it to get rid of the quadratic behavior.

compiler/cmm/CmmLayoutStack.hs
138

Yes, I think that's correct - continuation blocks should be included.
But I might be a bit confused by the naming here - I thought that a continuation block is also called a proc-point (e.g., CmmProcPoint.callProcPoints suggests that any block that is a continuation of a CmmCall is a proc-point).
What would be a better way to phrase this? (Or a better name?)

144–173

Thanks :)

simonmar accepted this revision. Jun 13 2017, 3:13 AM

I think you need to adjust the bounds on the perf tests again. Did T13379 regress?

compiler/cmm/CmmLayoutStack.hs
138

Yes, I think there's some confusion over naming here. I tend to think of a proc point as something that we're going to compile into a separate procedure, and when using the native code generator we don't have any proc points.

How about "at each proc point and call continuation"

This revision is now accepted and ready to land. Jun 13 2017, 3:13 AM
michalt updated this revision to Diff 12829. Jun 13 2017, 2:54 PM
michalt marked 5 inline comments as done.
  • Update comment and perf test

I think you need to adjust the bounds on the perf tests again. Did T13379 regress?

I'm not able to reproduce on Linux. Let's see if another run also fails...

But I did have to update T13701 - during my local ./validate run it failed with ~11% more allocation:

bytes allocated value is too high:
    Expected    T13701(normal) bytes allocated: 2511285600 +/-10%
    Lower bound T13701(normal) bytes allocated: 2260157040 
    Upper bound T13701(normal) bytes allocated: 2762414160 
    Actual      T13701(normal) bytes allocated: 2794760840 
    Deviation   T13701(normal) bytes allocated:       11.3 %
*** unexpected stat test failure for T13701(normal)
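
For reference, the bounds in that log are just a ±10% window around the expected figure, and the reported deviation follows from the actual value:

```haskell
-- Reproduce the testsuite's bounds arithmetic for the log above.
expected, actual :: Integer
expected = 2511285600
actual   = 2794760840

lowerBound, upperBound :: Integer
lowerBound = expected * 90  `div` 100  -- 2260157040
upperBound = expected * 110 `div` 100  -- 2762414160

deviationPct :: Double
deviationPct =
  100 * (fromIntegral actual - fromIntegral expected) / fromIntegral expected
  -- ~11.3%, just outside the +/-10% window
```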

But on a number of subsequent runs of just this one test I had to *reduce* the threshold by ~12% to make it pass:

bytes allocated value is too low:
(If this is because you have improved GHC, please
update the test so that GHC doesn't regress again)
    Expected    T13701(normal) bytes allocated: 2511285600 +/-10%
    Lower bound T13701(normal) bytes allocated: 2260157040
    Upper bound T13701(normal) bytes allocated: 2762414160
    Actual      T13701(normal) bytes allocated: 2188756880
    Deviation   T13701(normal) bytes allocated:      -12.8 %
*** unexpected stat test failure for T13701(normal)

I don't have an explanation for that other than some interesting timing effects (I ran validate with many threads, but the single test by itself on an idle system).

michalt updated this revision to Diff 12834. Jun 14 2017, 1:41 PM
  • Revert perf thresholds for T13701
  • Update perf threshold for T13379 on osx

I've changed the threshold for T13379 on OSX. According to the comment in the test, on OSX we already had 9% higher allocation, and this patch simply pushes it above the 10% threshold. So this seems to be in line with the slight increase in allocation for some programs.

But T13701 is weird. It does seem to be sensitive to system load (?!). When validating locally with CPUS=8 it fails due to too *high* allocation; when validating with CPUS=2 it fails due to too *low* allocation. So for now I've just reverted any changes I made to the thresholds (they should be the same as we currently have in HEAD).

Very mysterious. @bgamari, any idea what might cause the strange behaviour that @michalt reports above?

michalt added a subscriber: kavon. Jun 15 2017, 12:45 PM

I've changed the threshold for T13379 on OSX. According to the comment in the test, on OSX we already had 9% higher allocation, and this patch simply pushes it above the 10% threshold. So this seems to be in line with the slight increase in allocation for some programs.

But T13701 is weird. It does seem to be sensitive to system load (?!). When validating locally with CPUS=8 it fails due to too *high* allocation; when validating with CPUS=2 it fails due to too *low* allocation. So for now I've just reverted any changes I made to the thresholds (they should be the same as we currently have in HEAD).

So I've recompiled GHC without this diff to see if this changes anything (at 0058a3490f, which was HEAD on June 9th). And it reproduces: under heavy load the test fails due to too-high allocation; run by itself, it fails due to too-low allocation.
From ./validate on a loaded system:

    Expected    T13701(normal) bytes allocated: 2511285600 +/-10%
    Lower bound T13701(normal) bytes allocated: 2260157040 
    Upper bound T13701(normal) bytes allocated: 2762414160 
    Actual      T13701(normal) bytes allocated: 3127866896 
    Deviation   T13701(normal) bytes allocated:       24.6 %
*** unexpected stat test failure for T13701(normal)

In the same source tree/build, but running by itself on an idle system:

> make TEST=T13701
[...]
    Expected    T13701(normal) bytes allocated: 2511285600 +/-10%
    Lower bound T13701(normal) bytes allocated: 2260157040
    Upper bound T13701(normal) bytes allocated: 2762414160
    Actual      T13701(normal) bytes allocated: 2187784192
    Deviation   T13701(normal) bytes allocated:      -12.9 %
*** unexpected stat test failure for T13701(normal)

So whatever is happening here, it seems it's unrelated to my change.

bgamari edited edge metadata. Jun 16 2017, 3:27 PM

But T13701 is weird. It does seem it's sensitive to the system load (?!). When validating locally with CPUS=8 it fails due to *high* allocation, when validating with CPUS=2 it fails due to too *low* allocation. So for now I've just reverted any changes I did to the thresholds (they should be the same as we currently have in HEAD).

Indeed I've noticed that allocations tend to fluctuate with system load. I sometimes see test failures during parallel validations which then vanish when I run just the failing test alone. It's a rather frustrating thing to reproduce and debug, so I've just accepted that this is the way the world works.

This revision was automatically updated to reflect the committed changes.