Caching coercion roles in NthCo and coercionKindsRole refactoring
ClosedPublic

Authored by tdammers on Feb 7 2018, 4:46 AM.

Details

Summary

While addressing nonlinear behavior related to coercion roles, particularly NthCo, we noticed that coercion roles are recalculated often even though they should be readily at hand already in most cases. This patch adds a Role to the NthCo constructor so that we can cache them rather than having to recalculate them on the fly. https://ghc.haskell.org/trac/ghc/ticket/11735#comment:23 explains the approach.

Performance improvement over GHC HEAD, when compiling Grammar.hs (see below):

GHC 8.2.1:

ghc Grammar.hs  176.27s user 0.23s system 99% cpu 2:56.81 total

before patch (but with other optimizations applied):

ghc Grammar.hs -fforce-recomp  175.77s user 0.19s system 100% cpu 2:55.78 total

after:

../../ghc/inplace/bin/ghc-stage2 Grammar.hs  10.32s user 0.17s system 98% cpu 10.678 total

Introduces the following regressions:

  • perf/compiler/parsing001 (possibly false positive)
  • perf/compiler/T9872
  • perf/compiler/haddock.base

Diff Detail

Repository
rGHC Glasgow Haskell Compiler
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
There are a very large number of changes, so older changes are hidden. Show Older Changes
tdammers edited the summary of this revision. (Show Details)Mar 5 2018, 7:04 AM
tdammers edited the test plan for this revision. (Show Details)
tdammers marked 6 inline comments as done.Mar 5 2018, 8:12 AM
tdammers added inline comments.
compiler/types/Coercion.hs
1315

Hmm, it turns out that making this change triggers GHC's unused binding warning, and rightfully so, because the role argument is now redundant. Which means we can further simplify the (n_args, role) ... part to just n_args ....

tdammers updated this revision to Diff 15646.Mar 5 2018, 8:25 AM
tdammers marked an inline comment as done.
  • Fix unused variables
tdammers marked an inline comment as done.Mar 5 2018, 8:27 AM
tdammers added inline comments.
compiler/types/Coercion.hs
1315

AFAICT, this is what should remain after getting rid of the unused role variable. Does this look OK?

simonpj requested changes to this revision.Mar 5 2018, 10:24 AM

LGTM, but I have not re-scanned in detail.

One definite change needed. then land it!

compiler/types/Coercion.hs
1316

I still don't want this terrible case analysis on tc here. And I'm sure there is no need for it.
We want g1, g2 to be the last two elements of co_list.

So, you could use tyConArity tc instead of n_args in the defns of g1, g2

Or maybe even

(g2:g2:_) = reverse co_list

but NOT this multi-branch case analysis

This revision now requires changes to proceed.Mar 5 2018, 10:24 AM
tdammers edited the summary of this revision. (Show Details)Mar 5 2018, 10:33 AM
tdammers updated this revision to Diff 15663.Mar 6 2018, 5:12 AM
tdammers marked an inline comment as done.
  • Refactor mkCoCast
tdammers marked 2 inline comments as done.Mar 6 2018, 5:20 AM
bgamari requested changes to this revision.Mar 7 2018, 7:45 PM

T11195 needs to be updated.

This revision now requires changes to proceed.Mar 7 2018, 7:45 PM
tdammers updated this revision to Diff 15737.Mar 13 2018, 8:09 AM
  • Fix test to reflect changed NthCo ctor
tdammers updated this revision to Diff 15743.Mar 14 2018, 4:57 AM
  • Fixed NthCo Note (discussion in D4385)
  • Documented invariant on cached roles
  • Tighten cached role in NthCo
  • NthCo handling, assertions and documentation improvements
  • Fixes to Coercion.hs
  • Removed dead code
  • Refactored setNominalRole_maybe to avoid dragging role through recursion
  • Another fix as per D4394
  • Fixed unused variables
  • Refactor mkCoCast
  • Fix test to reflect changed NthCo ctor

Rebased onto current master.

@simonpj, you asked me to add the test case where we're seeing massive improvement, but as it is, it still takes 10 seconds to compile, so I don't want to add it to the test suite like that; but the code is Happy output, and I'd rather not have to manually reduce it to something smaller. I could try to hunt down the source and reduce that, or bite the bullet and do it manually anyway. Thoughts?

@simonpj, you asked me to add the test case where we're seeing massive improvement, but as it is, it still takes 10 seconds to compile, so I don't want to add it to the test suite like that; but the code is Happy output, and I'd rather not have to manually reduce it to something smaller. I could try to hunt down the source and reduce that, or bite the bullet and do it manually anyway. Thoughts?

Gah, I don't know. Maybe just give up on the test. However, please do record, in the ticket, the imrpovements from this patch, and from the other simplCast one; being clear about what code you are testing against, what compiler flags etc.

Maybe in the commit messages too.

Hmm, one test now fails on OS X. The relevant test case T5321 allocates 5.5% more now, which is 0.5% more than our set limit of 5%. The build also succeeds on the other platforms, only OS X seems to be having trouble.

So do we investigate? Just accept this (everything else seems to be on par, or faster) and bump the stat? Go back to the drawing board?

Oh bother!

That does come as a surprise. I really thought everything would be faster.

It's an acceptable loss -- in exchange for an asympotic improvement in nasty cases. But I always wonder if there might be money on the table; that is, maybe this loss is because of some accidental error that is easily fixed. If so, it'd be a shame to lose sight of it.

Perhaps spend an hour or three profiling, to see if you can identify any functions that are being called more often than before, or are allocating a lot more than before.

But first check whether the program size increasing (via -dshow-passes); e.g. are the coercions larger? I suppose that could happen if the coercion optimiser was missing an opportunity it found before.

Oh bother!

That does come as a surprise. I really thought everything would be faster.

So did I, and when I compile the relevant file with HEAD and the D4394 version, we see the massive improvement we expect - most passes are at least on par, but the vast majority is about an order of magnitude faster and allocates significantly less, even on this particular test case. I'm afraid I can't test on OS X though, so I don't have a decisive answer for you here.

It's an acceptable loss -- in exchange for an asympotic improvement in nasty cases. But I always wonder if there might be money on the table; that is, maybe this loss is because of some accidental error that is easily fixed. If so, it'd be a shame to lose sight of it.

Perhaps spend an hour or three profiling, to see if you can identify any functions that are being called more often than before, or are allocating a lot more than before.

But first check whether the program size increasing (via -dshow-passes); e.g. are the coercions larger? I suppose that could happen if the coercion optimiser was missing an opportunity it found before.

-dshow-passes doesn't reveal any losses or obvious problems:

  • Sizes (in terms of terms, types, coercions and joins) remain identical
  • Time remains the same or gets better, sometimes drastically so
  • Allocations remain the same or get better, dito

Then again, the test specification says "only_ways(['normal']), # no optimisation for this one", does that mean it uses a compiler built without optimisation?

when I compile the relevant file with HEAD and the D4394 version, we see the massive improvement we expect

Are you saying that the 5.5% allocation lossage shows up when you run the testsuite, but NOT when you run this test individually? That really is mysterious. Or are you saying something bad only happens on OSX which you do not have access to?

when I compile the relevant file with HEAD and the D4394 version, we see the massive improvement we expect

Are you saying that the 5.5% allocation lossage shows up when you run the testsuite, but NOT when you run this test individually? That really is mysterious. Or are you saying something bad only happens on OSX which you do not have access to?

The latter - I do not have an OS X machine available, and I only see the stat failure in the CI job for OS X. The Linux and Windows CI jobs are succeeding, and when I run things on my own setup (Debian), I get the expected improvements, and I cannot reproduce the stat failure. It doesn't matter whether I run the whole test suite or just this one test.

OK well then I suggest you just commit and move on. I suppose you may have to make an unexplained OSX-only bump in the test, pending someone having access to OSX. I don't want you to burn too much time on this

tdammers updated this revision to Diff 15752.Mar 17 2018, 12:05 PM
  • Increased allocation limit for T5321 on OS X
tdammers updated this revision to Diff 15797.Mar 20 2018, 7:00 AM
  • Fixed NthCo Note (discussion in D4385)
  • Documented invariant on cached roles
  • Tighten cached role in NthCo
  • NthCo handling, assertions and documentation improvements
  • Fixes to Coercion.hs
  • Removed dead code
  • Refactored setNominalRole_maybe to avoid dragging role through recursion
  • Another fix as per D4394
  • Fixed unused variables
  • Refactor mkCoCast
  • Fix test to reflect changed NthCo ctor
  • Increased allocation limit for T5321 on OS X

This should apply cleanly onto GHC HEAD (as of today)

simonpj accepted this revision.Mar 20 2018, 7:13 AM

I have not re-reviewed in detail, but let's just commit this.

Can you make sure that the commit message contains some data on perf changes. Eg some huge improvements.

tdammers updated this revision to Diff 15801.Mar 20 2018, 11:46 AM
  • Fix rebasing mistake
tdammers updated this revision to Diff 15810.Mar 22 2018, 6:52 AM

Cache coercion roles in NthCo and refactor coercionKindsRole

Instead of recalculating coercion roles for NthCo on the fly, store them in the
NthCo constructor.

Performance improvement over GHC HEAD, when compiling Grammar.hs from Trac
Trac #14683:

GHC 8.2.1:

ghc Grammar.hs 176.27s user 0.23s system 99% cpu 2:56.81 total

before patch (but with other optimizations applied):

ghc Grammar.hs -fforce-recomp 175.77s user 0.19s system 100% cpu 2:55.78 total

after:

../../ghc/inplace/bin/ghc-stage2 Grammar.hs 10.32s user 0.17s system 98% cpu 10.678 total

At @tdammers's request, I've taken a look to see if I can spot why this would degrade performance. Here are some thoughts:

  • tcIfaceCoercion might be the cause of trouble. Deserializing nested NthCos might be quadratic.
  • optCoercion might be the cause of trouble with nested NthCos, which might be quadratic or worse. You can disable coercion optimization with -fno-opt-coercion. Does that make a difference?
  • Caching can conceivably cause extra work than what was done previously. But I don't think it would be much. You can try removing the r field from seqCo to see if that makes a difference.
  • Does adding all the SCCs change performance?
  • It's always conceivable that the optimizer misses some opportunity given the change in code, but nothing jumps out at me in this regard.

I hope this helps!

why this would degrade performance.

Where is the data that says this patch degrades performance? The data in the previous comment shows a huge improvement.

tcIfaceCoercion might be the cause of trouble

I suppose that is just about possible; if so we can just serialise the cached role in IfaceNthCo.

why this would degrade performance.

Where is the data that says this patch degrades performance? The data in the previous comment shows a huge improvement.

tcIfaceCoercion might be the cause of trouble

I suppose that is just about possible; if so we can just serialise the cached role in IfaceNthCo.

bgamari requested changes to this revision.Mar 25 2018, 7:10 PM

I suppose that is just about possible; if so we can just serialise the cached role in IfaceNthCo.

Indeed this sounds worth a try. Requesting changes while @tdammers gives this a shot.

This revision now requires changes to proceed.Mar 25 2018, 7:10 PM

Where is the data that says this patch degrades performance? The data in the previous comment shows a huge improvement.

I'm very sorry about the confusion here; I emailed Richard about performance problems in D4395 (which actually do exist, and are described there; I have some concrete timing results that I will paste in a minute); but unfortunately I typo'd D4395 as D4394 in that message.

So the tenfold increase in execution time refers to D4395, not this patch here. The reason we cannot land D4394 yet is because I'm seeing inexplicable CI failures, as evidenced here.

I suppose that is just about possible; if so we can just serialise the cached role in IfaceNthCo.

Indeed this sounds worth a try. Requesting changes while @tdammers gives this a shot.

We have no evidence that it's a problem in practice. Tobais says that there is no perf issue with this patch. Once the CI thing is sorted, let's just land this. It's gone on much too long!

tdammers updated this revision to Diff 15871.Mar 27 2018, 10:14 AM
  • Fixed NthCo Note (discussion in D4385)
  • Documented invariant on cached roles
  • Tighten cached role in NthCo
  • NthCo handling, assertions and documentation improvements
  • Fixes to Coercion.hs
  • Removed dead code
  • Refactored setNominalRole_maybe to avoid dragging role through recursion
  • Another fix as per D4394
  • Fixed unused variables
  • Refactor mkCoCast
  • Fix test to reflect changed NthCo ctor
  • Increased allocation limit for T5321 on OS X
  • Fix rebasing mistake

Looks like I broke something when rebasing onto master.

More specifically, I broke something, but e3dbb44f53b2f9403d20d84e27f187062755a089 also did - that commit (currently in master) removes the nextRole export from TcType.hs, which this patch relies on.

tdammers updated this revision to Diff 15894.Mar 30 2018, 6:23 AM
  • Fix rebasing-induced failures

Apparently, the way I tried to fix rebasing-induced errors wasn't correct, but unfortunately I don't understand the code well enough to see where I went wrong. Maybe @goldfire or @simonpj could take a peek?

Apparently, the way I tried to fix rebasing-induced errors wasn't correct, but unfortunately I don't understand the code well enough to see where I went wrong. Maybe @goldfire or @simonpj could take a peek?

I'm lost. What is going wrong?

I suppose I can check out wip/tdammers/D4395-new, build it, and try to fix it, but in what way does it not work?

I suppose I can check out wip/tdammers/D4395-new, build it, and try to fix it, but in what way does it not work?

See the CI build output. Specifically, this compiler error:

ghc-stage1: panic! (the 'impossible' happened)
  (GHC version 8.5.20180330 for x86_64-unknown-linux):
ASSERT failed! file compiler/types/Coercion.hs, line 845

So it seems that the good_call assertion in mkNthCo fails while compiling stage2.

mkNthCo r n co
	​  = ASSERT(good_call)

But I don't know why, or how to go about figuring it out.

Oh, and BTW, this is D4394 and has its own branch, wip/tdammers/D4394. The error can be reproduced by running validate on this branch.

Oh, and BTW, this is D4394 and has its own branch, wip/tdammers/D4394. The error can be reproduced by running validate on this branch.

I have fixed the bug and pushed to wip/tdammers/D4394

I have fixed the bug and pushed to wip/tdammers/D4394

I see that there are two commits: one that fixes the bug, and one that moves things around and adds a HasCallStack for debugging purposes. Should the "debug only" commit go into the patch as well, or should I filter it out before submitting?

Should the "debug only" commit go into the patch as well, or should I filter it out before submitting?

Make it a separate patch -- but do commit it. It's generally a good thing

tdammers updated this revision to Diff 15940.Apr 5 2018, 6:43 AM
  • Fix bug in invocation of decomposeFunCo

Separate patch with "debug only" improvements in D4570.

tdammers updated this revision to Diff 15996.Apr 11 2018, 6:31 AM
  • Bump perf test targets
tdammers updated this revision to Diff 16004.Apr 12 2018, 10:42 AM
  • Added SCCs to hunt down Trac #14683
  • Improve coercionKind(Role) perfomance
  • Cache coercion roles in NthCo
bgamari edited the summary of this revision. (Show Details)Apr 16 2018, 10:13 AM

Trac Trac #15019 lists three test cases that compile more slowly with this patch

tdammers edited the summary of this revision. (Show Details)Apr 19 2018, 9:14 AM
tdammers edited the test plan for this revision. (Show Details)
tdammers updated this revision to Diff 16121.Apr 20 2018, 3:48 AM
  • Add SCCs to hunt down Trac #14683
  • Add regression test (Grammar.hs from Trac #14683)
  • Improve coercionKind(Role) perfomance
  • Cache coercion roles in NthCo
This revision was not accepted when it landed; it landed in state Needs Review.Apr 20 2018, 9:32 AM
This revision was automatically updated to reflect the committed changes.