Discard reflexive casts during Simplify
ClosedPublic

Authored by tdammers on Feb 7 2018, 4:53 AM.

Details

Summary

Trac Trac #14735 (derived from Trac Trac #11735) found that 75% of compile
time was being spent in simplCast. This patch is the first in a series
to deal with that problem.

This particular patch actually has very little effect on performance; it
just refactors simplCast so that it builds Refl coercions less often.
Refl coercions require us to compute the type to put inside them, and
even if that's done lazily it is still work and code. Instead we use
Maybe Coercion with Nothing for Refl. This change also percolates to
pushCoTyArg and pushValArg.

Diff Detail

Repository
rGHC Glasgow Haskell Compiler
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
tdammers created this revision.Feb 7 2018, 4:53 AM
bgamari updated the Trac tickets for this revision.Feb 27 2018, 9:23 AM

The git history seems confused here, which is perhaps my fault. I contributed the code in CoreOpt.hs and Simplify.hs, which is relevant for this patch. The other changes are unrelated (and are around caching NthCo roles, which is a separate optimization).

compiler/backpack/RnModIface.hs
668 ↗(On Diff #15392)

This seems unrelated -- I think perhaps it's a git goof.

Looks good. Some comments above.

I have not yet looked at Simplify -- that's next

compiler/backpack/RnModIface.hs
668 ↗(On Diff #15392)

This looks like part of the other patch (caching roles in NthCo)? Or else just a mistake -- I thought we weren't putting the cached role into IfaceNthCo.

compiler/coreSyn/CoreOpt.hs
933–939

Richard and I have agreed to define a type

data MCoercion = MRefl | MCo Coercion

This is isomorphic to Maybe Coercion but useful in a number of places, and super-helpful documentation. (eg MRefl is much more perspicuous than Nothing.

Define this in Coercion.hs I think.

compiler/iface/IfaceType.hs
263 ↗(On Diff #15392)

Same remarks about IfaceNthCo

compiler/types/Coercion.hs
32 ↗(On Diff #15392)

Is this from the other patch? If it's dead code it should certainly go.

compiler/types/OptCoercion.hs
309 ↗(On Diff #15392)

Wrong patch?

Nothing wrong in the simplCast stuff, but it does not do the job I think it needs to (comment:10 of Trac #11735).

I'd be happy to see this committed, but then there is more to do.

Input from Richard needed

compiler/coreSyn/CoreOpt.hs
986

Whey aren't they *both* Maybe Coercion? Really a question for Richard.

compiler/simplCore/Simplify.hs
1209–1210

Richard: this does nothing to address comment:10 of Trac Trac #11735.

I think we will still find bad perf in simplCast.

1212

addMCoerce would be a better name

goldfire added inline comments.Feb 28 2018, 9:30 AM
compiler/coreSyn/CoreOpt.hs
986

At the call side in simplCast#addCoerce, we ask for the kind of the first coercion. With the MCoercion abstraction, this would be more feasible. We could have a mcoercionKind :: Type -> MCoercion -> Pair Type that takes the type of the thing being cast as well as the coercion, returning that type in the MRefl case. As it is, this all seemed too awkward to be worthwhile. (The MCoercion makes it less awkward.)

compiler/simplCore/Simplify.hs
1209–1210

I removed the call to piResultTy, which you claimed was the problem. So I'm not sure what you're worried about here.

The git history seems confused here, which is perhaps my fault. I contributed the code in CoreOpt.hs and Simplify.hs, which is relevant for this patch. The other changes are unrelated (and are around caching NthCo roles, which is a separate optimization).

Hmm, yes, looks like an overly enthusiastic git commit -a or something like that. I'll see if I can untangle things and cherry-pick the intended changes.

tdammers updated this revision to Diff 15796.Mar 20 2018, 5:55 AM

Squash & remove commits that don't belong here

tdammers marked 4 inline comments as done.Mar 20 2018, 6:02 AM
tdammers added inline comments.
compiler/backpack/RnModIface.hs
668 ↗(On Diff #15392)

Looks like someone accidentally committed something that shouldn't go in here. Turns out I could just remove an entire commit (the one that says "fix unit test failures") - but those failures are the result of changes that we have put in D4394, and just happened to end up on the same branch.

compiler/types/Coercion.hs
32 ↗(On Diff #15392)

Yes, it is from that one commit that accidentally got in here.

compiler/types/OptCoercion.hs
309 ↗(On Diff #15392)

Probably, but I'm not 100% sure I picked the correct changes here. @goldfire, could you maybe take a quick look and tell us whether this is what needs to be here?

I believe the relevant ticket is Trac #14737, which focuses on this issue alone.

compiler/simplCore/Simplify.hs
1209–1210

Ah! I'd missed that. Yes, I think that solves the problem I was concerned about.

Tobias: previously simplCast came out top of the cost centres with lots of time (comment:44 of Trac #11735). Does this patch make it go away?

simonpj updated the Trac tickets for this revision.Mar 20 2018, 7:16 AM

Does this indeed solve the massive asymptotic performance problem reported in Trac #11735. Please make sure that the commit message makes this clear. Ideally with a test case...

simonpj accepted this revision.Mar 20 2018, 7:20 AM

And let's get this committed :-).

This revision is now accepted and ready to land.Mar 20 2018, 7:20 AM
tdammers marked 2 inline comments as done.Mar 21 2018, 12:23 PM

Hold back the horses, I've done some additional profiling, "just to make sure", and it turns out that, at least for the Grammar.hs test case that tripped this whole thing off, performance is actually getting over an order of magnitude *worse*, compared to GHC HEAD. I don't think we should land this just yet.

I'm wondering whether I missed something while cherry-picking commits for this patch, or whether it just doesn't work. Or maybe it works, but only when combined with D4394, which is more or less what I had been profiling before.

Hold back the horses, I've done some additional profiling, "just to make sure", and it turns out that, at least for the Grammar.hs test case that tripped this whole thing off, performance is actually getting over an order of magnitude *worse*, compared to GHC HEAD

Sigh. This patch is going on forever!

Try this code in Simplify

 -- If the first parameter is Nothing, then simplifying revealed a
 -- reflexive coercion. Omit.
addCoerce0 :: Maybe OutCoercion -> SimplCont -> SimplM SimplCont
addCoerce0 Nothing   cont = return cont
addCoerce0 (Just co) cont = addCoerce co cont

addCoerce :: OutCoercion -> SimplCont -> SimplM SimplCont

addCoerce co1 (CastIt co2 cont)
  = addCoerce (mkTransCo co1 co2) cont

addCoerce co cont@(ApplyToTy { sc_arg_ty = arg_ty, sc_cont = tail })
  | Just (arg_ty', m_co') <- pushCoTyArg co arg_ty
  = do { tail' <- addCoerce0 m_co' tail
       ; return (cont { sc_arg_ty = arg_ty', sc_cont = tail' }) }

addCoerce co cont@(ApplyToVal { sc_arg = arg, sc_env = arg_se
                         , sc_dup = dup, sc_cont = tail })
  | Just (co1, m_co2) <- pushCoValArg co
  , Pair _ new_ty <- coercionKind co1
  , not (isTypeLevPoly new_ty)  -- without this check, we get a lev-poly arg
                                -- See Note [Levity polymorphism invariants] in CoreSyn
                                -- test: typecheck/should_run/EtaExpandLevPoly
  = do { tail' <- addCoerce0 m_co2 tail
       ; if isReflCo co1
         then return (cont { sc_cont = tail' })
              -- Avoid simplifying if possible;
              -- See Note [Avoiding exponential behaviour]
         else do
       { (dup', arg_se', arg') <- simplArg env dup arg_se arg
            -- When we build the ApplyTo we can't mix the OutCoercion
            -- 'co' with the InExpr 'arg', so we simplify
            -- to make it all consistent.  It's a bit messy.
            -- But it isn't a common case.
            -- Example of use: Trac #995
       ; return (ApplyToVal { sc_arg  = mkCast arg' co1
                            , sc_env  = arg_se'
                            , sc_dup  = dup'
                            , sc_cont = tail' }) } }

addCoerce co cont
  | isReflexiveCo co = return cont
  | otherwise        = return (CastIt co cont)
         -- It's worth checking isReflexiveCo.
         -- For example, in the initial form of a worker
         -- we may find  (coerce T (coerce S (\x.e))) y
         -- and we'd like it to simplify to e[y/x] in one round
         -- of simplification

The main change is putting the isReflexiveCo call at the end (where it used to be) instead of having it at the beginning as well (which seems bizarre). That seems to make Grammar compile in the same time as before.

But I was expecting it to be faster. comment:57 of Trac Trac #11735 shows simplCast right at the top. This patch was supposed to fix it. But apparently not?

The patch is good anyway. But if it doesn't fix the problem, is simplCast still at the top of allocation list? And where is that time going?

But if it doesn't fix the problem, is simplCast still at the top of allocation list? And where is that time going?

Well, the SCC for simplCast is not present in the code as tested, but AFAICT it's very likely:

	Thu Mar 22 15:59 2018 Time and Allocation Profiling Report  (Final)

	   ghc-stage2 +RTS -p -RTS -B/home/tobias/well-typed/devel/ghc/D4395/inplace/lib ./cases/Grammar.hs -o ./a -fforce-recomp

	total time  =      249.36 secs   (249362 ticks @ 1000 us, 1 processor)
	total alloc = 346,942,479,184 bytes  (excludes profiling overheads)

COST CENTRE   MODULE    SRC                                        %time %alloc

SimplTopBinds SimplCore compiler/simplCore/SimplCore.hs:764:39-74   97.9   98.4

Meanwhile, this is what the base commit in master looks like:

	Thu Mar 22 15:43 2018 Time and Allocation Profiling Report  (Final)

	   ghc-stage2 +RTS -p -RTS -B/home/tobias/well-typed/devel/ghc/HEAD/inplace/lib ./cases/Grammar.hs -o ./a -fforce-recomp

	total time  =       20.51 secs   (20513 ticks @ 1000 us, 1 processor)
	total alloc = 24,649,590,120 bytes  (excludes profiling overheads)

COST CENTRE     MODULE     SRC                                                 %time %alloc

SimplTopBinds   SimplCore  compiler/simplCore/SimplCore.hs:756:39-74            73.8   77.8
tc_rn_src_decls TcRnDriver compiler/typecheck/TcRnDriver.hs:(490,4)-(552,7)      8.8    8.4
CoreTidy        HscMain    compiler/main/HscMain.hs:1257:27-67                   2.7    2.2
writeIface      HscMain    compiler/main/HscMain.hs:1283:9-45                    2.1    0.1
coercionKind    Coercion   compiler/types/Coercion.hs:1698:3-7                   1.5    3.4
zonkTopDecls    TcRnDriver compiler/typecheck/TcRnDriver.hs:(441,16)-(442,43)    1.3    1.3
deSugar         HscMain    compiler/main/HscMain.hs:512:7-44                     1.0    0.8

Both fresh clean builds using the prof build profile.

So it does indeed look like simplCast is still the culprit. I can re-run with the SCC added back in if you need solid proof (though this is rather time consuming).

So it does indeed look like simplCast is still the culprit. I can re-run with the SCC added back in if you need solid proof (though this is rather time consuming).

Are you saying that, *with the code I copy/pasted in the previous comment*, the perf is actually *a lot worse* than before the patch? That seems very implausible; it's such a benign change. (But NB you must use the code I copy/pasted. Your patch has isReflexive twice.)

I'm not astonished if it doesn't fix the original problem; I am astonished that it makes things worse.

Yes, please add that SCC back in, and drill in a bit. I was surprised when simplCast came out top two months ago. It looks as if we still don't understand why. So we need to drill. Thanks!

So it does indeed look like simplCast is still the culprit. I can re-run with the SCC added back in if you need solid proof (though this is rather time consuming).

Are you saying that, *with the code I copy/pasted in the previous comment*, the perf is actually *a lot worse* than before the patch? That seems very implausible; it's such a benign change. (But NB you must use the code I copy/pasted. Your patch has isReflexive twice.)

No, I haven't tried your code yet, my dev machine is still busy with two other builds. I'll try that as soon as these are finished. The performance difference is between a clean GHC HEAD (~20 seconds) vs. the patch as we see it right now (~270 seconds).

I'm not astonished if it doesn't fix the original problem; I am astonished that it makes things worse.

I find that somewhat disheartening myself, and I am actually baffled that GHC HEAD now performs so much better than when we set out to fix this. As a reminder, GHC 8.2.1 takes ~175 seconds to compile Grammar.hs, current HEAD takes ~20 seconds *without any of our patches applied*, and GHC HEAD + this patch applied takes ~270 seconds.

Yes, please add that SCC back in, and drill in a bit. I was surprised when simplCast came out top two months ago. It looks as if we still don't understand why. So we need to drill. Thanks!

Will do.

Are you saying that, *with the code I copy/pasted in the previous comment*, the perf is actually *a lot worse* than before the patch? That seems very implausible; it's such a benign change. (But NB you must use the code I copy/pasted. Your patch has isReflexive twice.)

OK, your patch brings execution time back down to 20 seconds, which is pretty much exactly on par.

I'll add this change to the diff and then dig into the simplCast part.

tdammers updated this revision to Diff 15856.Mar 26 2018, 6:48 AM
  • Fix huge performance regression

On further thought, I think it'd be better to get this landed first, and then look into further improvements.

tdammers updated this revision to Diff 15870.Mar 27 2018, 9:52 AM
  • Discard reflexive casts during Simplify
  • Fix huge performance regression

Wait a tick. Before your most recent commits, did addCoerce check isReflexiveCo twice? That's ridiculous (and my fault, I'm sure). That could be the reason for the slowdown. It might be worth doing performance tests to see if the check is better at the beginning or at the end... but it certainly shouldn't be in both places.

Wait a tick. Before your most recent commits, did addCoerce check isReflexiveCo twice? That's ridiculous (and my fault, I'm sure). That could be the reason for the slowdown.

I think it might, yes, but I don't think it explains all of the slowdown - you'd expect an improvement of 50% or less, not 90% like we do. So it seems there's more going on.

It might be worth doing performance tests to see if the check is better at the beginning or at the end... but it certainly shouldn't be in both places.

I'm doing some more performance-related digging on this matter, but I suggest we keep the discussion about that on Trac (https://ghc.haskell.org/trac/ghc/ticket/14737#comment:7).

tdammers updated this revision to Diff 15921.Apr 3 2018, 5:50 AM

Discard reflexive casts during Simplify

Squashed two commits which, together, should get performance back on par.

tdammers edited the summary of this revision. (Show Details)Apr 3 2018, 10:29 AM

@simonpj, could you quickly review & comment on the updated summary (identical to commit message)?

I suggest this commit message

Simplify simplCast

Trac Trac #14735 (derived from Trac #11735) found that 75% of compile time was being spent in simplCast. This patch is the first in a series to deal with that problem.

This particular patch actually has very little effect on performance; it just refactors simplCast so that it builds Refl coercions less often. Refl coercions require use to compute the type to put in side them, and even if that's done lazily it is still work and code. Insrtead we use Maybe Coercion with Nothing for Refl. This change also percolates to pushCoTyArg and pushValArg.

compiler/coreSyn/CoreOpt.hs
1008

Use funArgTy

tdammers updated this revision to Diff 15939.Apr 5 2018, 5:12 AM

Rewrote commit message & implemented suggested change.

tdammers edited the summary of this revision. (Show Details)Apr 5 2018, 5:13 AM
This revision was automatically updated to reflect the committed changes.