Don't use a generic apply thunk for known calls
Closed, Public

Authored by sgraf on Dec 5 2018, 8:59 AM.

Details

Summary

Currently, an AP thunk like sat = f a b c will not have its own entry
point and info pointer and will instead reuse a generic apply thunk
like stg_ap_4_upd.

That's great from a code size perspective, but if f is a known
function, a specialised entry point with a plain call can be much faster
than figuring out the arity and doing dynamic dispatch.

This patch looks at f's arity to figure out whether it is a known function; if so, the thunk is not lowered to a generic apply thunk.

Benchmark results are encouraging: no change in allocations, and 0.2% fewer
counted instructions.
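As a rough illustration (the function and numbers below are invented, not taken from the patch), this is the kind of binding the change affects:

```haskell
-- Hypothetical example, not from the patch: 'sat' is a lazily bound
-- saturated application of the known, arity-3 function 'f'.
-- Before this change, the closure for 'sat' reuses a generic apply
-- thunk from the stg_ap_*_upd family, which must inspect f's arity at
-- runtime before jumping to its code; with the change, GHC can emit a
-- specialised entry point for 'sat' that calls f's entry code directly.
module Main where

f :: Int -> Int -> Int -> Int
f a b c = a * b + c

example :: Int -> Int
example x =
  let sat = f x 2 3   -- thunk for a saturated call to a known function
  in if x > 0 then sat else 0

main :: IO ()
main = print (example 4)
```

The semantics are unchanged either way; only the generated entry code for the thunk differs.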

Test Plan

Validates locally

Diff Detail

Repository
rGHC Glasgow Haskell Compiler
sgraf created this revision. Dec 5 2018, 8:59 AM
sgraf added a comment. Dec 5 2018, 9:15 AM

Binary sizes went up by 0.1%, while these were the biggest winners (> 2%) wrt. instruction count:

Program          Instrs
cryptarithm1     -2.5%
lcss             -2.3%
paraffins        -3.8%
wheel-sieve2     -3.4%
Min              -3.8%
Max              +0.0%
Geometric Mean   -0.2%
osa1 added a comment. Dec 5 2018, 10:34 AM

This looks good to me. Because this trades binary size for performance, should this maybe be behind -O or -O2?

I'm wondering about the binary size changes in the benchmarks you highlighted. Could you also share the binary size changes for those? If this can buy us -3.8% runtime for +0.1% binary size, I think this is worth doing. What are the smallest and largest increases in binary size in nofib?

compiler/codeGen/StgCmmBind.hs
314

I think this would be more clear:

idArity fun_id == unknownArity
sgraf marked an inline comment as done. Dec 5 2018, 10:48 AM
In D5414#149260, @osa1 wrote:

This looks good to me. Because this trades binary size for performance, should this maybe be behind -O or -O2?

I'm wondering about the binary size changes in the benchmarks you highlighted. Could you also share the binary size changes for those? If this can buy us -3.8% runtime for +0.1% binary size, I think this is worth doing. What are the smallest and largest increases in binary size in nofib?

It's basically +0.1% for every program, with the exception of these:

Program          Size    Instrs
anna             +0.3%   -0.2%
expert           +0.2%   -0.0%
fluid            +0.2%   -0.1%
grep             +0.2%   -0.0%
infer            +0.2%   -0.4%
last-piece       +0.2%   -0.1%
lift             +0.2%   -0.0%
paraffins        +0.2%   -3.8%
prolog           +0.2%   -0.1%
scs              +0.3%   -0.0%
transform        +0.2%   -0.2%
veritas          +0.2%   +0.0%
Min              +0.1%   -3.8%
Max              +0.3%   +0.0%
Geometric Mean   +0.1%   -0.2%

I guess we could hide it behind -O2, but there are still some nice gains if applied to hot loops.

compiler/codeGen/StgCmmBind.hs
314

Neat, didn't know of that constant.

sgraf updated this revision to Diff 19022. Dec 5 2018, 10:49 AM
sgraf marked an inline comment as done.
  • Use unknownArity instead of 0
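The check under discussion can be sketched roughly as follows. This is a paraphrase of the review comments, not the actual diff; idArity and unknownArity are real names in GHC's compiler sources, but the wrapper function and its placement are invented for illustration:

```haskell
-- Rough sketch, assuming GHC-internal imports are in scope
-- (e.g. 'Id' with 'idArity', and 'unknownArity' from BasicTypes).
-- 'isKnownFunctionCall' is a hypothetical name, not from the patch.
isKnownFunctionCall :: Id -> Bool
isKnownFunctionCall fun_id = idArity fun_id /= unknownArity
```

Comparing against unknownArity instead of the literal 0 (osa1's suggestion) makes the intent of the test explicit.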
simonpj accepted this revision. Dec 5 2018, 5:11 PM

Looks good to me. Please create a ticket, though, describe the idea, and put the perf-change data in it.
Then you can refer to the ticket from the one-line code change.

This revision is now accepted and ready to land. Dec 5 2018, 5:11 PM
sgraf retitled this revision from "Don't use a generic apply function for known calls" to "Don't use a generic apply thunk for known calls". Dec 6 2018, 9:26 AM
sgraf updated the Trac tickets for this revision. Dec 6 2018, 9:39 AM
This revision was automatically updated to reflect the committed changes.

Yet another place where we want a way for the user to say whether they want to tune for size or speed, like gcc's -Os flag.

sgraf added a comment. Dec 7 2018, 3:13 AM

I created Trac #16007 to track opportunities for code size reduction, should we eventually implement -Os.