ErrUtils: Add timings to compiler phases
ClosedPublic

Authored by bgamari on Feb 27 2016, 4:43 PM.

Details

Summary

This adds timings and allocation figures to the compiler's output when run with
-v2 in an effort to ease performance analysis.

Todo:

  • Documentation
  • Where else should we add these?
  • Perhaps we should remove some of the now-arguably-redundant showPass occurrences where they are
  • Must we force more?
  • Perhaps we should place this behind a -ftimings instead of -v2
Test Plan

ghc -v2 Test.hs, look at the output

Diff Detail

Repository
rGHC Glasgow Haskell Compiler
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
bgamari updated this revision to Diff 6792.Feb 27 2016, 4:43 PM
bgamari retitled this revision from to ErrUtils: Add timings to compiler phases.
bgamari updated this object.
bgamari edited the test plan for this revision. (Show Details)
bgamari added a reviewer: simonmar.
bgamari updated this object.Feb 28 2016, 5:55 AM
bgamari edited edge metadata.
hvr edited edge metadata.Feb 28 2016, 6:41 AM

is there an example output you can show us?

Compiling a simple "Hello World" with -v2 looks like this

*** Chasing dependencies:
Chasing modules from: *hi.hs
Stable obj: []
Stable BCO: []
Ready for upsweep
  [NONREC
      ModSummary {
         ms_hs_date = 2016-02-27 22:14:33.634387223 UTC
         ms_mod = Main,
         ms_textual_imps = [(Nothing, Prelude)]
         ms_srcimps = []
      }]
*** Deleting temp files:
compile: input file hi.hs
*** Checking old interface for Main:
[1 of 1] Compiling Main             ( hi.hs, hi.o )
*** Parser:
!!! Parser: finished in 1.57e-4 seconds, allocated 85568 bytes
*** Renamer/typechecker:
*** Desugar:
Result size of Desugar (after optimization)
  = {terms: 15, types: 9, coercions: 0}
!!! Desugar: finished in 5.04e-4 seconds, allocated 241848 bytes
*** Simplifier:
Result size of Simplifier iteration=1
  = {terms: 19, types: 11, coercions: 0}
Result size of Simplifier = {terms: 19, types: 11, coercions: 0}
!!! Simplifier: finished in 3.22e-4 seconds, allocated 203448 bytes
*** Tidy Core:
Result size of Tidy Core = {terms: 19, types: 11, coercions: 0}
Created temporary directory: /tmp/ghc13332_0
*** CorePrep:
Result size of CorePrep = {terms: 23, types: 16, coercions: 0}
!!! CorePrep: finished in 2.0e-4 seconds, allocated 85488 bytes
*** Stg2Stg:
*** CodeGen:
!!! CodeGen: finished in 1.966e-3 seconds, allocated 1804696 bytes
*** Assembler:
Upsweep completely successful.
bgamari updated this object.Feb 28 2016, 10:11 AM
bgamari edited edge metadata.
ezyang added a subscriber: ezyang.Feb 28 2016, 2:22 PM
  • How does this interact with laziness (similar to should we force more, but the answer to that question is no). In particular, I know some of these passes are streamed.
  • Machine readability?
In D1959#57561, @ezyang wrote:
  • How does this interact with laziness (similar to should we force more, but the answer to that question is no). In particular, I know some of these passes are streamed.

Well, this is one of the concerns that I was trying to make reference to with my "should we force more" question. Yes, in particular native code generation is a bit tricky (that is Cmm -> native code). I believe I've dealt with this case about as well as we can; the entire transform from STG to native is one giant withTiming block (see HscMain.hs). Currently the cost of going from Core -> STG isn't accounted for, but doing so would require forcing the resulting STG which I wasn't sure would be worth it.

With respect to the Core-to-Core passes, their results are already forced by the diagnostics output report core sizes.

  • Machine readability?

Yeah, this is a fair point. I wasn't really sure how much I should sacrifice human readability for ease-of-parsing. Would you like to propose a syntax?

simonmar requested changes to this revision.Feb 29 2016, 9:42 AM
simonmar edited edge metadata.

Great! I suggest combining withTiming and showPass, because then at least we'll be sure we didn't forget one or the other.

Documentation

Mention it under -dshow-passes.

Perhaps we should place this behind a -ftimings instead of -v2

I don't think another flag is necessary here.

This revision now requires changes to proceed.Feb 29 2016, 9:42 AM
simonmar edited edge metadata.Feb 29 2016, 10:11 AM
simonmar added a subscriber: niteria.

Great! I suggest combining withTiming and showPass, because then at least we'll be sure we didn't forget one or the other.

The only issue is that there are a few places where showPass is used where we don't really do any amount of work (or, rather, we don't force the result; e.g. TidyPgm.mkBootModDetailsTc).

Documentation

Mention it under -dshow-passes.

Sure.

Perhaps we should place this behind a -ftimings instead of -v2

I don't think another flag is necessary here.

Alright.

bgamari updated this revision to Diff 6919.Mar 13 2016, 6:33 AM
bgamari edited edge metadata.
  • Add more withTiming sites
  • Kill CoreLint.showPass
  • Documentation
michalt added inline comments.
compiler/main/ErrUtils.hs
469

I'm probably missing something, but it seems that force_result is not actually used by the call sites (it's just const ())..?

bgamari added inline comments.Mar 13 2016, 10:26 AM
compiler/main/ErrUtils.hs
469

Nope, you have not missed anything. Indeed it's currently unused. I'm currently a bit undecided whether this should be the case; the point of this argument is to allow the result to be forced only when timings are enabled, eliminating any performance cost in the non-timed case. It turned out that none of the call sites obviously required any additional forcing (although I need to look more closely at a few of them).

michalt added inline comments.Mar 13 2016, 1:41 PM
compiler/main/ErrUtils.hs
469

Oh, I see. One option would be to just provide a wrapper of withTiming that uses const (). In any way, thanks for explanation!

bgamari updated this revision to Diff 6924.Mar 13 2016, 3:31 PM
bgamari edited edge metadata.
  • Add more withTiming sites
  • Kill CoreLint.showPass
  • Documentation
  • Output timings in milliseconds
  • Show a bit more precision
  • ErrUtils: Less precision in allocation reporting
simonmar added inline comments.Mar 15 2016, 4:19 AM
compiler/main/ErrUtils.hs
476–491

Sorry I didn't notice this before, but there's a much better way to count allocations: https://phabricator.haskell.org/diffusion/GHC/browse/master/libraries/base/GHC/Conc/Sync.hs;de4df6b41a227f527e9eb77733cd6c87b069d3d0$177-206

(I think this only works in GHC >= 7.10 though)

simonmar requested changes to this revision.Mar 15 2016, 4:20 AM
simonmar edited edge metadata.
This revision now requires changes to proceed.Mar 15 2016, 4:20 AM
bgamari added inline comments.Mar 15 2016, 4:45 AM
compiler/main/ErrUtils.hs
476–491

Ahh, great. No problem!

bgamari updated this revision to Diff 6975.Mar 20 2016, 5:56 PM
bgamari edited edge metadata.

Use getAllocationCounter

bgamari updated this revision to Diff 6976.Mar 20 2016, 6:22 PM
bgamari edited edge metadata.

Fix it

simonmar accepted this revision.Mar 21 2016, 5:34 AM
simonmar edited edge metadata.

Great! Please fix the long lines though.

This revision is now accepted and ready to land.Mar 21 2016, 5:34 AM
bgamari planned changes to this revision.Mar 21 2016, 2:35 PM

Going to add module names where possible.

bgamari updated this revision to Diff 7011.Mar 23 2016, 9:07 AM
bgamari edited edge metadata.

Report more context in timing output

This revision is now accepted and ready to land.Mar 23 2016, 9:07 AM

I should note that the streaming nature of the C-- passes makes the output a bit messy; in particular CmmLint is run for each CmmGroup and consequently we see many timings for this pass. It seems like we would ideally aggregate these costs, although this could get quite ugly to implement.

I'll try fixing the long lines next.

bgamari updated this revision to Diff 7013.Mar 23 2016, 10:08 AM
bgamari edited edge metadata.

Fix long lines

This revision was automatically updated to reflect the committed changes.
bgamari edited edge metadata.Mar 24 2016, 6:47 AM
bgamari updated the Trac tickets for this revision.
niteria added inline comments.Mar 9 2018, 11:10 AM
compiler/main/ErrUtils.hs
500

getCPUTime probably works when using a single thread, but when used with multiple threads, the time flows faster (num_threads times faster in the extreme case) making the results hard to interpret

bgamari added inline comments.Mar 9 2018, 12:05 PM
compiler/main/ErrUtils.hs
500

Oh dear, quite right. It would be nice if we had an analogue to getCPUTime that would report per-thread statistics. With pthread_getcpuclockid and some RTS instrumentation this should be possible.