Add module ParUpsweep, and leverage it in GhcMake.parUpsweep to enable better parallelism
AbandonedPublic

Authored by duog on Aug 9 2017, 11:50 PM.

Details

Reviewers
bgamari
ezyang
austin
Trac Issues
#14095
Summary

This is a work in progress. There are many "TODO DOUG"s scattered through the
code, however nearly all of them are for small refactorings, improvements, or
reminders for documentation.

Thank you in advance for reviewing such a large patch! If you have any ideas for
better names, or anything else, please do share. I think comments on the
design should go on the ticket.

Test coverage of this code is not very good, because few tests exercise the
parUpsweep code path. I intend to rectify this in another diff, as well as add
some unit and (maybe) property tests for ParUpsweep.

The main challenge remaining is to properly separate typechecking and
simplifying, see the comment in HscMain.finish (awkwardly that comment is in a
previous diff). As is, because typechecking depends on simplification of all
dependencies (see GhcMake.hs:966), every module will always typecheck
after its dependencies have had their full simplified interfaces in the home
package table; so there isn't currently any gain in parallelism from having
a typechecking state.

The separation of parsing is not usually particulary useful, but I do expect it
to improve compilations with -fno-code.

I have done only cursory benchmarking all on a four core machine. Benchmarking
is done with the script in the gist below, with ghcs built with this commit and
with cb6cf3c53fbfae8fc2907660b91cfc14da775865, both built with the build.mk in
the gist. https://gist.github.com/duog/2462bc4317b3aceb2bcf288be7e261af
The script times cabal installing lens and all its dependencies into a sandbox,
giving cabal "-j1" and ghc "-j4 +RTS -A128m -RTS". So it does include
cabal configuring all the packages, as well as copying and registering.

This commit:
real 2m13.584s
user 4m10.932s
sys 0m13.040s
82652 .cabal-sandbox

cb6cf3c:
real 2m44.675s
user 4m11.476s
sys 0m12.980s
82656 .cabal-sandbox

About a 12% improvement in total time. Note that this test disables cabal's
parallelism, so we can't expect anything like that for a real cabal install lens.

I will be doing some better benchmarking, including of -fno-code, and will update
this summary when that data is available. I expect the effect to have a large
variance over packages, for example the compile time of lens itself seems to be
unaffected by this commit.

  • Commit message:

We now schedule work during a parallel upsweep very differently. There is still
one worker thread per module, however the workers are scheduled through the
new logic in ParUpsweep. They will block after completing:

  • parsing
  • simplifying

Machinery is present to block after desugaring as well, however this is not
leveraged. See the comment in HscMain.finish.

The earlier changes to move hsc_HPT from a HomePackageTable to an
IORef PackageTable, and to add the HscYield callbacks enable this blocking.

In addition to more fine-grained parallelism, the workers are scheduled by
priority to attempt to unblock as much work as possible. The estimate of
required work per task is currently very simple, see
GhcMake.calculateMakePriority.

There is some change in test output, because more modules will begin to compile
than before. See the change to T14075.stdout.

This code is not very well covered by the test suite, as many tests do not run
with -j >1, and so do not enter the parUpsweep code path. A future commit will
address this.

duog created this revision.Aug 9 2017, 11:50 PM
duog retitled this revision from Add MakeQueue, and leverage it in parUpsweep to enable more parallelism to Add module ParUpsweep, and leverage it in GhcMake.parUpsweep to enable better parallelism.Aug 9 2017, 11:52 PM
duog updated the Trac tickets for this revision.
duog added a subscriber: alexbiehl.
duog updated this revision to Diff 13475.Aug 10 2017, 2:25 AM

Fix lints.
Fix haddock problems.

duog updated this revision to Diff 13476.Aug 10 2017, 3:13 AM

Bump max_bytes_used for T1969 for now, to get that green tick.

duog edited the summary of this revision. (Show Details)Aug 10 2017, 1:22 PM
ezyang added a subscriber: ezyang.Aug 13 2017, 12:00 PM

It will take me some time to review this patch, but there are two things I would think about. First, consider using threadscope or a similar tool to get traces showing how long parsing/typechecking/codegen take as "bars" (think similar to nvrtc; Google it) which will help you get a qualitative perspective on how much your patch is helping. If you need a machine with more cores to test on I can volunteer one. Second, take care on whether or not all the thunks are being forced between stages, or if you're actually foisting work into a later phase (which can defeat parallelization.)

duog added a comment.Aug 13 2017, 5:36 PM

It will take me some time to review this patch, but there are two things I would think about. First, consider using threadscope or a similar tool to get traces showing how long parsing/typechecking/codegen take as "bars" (think similar to nvrtc; Google it) which will help you get a qualitative perspective on how much your patch is helping.

Yes, I have the beginnings of something like this, writing to the eventlog in the hscYield handlers. I've just been parsing it with ghc-events and putting the numbers in a database.

If you need a machine with more cores to test on I can volunteer one.

Yes! That would be great.

Second, take care on whether or not all the thunks are being forced between stages, or if you're actually foisting work into a later phase (which can defeat parallelization.)

This is a good point, but it's quite tricky. The types to force are HsParsedSource and either ModDetails or ModIface. None have NFData instances :(
I can use Outputable for HsParsedSource, and Binary for ModIface. I do worry that pretty printing an HsParsedSource might be in the same order of expense as parsing it though.

compiler/main/GhcMake.hs
813

s/it's/its There are many instances of this.

1094

unnecessary liftIO

@ezyang, just a friendly reminder that this needs review.

I'm not done reviewing, but let me add some initial comments.

compiler/main/GhcMake.hs
829

There is a map to ParUpsweepBuildModuleInfo here, but inside ParUpsweepState, there is also a map to BuildModuleState? What's the difference? It wasn't clear to me what logically these two data structures represent.

841

This will report an inexhaustive warning on GHC 7.10, and I think our support window still includes this release of GHC. Do an actual case on scc to eliminate it.

duog added a comment.Sep 26 2017, 3:19 PM

Thanks for your comments. I had lost steam on this a little bit. I still intend to:

  • Clean up all the TODOs
  • Get more tests to run parUpsweep
  • Produce some better benchmarks
compiler/main/GhcMake.hs
829

ParUpsweepState is abstract in GhcMake, only the type is exported from ParUpsweep. It is updated as the upsweep runs.

ParUpsweepBuildModuleInfo is a convenience data structure, with sundry per-BuildModule information that GhcMake.parUpsweep needs, that is computed when building the task graph.

Will add comments to this effect.

ezyang requested changes to this revision.Sep 30 2017, 4:54 PM

Had an offline conversation with duog about this patch. There is some delicate coroutining going on between the main scheduler thread of the new parallel compilation phase, and the "yield" mechanism inside HscMain. There should be a note on this.

On the other hand, I think I understand the strategy in this patch and think it is pretty good. duog is going to look into further hiding some of the complexity by introducing some sort of Shake style interface for declaring dependencies (rather than GhcMake knowing about the MVar dispatching for pausing and resuming builder threads), but I'm happy to merge this patch after the IORef HPT dependence is removed and the basic flow of the coroutines is documented in a note.

I also wrote some documentation while we were discussing the patch which is here: https://phabricator.haskell.org/P157

compiler/main/ParUpsweep.hs
753

It's a bit irritating we can't use 'async' to avoid reimplementing logic like this.

This revision now requires changes to proceed.Sep 30 2017, 4:54 PM

I mentioned this changeset to simonpj, and his thought was 12% perf improvement was not all that much for a decent increase in code complexity (2x speed up would be much better). Mentioning it on this ticket so I don't forget.

austin resigned from this revision.Nov 9 2017, 5:37 PM
duog abandoned this revision.Jan 1 2018, 4:29 PM

I'm going to rework this. We can get most of the benefit by reorganising DriverPipeline a bit to allow GhcMake to parallelise hscIncrementalCompile and hscGenHardCode.