[RFC] Introduce -ddump-mod-graph
Changes Planned | Public

Authored by bgamari on Aug 29 2017, 3:32 PM.

Details

Reviewers
austin
Summary

This dumps a JSON representation of the module import graph, similar to the -M
option, except that it also includes modules outside the home package. This is
for utilities like packunused.
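The revision summary doesn't show the dump's schema; purely as a hypothetical illustration (field names invented here, not taken from the patch), the output for a small program might look something like:

```json
{
  "modules": [
    { "name": "Main",      "home": true,  "imports": ["Data.List", "MyLib"] },
    { "name": "MyLib",     "home": true,  "imports": ["Data.Map"] },
    { "name": "Data.List", "home": false, "imports": [] },
    { "name": "Data.Map",  "home": false, "imports": [] }
  ]
}
```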

bgamari created this revision.Aug 29 2017, 3:32 PM
bgamari retitled this revision from Introduce -ddump-mod-graph to [RFC] Introduce -ddump-mod-graph.Aug 29 2017, 3:32 PM
hvr awarded a token.Aug 29 2017, 3:46 PM
hvr added a subscriber: hvr.Aug 30 2017, 9:11 AM

Does this also include dependencies added via TH's addDependentFile?

bgamari added a subscriber: duncan.Sep 10 2017, 8:03 AM

Does this also include dependencies added via TH's addDependentFile?

I spoke to @duncan about this and he said that he would rather not take this information from upsweep. Rather, I think he prefers to have a GHC mode where you can provide a list of modules on the command line and have the language extensions and dependencies of those modules emitted in a structured format. The goal here is to allow Cabal to parallelize compilation with module-level granularity.

I don't recall whether he also wanted the dependencies of the transitive closure of the given modules as well. It seems like it would be necessary to include these for Cabal to provide accurate error messages in the case of modules missing from other-modules.
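The "transitive closure" check described above amounts to a reachability computation over the import graph. A minimal sketch (types and names invented for illustration, not GHC or Cabal API):

```haskell
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

-- A toy import graph: each module maps to the modules it imports directly.
type ModGraph = Map.Map String [String]

-- Transitive closure of dependencies for a set of root modules, so a tool
-- like Cabal could check that every reachable module is actually listed
-- in other-modules.
reachable :: ModGraph -> [String] -> Set.Set String
reachable g = go Set.empty
  where
    go seen [] = seen
    go seen (m : ms)
      | m `Set.member` seen = go seen ms
      | otherwise =
          go (Set.insert m seen) (Map.findWithDefault [] m g ++ ms)
```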

This sort of information is already efficiently collected in GHC's HeaderInfo module, which only needs to partially parse the module. However, this direction would naturally preclude the ability to account for addDependentFile.
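To give a feel for why header-only scanning is cheap, here is a deliberately simplified, purely textual stand-in for what HeaderInfo does (the real thing uses GHC's lexer and handles far more syntax; `headerImports` is an invented name):

```haskell
import Data.Char (isSpace)
import Data.List (isPrefixOf, stripPrefix)
import Data.Maybe (mapMaybe)

-- Scan only the header of a module source for import lines, stopping as
-- soon as ordinary declarations begin.  Nothing after the header is ever
-- looked at, which is what makes this pass fast.
headerImports :: String -> [String]
headerImports src =
  mapMaybe importName
    (takeWhile headerish (map (dropWhile isSpace) (lines src)))
  where
    -- Lines that may legitimately appear before/between imports.
    headerish l =
      null l
        || "module " `isPrefixOf` l
        || "import " `isPrefixOf` l
        || "{-#"    `isPrefixOf` l
        || "--"     `isPrefixOf` l
    -- Extract the imported module name, skipping an optional "qualified".
    importName l = do
      rest <- stripPrefix "import " l
      let r  = dropWhile isSpace rest
          r' = maybe r (dropWhile isSpace) (stripPrefix "qualified " r)
      pure (takeWhile (not . isSpace) r')
```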

So it will be sort of like -M, but faster and without support for addDependentFile? The reason we don't use -M in Cabal is that running both -M and --make in succession slows down the compilation too much for large projects. What would be cool is if ghc --make/-ddump-module-graph could serialise and reuse the compilation graph info between runs, like https://github.com/ezyang/ghc-shake does. I think that supporting addDependentFile would be less of a problem then, plus it'd give us faster rebuild checking.

The problem with implementing module-level parallelism at the Cabal level is that running a lot of separate ghc -c processes in parallel is slower than ghc --make -j, because there is no shared interface-file cache in the former case. So maybe it'd be easier to make multiple parallel ghc --make -j processes coordinate better via a shared semaphore for limiting parallelism. I have an old unfinished patch set for that: https://github.com/23Skidoo/cabal/commits/num-linker-jobs.
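The semaphore scheme above can be sketched in-process. A real version coordinating several ghc --make -j processes would need an OS-level named semaphore rather than QSem; the code below (names like `runLimited` are invented) only illustrates the job-limiting idea:

```haskell
import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.MVar
import Control.Concurrent.QSem
import Control.Exception (bracket_)
import Control.Monad (forM_)

-- Acquire a worker slot for the duration of one job, releasing it even on
-- exceptions.
withSlot :: QSem -> IO a -> IO a
withSlot sem = bracket_ (waitQSem sem) (signalQSem sem)

-- Run all jobs, but allow at most n of them to be in flight at once.
runLimited :: Int -> [IO ()] -> IO ()
runLimited n jobs = do
  sem  <- newQSem n
  done <- newMVar (0 :: Int)
  forM_ jobs $ \job -> forkIO $ withSlot sem $ do
    job
    modifyMVar_ done (pure . (+ 1))
  -- Crude completion wait, good enough for a sketch.
  let waitAll = do
        k <- readMVar done
        if k == length jobs then pure () else threadDelay 1000 >> waitAll
  waitAll
```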

Good point about emitting language extensions info, that'd be useful.

What would be cool is if ghc --make/-ddump-module-graph could serialise and reuse the compilation graph info between runs, like https://github.com/ezyang/ghc-shake does.

This is precisely what this patch implements.

However, if I understand @duncan correctly, this is not what he had in mind.

The problem with implementing module-level parallelism on the Cabal level is that running a lot of separate ghc -c processes in parallel is slower than ghc --make -j because there is no shared interface file cache in the former case.

Right, I was also concerned about this. However, @duncan has said that the cost of multiple processes wasn't prohibitive in his experience.

What would be cool is if ghc --make/-ddump-module-graph could serialise and reuse the compilation graph info between runs, like https://github.com/ezyang/ghc-shake does.

This is precisely what this patch implements.

Hmm, I was under the impression that it only took care of the "serialise" part, not the "load & reuse" part that would give us Shake-style fast recompilation checks.

Right, I was also concerned about this. However, @duncan has said that the cost of multiple processes wasn't prohibitive in his experience.

Well, I think that at this point we need some numbers to decide the best way forward.

What would be cool is if ghc --make/-ddump-module-graph could serialise and reuse the compilation graph info between runs, like https://github.com/ezyang/ghc-shake does.

This is precisely what this patch implements.

Hmm, I was under the impression that it only took care of the "serialise" part, not the "load & reuse" part that would give us Shake-style fast recompilation checks.

Ahh, I see. Right, this patch definitely does not implement that. That being said, my assumption would be that construction of the module graph should be relatively cheap. Do you have evidence otherwise?

Right, I was also concerned about this. However, @duncan has said that the cost of multiple processes wasn't prohibitive in his experience.

Well, I think that at this point we need some numbers to decide the best way forward.

Indeed.

Hmm, I was under the impression that it only took care of the "serialise" part, not the "load & reuse" part that would give us Shake-style fast recompilation checks.

That being said, my assumption would be that construction of the module graph should be relatively cheap. Do you have evidence otherwise?

Well, the aforementioned https://github.com/ezyang/ghc-shake advertises faster recompilation checking as one of its main advantages. I can give you numbers on how much faster it is if you want, but it'll take me a bit of time. Maybe @ezyang can also chime in.

I like to cite the README at https://github.com/ndmitchell/ghc-make as a demonstration of why *anything* is better than what we have in ghc --make today. It's pretty simple, really: ghc --make *always* parses the module headers of every module in your package, no ifs, ands, or buts. A Shake-based approach can do a single database load and a quick probe of the timestamps of all files before deciding that no recompilation is needed.
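The "database load plus timestamp probe" check described above can be sketched as follows (the cache-of-mtimes format here is invented for illustration; real tools like ghc-shake keep richer metadata):

```haskell
import Control.Monad (filterM)
import Data.Time.Clock (UTCTime)
import System.Directory (doesFileExist, getModificationTime)

-- Given a cached snapshot of (file, mtime) pairs from the previous build,
-- decide whether anything needs recompiling -- without re-parsing a single
-- module header.  A file is stale if it vanished or its mtime changed.
needsRecompile :: [(FilePath, UTCTime)] -> IO Bool
needsRecompile cache = fmap (not . null) (filterM stale cache)
  where
    stale (path, oldTime) = do
      exists <- doesFileExist path
      if not exists
        then pure True
        else do
          newTime <- getModificationTime path
          pure (newTime /= oldTime)
```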

bgamari planned changes to this revision.Nov 2 2017, 10:49 AM

Bumping out of review queue. I would like to continue this, but I think we need a clearer picture of the desired design.