hadrian: optimise Rules.Compile

Authored by alpmestan on Dec 4 2018, 2:58 PM.



Previously, as reported in Trac #15938, resuming a build "in the middle",
e.g when building _build/stage1/libraries/base/, hadrian would take up to
a whole minute to get started doing actual work, building code.

This was mostly due to a big enumeration that we do in Rules.hs, to generate
all the possible patterns for object files for 1) all ways, 2) all packages
and 3) all stages. Since rule enumeration is always performed, whatever the
target, we were always paying this cost, which seemed to grow bigger the
farther in the build we stopped and were resuming from.

Instead, this patch borrows the approach that we took for Rules.Library in
https://github.com/snowleopard/hadrian/pull/571, which exposes all the
relevant object files under as few catch-all rules as possible (8 here),
and parses all the information we need out of the object's path.

The concrete effect of this patch that I have observed is to reduce the
45-60 seconds pause to <5 seconds. Along with the Shake performance
improvements that Neil mentions in Trac #15938, most of the pause should
effectively disappear.

Diff Detail

rGHC Glasgow Haskell Compiler
Automatic diff as part of commit; lint not applicable.
Automatic diff as part of commit; unit tests not applicable.
alpmestan created this revision.Dec 4 2018, 2:58 PM

I'm importing the existing "infrastructure" for parsing from Rules.Library but I suspect a refactoring might be nice, putting the common utilities in a dedicated module. I'm not inspired for the name/place of that module, so I went for the shortest route to a working patch.

Suggestions welcome.

snowleopard requested changes to this revision.Dec 4 2018, 4:49 PM

@alpmestan Many thanks! In general this looks great, but I left a few comments/questions for you.


You removed the TODO, but we still have a few more rules that are generated for every vanilla context -- see forM_ vanillaContexts below. I suggest we keep (some variant of) this TODO, but perhaps make it more concrete by pointing out the rules that are still generated instead of being parsed.


Should be "...object and Haskell interface files"


Looks like this function is never called with r /= root, so we could just get rid of the parameter r.


This is just a question: Do we really need to separate Haskell from other languages? Could we have a more uniform representation for all possible source languages, without treating Haskell specially? I have a feeling that this could lead to a simplification of the code below, but I may be wrong.


Seems like a temporary hack.


Shall we move generally-useful functions like this one, parseWaySuffix above, and parseWayUnit below to Way.hs?

This revision now requires changes to proceed.Dec 4 2018, 4:49 PM
alpmestan updated this revision to Diff 19010.Dec 4 2018, 5:14 PM
alpmestan marked an inline comment as done.
  • doc improvement

Thanks for the quick feedback!


It is indeed never called with r /= root except that root isn't in scope in the where clause, we only get our hands on it in the do block, it's not an argument.


I think it's better to keep this patch's current approach

To sum things up, we could probably have a more uniform representation, but that would quite likely make illegal things representable. While the representation I use in this patch allows us to model only things that make sense, and to model all of them.

We're never going to find a .hi file under a .../build/cmm/ directory, and our types reflect that. hi files are build artifacts that result from compiling Haskell modules. It also nicely reflects the directory structure, so there are less hoops to go through when reasoning about them:

  • NonHsObjects live under .../build/{c, cmm, s}/ dirs, they don't produce anything but object files.
  • HsObjects live under .../build/ (possibly in nested directories, the first one not called {c, cmm, s}) and can produce (boot and non boot) object and interface files.

Yes, see my comment about the refactoring (first comment, not in the initial description of the patch).


See my comment about the refactoring. I'd rather consolidate all that infrastructure into a dedicated module or two, instead of scattering around everything next to where the types it is related to live.

In a very distant universe, we could even imagine going for this approach quite radically, where everything we may ever care about to generate/build/etc is expressed with such types, and where we can bidirectionally go from/to file path to/from structured Haskell representation, possibly preventing the need for any enumeration. It would also provide a nicely structured and inspectable representation of "build plans" etc. This of course omits all the extra things we do with our Args machinery. But there might be some merit to thinking along those lines, who knows.

If we were to make any step further in that direction, I think I would want to supply all the building blocks in a common place, perhaps under Hadrian. since it's supposed to the the "library" part.

snowleopard added inline comments.Dec 4 2018, 5:33 PM

I see, makes sense.


Let me make my vague comment a bit more concrete. Here is what you currently have:

data SourceLang = Asm | C | Cmm
  deriving (Eq, Show)

data Extension = Extension Way SuffixType
  deriving (Eq, Show)

data HsObject = HsObject Basename Extension
  deriving (Eq, Show)

data NonHsObject = NonHsObject SourceLang Basename Way
  deriving (Eq, Show)

data Object = Hs HsObject | NonHs NonHsObject
  deriving (Eq, Show)

I can see some overlap between Haskell and non-Haskell data types (Basename and Way are repeated in alternatives of the sum types).

Why can't we have the following instead?

data Language = Haskell SuffixType | Asm | C | Cmm
  deriving (Eq, Show)

data Object = Object Basename Way Language 
  deriving (Eq, Show)

This seems to be isomorphic to your current type, but is simpler and more uniform. There are no illegal combinations here.


Moving a generic part of this extension-parsing infrastructure to the Hadrian library sounds great!

It's perfectly fine if you prefer to do this in a separate patch, although perhaps leave a TODO to document the intent.

alpmestan marked 4 inline comments as done.Dec 5 2018, 4:14 AM
alpmestan added inline comments.

I was going to say I disagree, until I imagined your last code snippet with ObjectType instead of Language. Then it makes a lot more sense. An object is the basename of the file we're building, the way we're building it in and an "object type": object file from asm code, etc. Then it all becomes a lot easier to work with for me.

It might be a bit tricky though, since when we're parsing, we find out whether we're dealing with asm/c/cmm at the very beginning, while we only get the SuffixType at the very, very end. If this becomes too painful to work with, I'll come back here and suggest that we keep the current data types, as they are. A little bit more uniformity IMO shouldn't justify making the parsers too complex.

alpmestan added inline comments.Dec 5 2018, 4:40 AM

Hmm, I might get away with a bit of knot tying. Stay tuned.

alpmestan added inline comments.Dec 5 2018, 4:57 AM

I'm afraid my idea won't work, as ParsecT isn't a MonadFix. I will only push the refactoring of common bits into the dedicated module and suggest that we keep the types as they are for now. The fact that this unified Language data type would involve data that we parse either at the very beginning or at the very end really makes things a bit too painful in my opinion.

alpmestan updated this revision to Diff 19014.Dec 5 2018, 5:01 AM
  • move common path parsing bits in Hadrian.BuildPath
snowleopard added inline comments.Dec 5 2018, 11:28 AM

Right I see. Thanks for giving this idea a try.

Maybe it's worth adding a comment here, which says that logically we want to have the following isomorphic data type:

data ObjectType = Haskell SuffixType | Asm | C | Cmm
  deriving (Eq, Show)

data Object = Object Basename Way ObjectType
  deriving (Eq, Show)

But for the sake of keeping the parser simple, we use a somewhat less uniform encoding. This may help if someone (like me) is wondering about why this particular encoding has been chosen.

alpmestan marked 3 inline comments as done.Dec 6 2018, 4:40 AM
alpmestan added inline comments.

Sure thing, I'll add a quick note about this shortly.

alpmestan updated this revision to Diff 19032.Dec 6 2018, 4:49 AM
  • fix BuildPath docs
  • add a note about object files representation

Alright, I think I addressed all of your feedback. How is this patch looking now?

snowleopard accepted this revision.Dec 6 2018, 9:33 AM

@alpmestan Many thanks! Your patch looks great now.

This revision is now accepted and ready to land.Dec 6 2018, 9:33 AM
This revision was automatically updated to reflect the committed changes.