Introduce ExpPatFrame
Needs RevisionPublic

Authored by int-index on Dec 3 2018, 8:25 PM.

Details

Summary

The first step in the ExpPatFrame refactoring: https://ghc.haskell.org/trac/ghc/wiki/Design/ExpPatFrame

  • Create a new intermediate type, ExpPatFrame.
  • Change Parser.y to produce ExpPatFrame for expressions.
  • Write checkExpr of type LExpPatFrame -> LHsExpr GhcPs.
  • Modify checkPattern to have type LExpPatFrame -> P (LHsPat GhcPs).
  • Modify checkCommand to have type LExpPatFrame -> P (LHsCmd GhcPs).

The next step is to change checkExpr to work in the P monad and validate expressions there.

Test Plan

Validate

int-index created this revision.Dec 3 2018, 8:25 PM
int-index updated this revision to Diff 19001.Dec 4 2018, 4:54 AM

Rebase, use LExpPatFrame in 'checkValSigLhs'

int-index updated this revision to Diff 19003.Dec 4 2018, 10:21 AM

Refactor 'checkCommand' to use ExpPatFrame

int-index edited the summary of this revision. (Show Details)Dec 4 2018, 10:25 AM
int-index edited the summary of this revision. (Show Details)
int-index updated this revision to Diff 19005.Dec 4 2018, 10:37 AM

Remove outdated TODO

Thanks for doing this.

I'm very doubtful about the design though. Do we really need yet another huge data type?

We have the Trees That Grow infrastructure https://ghc.haskell.org/trac/ghc/wiki/ImplementingTreesThatGrow, precisely for the case where we have some ad-hoc extension of a data type. The overall goal is to have one Haskell data type, used for HsSyn and template Haskell, rather than two. This patch moves in the opposite direction.

Ant I think it's not hard. We can use HsExpr as now, but put EWildPat etc in the extension constructor, so that it's only present in HsExpr GhcPs and not in any other variant. Or you could even have a new HsExpr GHcParser, very short lived, just serves an intermediate in the parsing step.

It's not clear to me how the new stuff solves the problems under "The Problem" on the wiki page https://ghc.haskell.org/trac/ghc/wiki/Design/ExpPatFrame, but I expect that the TTG approach could adopt those same solutions.

I'm very doubtful about the design though. Do we really need yet another huge data type?

Well, I still think it should be a dedicated data type, huge or not. The TTG infrastructure does not account for this case.

First, as you noted, this type does not correspond to GhcPs, GhcRn, or GhcTc. It stores a part of the user input in a pre-GhcPs stage. When we have HsExpr GhcPs, it is an expression which is a result of parsing, but ExpPatFrame is something that is used *during* parsing, so if we introduced it with TTG, we would create a new stage GhcPrePs for it.

Second, this type is not an extension of HsExpr or HsPat - it's a hybrid between them. We cannot use TTG for it: what type do we extend, HsExpr or HsPat? And in the future, when I plan to parse types into it as well, do we extend HsExpr, HsPat, or HsType? These are equally good candidates, the arbitrary choice of HsExpr does not feel right.

Third, one of the planned modifications to ExpPatFrame is to replace FrameApp, FrameOpApp, FrameStatic, and FrameAppType, etc, with a constructor called FrameInfix which would store a sequence of operators/operands, and then put together a HsExpr or HsPat using mergeInfixExp and mergeInfixPat - similar to mergeOps that we have for types today. This change is what will fix Trac #1087. (I tested it in my prototype, linked from the Wiki page). At this point, it won't be merely an extension/union of HsPat/HsExpr - it will have some things removed, some things added.

There's often a trade-off between DRY and loose coupling. I prefer loose coupling in this case: this type is only ever used in the parser to dump an intermediate result of parsing. This result is neither an expression nor a pattern, it's an intermediate ADT with whatever the parser needs to put into it. Fitting it into the TTG infrastructure seems to be a futile exercise.

we would create a new stage GhcPrePs for it.

Yes, that'd be fine.

We cannot use TTG for it: what type do we extend, HsExpr or HsPat?

Why "cannot"? Personally I'd extend HsExpr because that would mean that in the common case there little or no work to do.

I'm afraid I don't understand the bit about "in the future I plan to parse types into it".

Third, one of the planned modifications to ExpPatFrame is to replace FrameApp, FrameOpApp, FrameStatic, and FrameAppType, etc, with a constructor called FrameInfix which would store a sequence of operators/operands,

That'd be fine: FrameInfix would simply be in the extension constructor part, along with EWildPat etc.

At this point, it won't be merely an extension/union of HsPat/HsExpr

But actually there'd be nothing wrong with still having HsApp etc in the type; perhaps in some situations you do know exactly what is happening so you can produce HsApp directly.

There's often a trade-off between DRY and loose coupling.

I agree. Sometimes repeating yourself is better. I'm just not yet convinced that this is one of them!

Perhaps the most concrete way to explore this would be to make a tiny version with only a few constructors, showing what would be necessary in each approach?

int-index added a comment.EditedDec 6 2018, 7:07 AM
we would create a new stage GhcPrePs for it.

Yes, that'd be fine.

We cannot use TTG for it: what type do we extend, HsExpr or HsPat?

Why "cannot"? Personally I'd extend HsExpr because that would mean that in the common case there little or no work to do.

There would be the same amount of work: we will need a complete pass to convert from HsExpr GhcPrePs to HsExpr GhcPs.

I'm afraid I don't understand the bit about "in the future I plan to parse types into it".

I will add FrameForallTy, FrameQualTy, and other constructors from HsType to ExpPatFrame. This will be useful when we add features where we don't know whether we're parsing a type or a term, for instance visible dependent quantification in terms, fn :: forall a -> ..., and other Dependent Haskell related features.

Perhaps the most concrete way to explore this would be to make a tiny version with only a few constructors, showing what would be necessary in each approach?

A tiny version of what, the TTG variation? Well, I do see how it would work, I'm just very reluctant to commit to HsExpr as the base for the extension: we're not parsing a HsExpr, we're parsing an expression-pattern-looking thingy. Which can turn out to be HsExpr, HsPat, or even HsCmd (see checkCommand), and in the future - HsType. So we're going to have very similar-looking functions:

checkPattern :: ExpPatFrame -> P (HsPat GhcPs)
checkExpr :: ExpPatFrame -> P (HsExpr GhcPs)
checkCommand :: ExpPatFrame -> P (LHsCmd GhcPs)
checkType :: ExpPatFrame -> P (LHsType GhcPs)

but if we use HsExpr GhcPrePs as input, then it makes HsExpr somehow special, which it isn't. And again – we're not avoiding work in the "common" case, we will still have to do checkExpr to convert from HsExpr GhcPrePs to HsExpr GhcPs.

On the other hand, for Dependent Haskell we will probably have to merge HsExpr and HsType altogether, throughout the entire pipeline. So this bit of motivation is gone.

And between HsPat, HsCmd, and HsExpr, I can sort of agree that HsExpr is the "primary" case, although I'm not sure why, might be your influence :) Color me convinced, I will make a TTG-based variation with a separate GhcPrePs phase.

int-index added a comment.EditedDec 6 2018, 11:45 AM

@simonpj There are two options how I can introduce GhcPrePs:

  1. As another constructor here:
data Pass = Parsed | Renamed | Typechecked
         deriving (Data)

-- Type synonyms as a shorthand for tagging
type GhcPs   = GhcPass 'Parsed      -- Old 'RdrName' type param
type GhcRn   = GhcPass 'Renamed     -- Old 'Name' type param
type GhcTc   = GhcPass 'Typechecked -- Old 'Id' type para,
type GhcTcId = GhcTc                -- Old 'TcId' type param
  1. As an entirely separate declaration elsewhere (e.g. parser/RdrHsSyn.hs).

I'm going with option 2 for the following reasons:

  1. Having all passes listed in a single place isn't a modular design. I want the rest of the system to be unaware of GhcPrePs - the entire pass a parsing hack, after all.
  2. The extension type instances (e.g. type instance XXExpr GhcPs) will be set to parsing-related structures, which must not leak to hsSyn/HsExpr.hs or hsSyn/HsExtension.hs

Does this sound reasonable?

It will result in a sizeable amount of change to all Outputable instances, as they assume p ~ GhcPass pass. But I think it's a good change, we'd like to output non-GhcPass syntax trees too, do you agree?

Does this sound reasonable?

Actually my instinct is for option 1.

  • All of these passes are part of GHC, a single piece of software. It makes sense for HsExpr x to have x ::* so that x is open and can be added to later. But for GHC, it's great to use HsExpr (GhcPass p) where p :: Pass. Now Pass can be a closed type, and we know for sure when we have covered all cases.
  • We can define GHC's extension fields using a closed type family, with ordered equations and a fall-through. This makes it easy to see, for each pass, the type of each field is.

So I see it as a positive benefit that we enumerate the passes.

I'm afraid I didn't understand "2. The extension type instances ....will be set to parsing-related structures"

Thanks for pointing me to this.
I have taken a quick look.
I agree this is the right place to use TTG.
In fact, I pitched TTG in Japan (HiW'16), by demonstrating a bug (now patched), caused by the very confusion of Expr and Pat, that allowed compiling a module containing only the text`boo@ooo` leading to a GHC panic :)
The other Haskell parsers,e.g. Haskell-Src-Exts (HSE), also often have a "pre-parsing" (or "mid-parsing") AST.
I am not yet sure about some details here, like what suits parsing in the presence of ambiguity best. I will take a closer look at int-index's use of FrameX tomorrow.

Also, I am not sure I fully understand int-index's Option 2.

Having all passes listed in a single place isn't a modular design. I want the rest of the system to be unaware of GhcPrePs - the entire pass a parsing hack, after all.

As Simon mentioned, that's why we used an extension parameter of the "open kind" Type and open type families for the *top-level* extension fields and constructor.
For example, Haddock's AST also uses TTG now, and it is defined entirely in a separate module.

Also, as Simon mentioned, for simplicity and coherence, it is convenient to add a PreParsed constructor to Pass.
Though, at the moment I don't have any strong opinion against introducing a separate extension descriptor data GhcPrePs.

[int-index:]
The extension type instances (e.g. type instance XXExpr GhcPs) will be set to parsing-related structures, which must not leak to hsSyn/HsExpr.hs or hsSyn/HsExtension.hs

If you mean type instances like type instance XXExpr GhcPrePs to include parsing-related structures, and type instance XXExpr GhcPs to be set to Void (or an empty datatype NoNewCon), Yes! I agree.

It will result in a sizeable amount of change to all Outputable instances, as they assume p ~ GhcPass pass. But I think it's a good change, we'd like to output non-GhcPass syntax trees too, do you agree?

I agree we want to output non-GhcPass syntax trees.
We started the following

https://ghc.haskell.org/trac/ghc/wiki/HaskellSyntaxPrinters

and

https://ghc.haskell.org/trac/ghc/wiki/ImplementingTreesThatGrow/Instances

but soon we realised the performance cost of highly parametric Outputable (or Data) instances.
It is just a practical choice, especially for the GHC-specific debug-printer Outputable instances provide.

If we really are ignoring the extension points (the new fields and constructor) when debug-printing, there should be no need to mention p ~ GhcPass pass at all.
But, as far as I remember, we do want to print GHC-specific data stored in the extensions.

int-index added a comment.EditedDec 7 2018, 2:29 AM

To clarify, option one is:

data Pass = PreParsed | Parsed | Renamed | Typechecked
type GhcPrePs = Pass 'PreParsed

Option two is:

data GhcPrePs

Indeed, the second option has precedence with haddock.

So, why I'm rooting for the second option? I have two reasons.

Reason 1.

Consider the following TTG instances we have right now:

type instance XUnboundVar    (GhcPass _) = NoExt
type instance XConLikeOut    (GhcPass _) = NoExt
type instance XRecFld        (GhcPass _) = NoExt
type instance XRnBracketOut  (GhcPass _) = NoExt
type instance XTcBracketOut  (GhcPass _) = NoExt
type instance XWrap          (GhcPass _) = NoExt

For GhcPrePs, they (and some others) will have to be changed. For example, XUnboundVar will become:

type instance XUnboundVar    GhcPs = NoExt
type instance XUnboundVar    GhcRn = NoExt
type instance XUnboundVar    GhcTc = NoExt
type instance XUnboundVar GhcPrePs = Void

Why Void? That's because the Happy parser never produces these constructors, so during conversion we want to witness it using absurd :: Void -> a. Thus we're losing some uniformity in GhcPass _ instances because of GhcPrePs.

And when I proceed with the refactoring, the same transformation will also affect the following instances:

type instance XApp           (GhcPass _) = NoExt
type instance XAppTypeE      (GhcPass _) = NoExt
type instance XXExpr         (GhcPass _) = NoExt

Reason 2.

When I add parsing-specific structures, where do I put them? In my prototype:

data ExpPatFrame
  = ...
  | ...
  | FrameInfix [LFrameInfixEl] -- reversed
    -- ^ A sequence with infix elements: a + f x y +, !a + !b

type LFrameInfixEl = Located FrameInfixEl
data FrameInfixEl
  = FrameOpr ExpPatFrame
  | FrameOpd ExpPatFrame
  | FrameVisTy SrcSpan (LHsWcType GhcPs)
  | FrameMinus
  | FrameTilde
  | FrameBang
  | FrameDot
  | FrameStar Bool -- isUnicode
  | FrameStatic

With TTG, I will define:

data ParsingStructures
  = FrameInfix [LFrameInfixL] -- just this one, for now, but maybe others in the future

type instance XXExpr GhcPrePs = ParsingStructures

So, where does the definition of ParsingStructures go? I'd like to put it somewhere under the parser/ namespace, and the rest of the system would rather not know of its existence. However, how do I define the XXExpr GhcPrePs instance then?

  1. If it's defined in the hsSyn/HsExpr.hs module, then I will have to move ParsingStructures to HsExpr.hs as well.
  2. If it's defined in parser/RdrHsSyn.hs, as intended, then the instance will be an orphan, indicative of wrong architecture.

If I define data GhcPrePs in parser/RdrHsSyn.hs and all instances for it there as well, then I face no such issues.


but soon we realised the performance cost of highly parametric Outputable (or Data) instances.

@shayan-najd What exactly is costly? In the page you linked, I see:

every time the constraint set is enlarged as the next step of Trees that Grow goes in, the time taken to compile GHC increases.

Does this mean that if we avoid large constraint sets, we solve the performance issue? In this case, I have an idea:

class OutputableX p where
  withOutputableX :: (LargeConstraintSet p => r) -> r

instance OutputableX GhcPs where
  withOutputableX r = r

instance OutputableX GhcRn where
  withOutputableX r = r

instance OutputableX GhcTc where
  withOutputableX r = r

Then all Outputable instances will have just a single constraint, OutputableX:

instance OutputableX p => Outputable (HsExpr p) where
  ...

Instance search for LargeConstraintSet will be done only once per pass (in the OutputableX Ghc** instances), and then the pretty-printers will be able to call withOutputableX whenever they need to print an extension field, without overhead.

int-index added a comment.EditedDec 7 2018, 3:05 AM

All of these passes are part of GHC, a single piece of software.

@simonpj Isn't this line of reasoning exactly what leads to unmaintainable monolithic software? We should think of GHC as a collection of independent, smaller pieces of software, that we compose together.

I expect each pass to be aware only of its input and output structures:

  • The parser knows about String (its input) and hsSyn GhcPs (its output)
  • The renamer knows about hsSyn GhcPs (its input) and hsSyn GhcRn (its output)
  • The typechecker knows about hsSyn GhcRn (its input) and hsSyn GhcTc (its output)

If we add a pass before the parser, the only component in this list that should notice it is the parser! Why would the renamer or the typecheker need to know about GhcPrePs? And yet, with a closed type, they do notice: I have to recompile them (as the HsExpr module gets new instances), and they will get access to ParsingStructures.

Ideally, GhcPass should not exist, it's a witness of a monolithic architecture. Unless you're fine with monolithic, because for you it's easy to keep all of GHC in your head :)

I expect each pass to be aware only of its input and output structures:

As you point out, it's not an internal matter to a single pass, it's to do with the communication *between* passes.

Your argument has merit, clearly, but if we accepted it we should get rid of GhcPass and do everything with Type. What i think we should not do is to make two different choices for essentially the same decision. Let's be consistent.

Personally I like GhcPass because it allows me to enumerate the passes in one place; and for each extension field it ensures that the life-cycle of that field is described in one place. I like that because it gives me an overview of the information transformation pipeline, at least for that field. But yes, that very overview means I'm thinking of the thing as a whole -- and I want to!

I agree that it does mean that, the closed-type-family definition has to mention all of the types that fill that extension field in all of the passes. That's the biggest downside. But it's one that has not caused us much pain thus far, but perhaps it will become more pressing. (Remember that HsExpr itself should be completely independent of GHC, sitting in a separate library, and we should move in that direction.)

Anyway, currently we have the GhcPass thing. You can make the case for removing it, but perhaps that should be a separate patch.
But meanwhile, would you be willing to continue ot use GhcPass for now? I know you don't like it!

I'm worried about these gigantic instance contexts. I don't understand your suggested solution yet, Vladislav; could you give a concrete example?

This isn't really the place to discuss it either, else we'll derail this patch.

I've added a new Plan G to https://ghc.haskell.org/trac/ghc/wiki/ImplementingTreesThatGrow/Instances. Perhaps if you have Plan H, you could add it, Vladislav?

I have taken another look, and now understand int-index's argument.

I believe Simon has addressed all the concerns:

  • Until we get practically motivated (e.g., at the time we want to split off the parser as a separate library like Haskell-Src-Exts), we may want to keep using GhcPass (The uniformity provided by GhcPass makes many things easier practically. For example, we just noticed with pre-parsing being part of`GhcPass`, the Outputable do not need to change.) We had a relevant discussion on nested closed type families and dependencies at the last bullet point before Pros & Cons section at https://ghc.haskell.org/trac/ghc/wiki/ImplementingTreesThatGrow/HandlingSourceLocations
  • Thanks to using GhcPass the Outputable do not need to be changed
  • The further discussion about the interaction of TTG and type classes, e.g. int-index's suggestion, can be discussed via the wiki page (and probably emails to supplement it).

About the specific question

[int-index:]
@shayan-najd What exactly is costly?

Both solving the large constraint set, and all the large dictionaries passed around.
Just ignore the type classes for a minute, imagine a specialised variant, i.e. a single pprHsExpr function specialised for debug-printing expressions.
A parametric variant will have tens of function arguments just to debug-print the extension fields!
pprExpr :: (XHsVar x, XHsLit x, ...) => HsExpr x -> SDoc
We can discuss this further via the wiki/email.

About the specific comment

[int-index:]
type instance XUnboundVar GhcPs = NoExt
type instance XUnboundVar GhcRn = NoExt
type instance XUnboundVar GhcTc = NoExt
type instance XUnboundVar GhcPrePs = Void

just a side note that I believe UnboundVar should be Void for GhcPs too: it is introduced in renaming.
We have a long list of such corrections waiting to be committed after the [TTG: Handling Source Locations] patches.

I've added a new Plan G

Isn't it the same as Plan D, which for some reason is in the "Outdated/Infeasible" section? Is the only difference that we are concerned with Outputable only?

Perhaps if you have Plan H, you could add it, Vladislav?

Done. I added Plan H and Plan H*

Both solving the large constraint set, and all the large dictionaries passed around.

Passing a single large dictionary (as I suggest in Plan H*) shouldn't be that bad, right? It's just a single pointer, operationally.

But meanwhile, would you be willing to continue ot use `GhcPass` for now? I know you don't like it!

So not only we're making the dubious decision of extending HsExpr when parsing things that can be not-HsExpr (HsPat and HsCmd), we're also doing it in an anti-modular way.

Okay, not that I have a choice, given that I want to see progress in this direction (my end goal is to unify the term/type parsers).

I've added a new Plan G

Isn't it the same as Plan D, which for some reason is in the "Outdated/Infeasible" section? Is the only difference that we are concerned with Outputable only?

Perhaps if you have Plan H, you could add it, Vladislav?

Done. I added Plan H and Plan H*

Both solving the large constraint set, and all the large dictionaries passed around.

Passing a single large dictionary (as I suggest in Plan H*) shouldn't be that bad, right? It's just a single pointer, operationally.

Thanks for adding the sections, we will follow up on Outputable instances there.

But meanwhile, would you be willing to continue ot use `GhcPass` for now? I know you don't like it!

So not only we're making the dubious decision of extending HsExpr when parsing things that can be not-HsExpr (HsPat and HsCmd), we're also doing it in an anti-modular way.

I am confused about this comment; what is dubious about having HsExpr (GhcPass GhcPp), where GhcPp is reserved for Pre-Parsing, as the type of expression-like entities that includes HsPat and HsCmd?
Isn't it the same as your ExpPatFrame?
~~~
type ExpPatFrame = HsExpr (GhcPass GhcPp)
~~~
(and now you can remove all the duplicated datatype declarations)

int-index added a comment.EditedDec 7 2018, 7:25 AM

I am confused about this comment; what is dubious about having HsExpr (GhcPass GhcPp), where GhcPp is reserved for Pre-Parsing, as the type of expression-like entities that includes HsPat and HsCmd? Isn't it the same as your ExpPatFrame?

No, it's not the same. ExpPatFrame is a completely separate type from all three: distinct from HsPat, distinct from HsCmd, distinct from HsExpr. If we use HsExpr GhcPrePs, then it's an extension of HsExpr, but why HsExpr? Why aren't we using HsPat GhcPrePs or HsCmd GhcPrePs where we could put other fields as extensions? The choice of HsExpr doesn't come from any fundamental principle.

By the way, I'm facing the following issue. Consider this constructor:

| ExprWithTySig
              (XExprWithTySig p)
              (LHsExpr p)
              (LHsSigWcType (NoGhcTc p))

What do I do with the type when p ~ GhcPrePs?

  1. I can modify NoGhcTc to map GhcPrePs to GhcPs. If I do this, then the name NoGhcTc is no longer appropriate. Would that be OK?
  2. I can add another type family NoGhcPrePs and apply it in addition to NoGhcTc. This will likely lead to more constraints, we already have stuff like NoGhcTc id ~ NoGhcTc (NoGhcTc id)
  3. I can work with HsType GhcPrePs and convert it to HsType GhcPs, just like I'm converting HsExpr. This will lead to a certain amount of boilerplate.

Found another option how to deal with ExprWithTySig: not using it. I instantiate XXExpr GhcPrePs with

data PreCmdPat
  = PreWildPat
  | PreAsPat (Located (IdP GhcPrePs)) (LHsExpr GhcPrePs)
  | PreViewPat (LHsExpr GhcPrePs) (LHsExpr GhcPrePs)
  | PreLazyPat (LHsExpr GhcPrePs)
  | PreArrApp (LHsExpr GhcPrePs) (LHsExpr GhcPrePs) HsArrAppType Bool
  | PreArrForm (LHsExpr GhcPrePs) [LHsCmdTop GhcPrePs]

and I can add another constructor to it:

| PreTySig (LHsExpr GhcPrePs) (LHsType GhcPs)

It hardcodes GhcPs for LHsType - good enough for now. In the future HsExpr and HsType will become one type anyway, solving this issue completely.

int-index added a comment.EditedDec 7 2018, 12:17 PM

A few other stumbling blocks: HsLocalBinds (in HsLet, GRHSs) and Pat (in Match, Stmt).

I don't ever need these two to be parametrized by GhcPrePs, they should be GhcPs right away. If we parametrize them by GhcPrePs first and convert later, it will really bloat conversion code, and with runtime cost too (we will be traversing things we don't need to traverse).

Here are the problematic LHsLocalBinds and LPat:

data Match p body
  = Match {
        ...
        m_pats :: [LPat p],
        ...
  }

data HsExpr p
    ...
  | HsLet ...
                (LHsLocalBinds p)
                ...

data GRHSs p body
  = GRHSs {
      ...
      grhssLocalBinds :: LHsLocalBinds p
    }


data StmtLR idL idR body
  ...
  | BindStmt
             (LPat idL)
             ...

@simonpj @shayan-najd Do you have suggestions?

My view is that none of this is worth solving, because TTG isn't the right approach in this particular case (I like TTG, it's just not for this use case):

  1. We're not parsing a HsExpr, so extending HsExpr is not the right thing to do.
  2. GhcPrePs is not really a compiler pass: it is not associated with the entire AST. We will never have Pat GhcPrePs, or HsType GhcPrePs, or ClsInstDecl GhcPrePs, or FamInstEqn GhcPrePs, ... and so on.

This does not fit into the TTG architecture, which is designed to make the AST extensible for various compiler passes. But this is neither an extension nor a proper compiler pass!

I pushed my WIP to https://github.com/serokell/ghc/commit/aa4861a4fca6cc6926d6f7a5c538cb77ca243d7d. Do we really want to continue in this direction? ExpPatFrame has none of these issues, and it's completely local to the parser as a bonus. The only complaint about it that I've heard so far is that it's a "huge" data type, but with TTG we'll end up with approximately the same amount of LOC by the time we're done.

I'm late to this conversation, and I haven't looked at the code -- only the conversation. Here are my thoughts:

  • I generally side with @int-index on the design questions. (And there has been no collusion between us!)
  • Specifically, I support ExpPatFrame as a standalone datatype, independent of the TTG infrastructure. I enumerate my reasons (many in overlap with arguments made previously):
    • The combined type is neither an extension of HsExpr nor of HsPat. It belongs squarely between the two, and the choice between them is arbitrary.
    • Using TTG does not save passes.
    • The new datatype is entirely local to the parser. This is in contrast to the other GhcPasses which communicate between phases of the compiler. While I appreciate how TTG has all the information about an AST node in one place that models that node's life-cycle -- and I, like Simon, have found this valuable -- the case in this patch is different. The ExpPatFrame is so local, I would find its influence on AST design to be befuddling.
    • Without the use of yet more type families (which cause, e.g., problems with writing instances), the TTG approach would require passing through large swathes of AST that would not be changing. This is both a waste of code and a slowdown at runtime.
  • I do think that designing all of this with an eye toward merging types in is appropriate. I had two undergrads spend their summer attempting to merge the term-level and type-level parsers. Their work didn't really come to fruition, but we realized that there was no path forward, at all, without something like ExpPatFrame. It's true that dependent types (or even just visible dependent foralls in types of terms) would require more than just ExpPatFrame, ExpPatFrame is one step along the way.
  • I must say I dislike the name ExpPatFrame. How about PTerm (for "parser term")? In any case, don't let this derail the bigger issues.
  • I'll admit I see the other side of the story, in favor of the TTG approach. What really pushes me against TTG here is the locality of the new datatype. It will be used in Parser.y and RdrHsSyn, and nowhere else. So it makes sense for its definition to be in those modules, too.
  • Note that I'm not against TTG in general. Indeed, I sometimes wonder if we should do TTG in the Core datatypes so that we can easily have TcType be distinct from Type (and have it so that only TcType can hold a TcTyVar). This is a totally separate issue, but I see how TTG solves problems.... just not the one Vlad is trying to solve here.

The new datatype is entirely local to the parser. This is in contrast to the other GhcPasses which communicate between phases of the compiler.

That's just a question of what you consider a "pass"!

I don't ever need these two to be parametrized by GhcPrePs, they should be GhcPs right away. If we parametrize them by GhcPrePs first and convert later, it will really bloat conversion code, and with runtime cost too (we will be traversing things we don't need to traverse).

Ah, now *this* is convincing. I think you are saying this:

  • We often don't know (to begin with) whether we are parsing a term or a pattern; hence ExpPatFrame
  • We accept that we'll have to take a pass to convert ExpPatFram to Expr or Pat resp.
  • But when we see a let we know for sure that the bindings are terms, HsBinds GhcPs
  • So we absolutely do not want to treat that as HsBinds GhcPrePs and then convert it; extra code and extra runtime.
  • So the thing that distinguishes this "pass" is that it doesn't apply to the entire program; rather, we want to pre-parse little fragments, and the convert them to GhcPs.

Now that makes (to me) a much more convincing case for a new data type. Thanks.

But it raises the following question: how small can that datatype be. The point for let is that when we see a let we know we are in term-land. But isn't that also true for, say, do-notation? Could we not simply embed HsExpr GHcPs in ExpPatFrame, just as we do HsBinds GhcPs? Something like

data ExpPatFrame = ...
   | FrameExpr (HsExpr GhcPs)

Now, as soon as you find something that can only be an expression, you can embed it directly.

Would that not make the data type much smaller? Eg why do you have FrameStmt but not FrameBinds?

Would that not make the data type much smaller? Eg why do you have FrameStmt but not FrameBinds?

That's because of checkCommand. If we were only concerned with expressions and patterns, but not commands, we could eschew FrameStmt, FrameIf, FrameDo, FrameCase, etc.

How about PTerm (for "parser term")?

I don't have a strong preference for the name, but using the word "term" seems a bit misleading. Can we call a pattern a term? What about types, which I plan to parse into ExpPatFrame as well, can we call a type a term? My reasoning behind the name ExpPatFrame: it's a frame (something not completely fleshed out) for expressions and patterns. (I guess the proper name should be ExpCmdPatFrame, but that's too long).

So the thing that distinguishes this "pass" is that it doesn't apply to the entire program; rather, we want to pre-parse little fragments, and the convert them to GhcPs.

Yes, that's the idea. I'm happy we're on the same page now. I should've communicated more clearly. I extended the Wiki page with a summary of this discussion.

Regarding the FrameExpr idea, we should also include FramePat and FrameCommand then:

data ExpPatFrame = ...
  | FrameExpr (HsExpr GhcPs) -- unambiguously an expression
  | FramePat (HsPat GhcPs) -- unambiguously a pattern
  | FrameCommand (HsCmd GhcPs) -- unambiguously a command

and I added it to the Wiki, too. I will do this in a future iteration (need to add validation to checkExpr first).

Is there anything else holding up the merge? I'd like to move on to the next stage of the plan: adding validation to checkExpr.

(and I have updated the wiki page to summarise our email communication: https://ghc.haskell.org/trac/ghc/wiki/Design/ExpPatFrame#EmbeddingCommonTerms)

Unfortunately, while do or let cannot be used in patterns, they can be used in commands, so we end up duplicating most of HsExpr constructors for the sake of HsCmd. If not for this, we'd be able to make ExpPatFrame smaller.

Can you elaborate on this? I don't get it. Commands can't appear in patterns either. How big would the data type be if we didn't have to worry about commands?

Nevertheless, in the final iteration we will include constructors for unambiguous cases:

If it's good in the final iteration, isn't it good now? What will be different in the final iteration?

I'm sorry, I'm clearly still learning what the issues are.

int-index added a comment.EditedDec 10 2018, 5:25 PM

Can you elaborate on this? I don't get it. Commands can't appear in patterns either.

  1. do and if can appear in HsExpr with HsExpr as subtrees:
checkExpr (FrameIf c a b) = mkHsIf (checkExpr c) (checkExpr a) (checkExpr b)
checkExpr (FrameDo ctxt stmts) = mkHsDo ctxt (map checkExprStmt stmts)
  1. do and if can appear in HsCmd with HsCmd as subtrees:
checkCmd _ (FrameIf ep et ee) = do
    pt <- checkCommand et
    pe <- checkCommand ee
    return $ HsCmdIf noExt (Just noSyntaxExpr) (checkExpr ep) pt pe
checkCmd l (FrameDo DoExpr stmts) =
    mapM checkCmdLStmt stmts >>=
    (\ss -> return $ HsCmdDo noExt (cL l ss) )

Therefore, we need FrameDo (which becomes either HsDo or HsCmdDo) and FrameIf (which becomes either HsIf or HsCmdIf).

The general principle is that if we have a syntactic construct which could be interpreted in at least two ways depending on context, then it is considered ambiguous and we are in inherent need for an intermediate representation. The argument above shows that do and if are ambiguous: they are either expressions or commands. Other constructs (FrameTySig) are either expressions or patterns. And some (FramePar) can be all three.

How big would the data type be if we didn't have to worry about commands?

We would only include syntactic constructs that can be either expressions or patterns, not worrying about commands. So, since we know that do or if do not occur in patterns, we can leave them out. And lambdas, and case, and many others. We can get a precise answer by looking at the intersection of what checkPattern and checkExpr accept.

With commands, we need to look at the intersection of what checkPattern, checkExpr, and checkCommand accept. And when I get to unifying term/type parsers, also checkType (not currently defined).

If it's good in the final iteration, isn't it good now? What will be different in the final iteration?

FrameExpr is good now, but FramePat and FrameCmd aren't because checkExpr has no way to reject them (it's currently a pure function). I'd rather add all three at once, just so that we have a design as uniform as possible in each iteration.

We can get a precise answer by looking at the intersection of what checkPattern and checkExpr accept.

With commands, we need to look at the intersection of what checkPattern, checkExpr, and checkCommand accept. And when I get to unifying term/type parsers, also checkType (not currently defined).

This doesn't sound right. If it's intersections we're after, then adding more target types should reduce the size of the Frame type, and that's clearly not what's happening.

Actually, I think if we were to take this idea to its limit, we would have this:

  • Let E, P, and C stand for Expr, Pat, and Cmd, respectively.
  • Introduce four new types: EPFrame, ECFrame, PCFrame, and EPCFrame.
  • Each frame type contains as many forms as lie in the intersection between the types mentioned. This means, for example, that EPCFrame will be a subset of all the others.
  • Note that Expr, Pat, and Cmd are essentially frames of order 1 -- that is, the intersection is computed over only one set. (In contrast, EPCFrame has order 3.)
  • Adding T (for type) would give us 7 more frame types.

I actually think such a design would work out on a technical level and be optimal in some way, but it's far too baroque in reality. Instead, the Frame type in play here is actually the union of all frame types I've described here with an order greater than 1. That's how it's "intersection"-like but actually grows the more types we throw in.

I don't know if these observations are useful, but I think it gives us a principle when designing these types.

Good observation, thanks @goldfire. I couldn't decide if it's an intersection or a union, and now I see it's a union of intersections.

@simonpj Can we merge this, unless you have more questions/objections?

@simonpj Can we merge this, unless you have more questions/objections?

After this discussion I am much happier yes.

But I do have a request: could you summarise the discussion in the Note at the top of ExpPatFrame.hs? In particular:

  • The fact that we really want intersections: EPFrame, ECFrame etc, but to save data types we give up and take the union of those intersections.
  • The story I summarise in "Now, this is convincing" above.

And there is one more thing.

For HsCmd there really is no ambiguity at all! The grammar says

acmd    :: { LHsCmdTop GhcPs }
        : aexp2                 {% checkCommand $1 >>= \ cmd ->
                                    return (sL1 $1 $ HsCmdTop noExt cmd) }

The motivation here is NOT ambiguity but ONLY the desire not to duplicate the grammar of commands and expressions.

If that's it, then lets use HsExpr as before and convert to HsCmd. Or duplicate the grammar if you prefer. But the Command stuff dramatically blows up size of the Frame data type -- and for a completely different (and much less compelling) reason. Does that make sense?

int-index added a comment.EditedDec 18 2018, 12:24 PM

For HsCmd there really is no ambiguity at all! <...> The motivation here is NOT ambiguity but ONLY the desire not to duplicate the grammar of commands and expressions.

That's an interesting observation. I suppose we could also make use of parametrized productions to avoid duplication of the grammar. However, is there truly no ambiguity here? Consider:

proc x -> do { Con a b <- x } -- 'Con a b' is a pattern
proc x -> do { Con a b }      -- 'Con a b' is a command

Even if we don't have an expression/command ambiguity, we seem to have pattern/command ambiguity. So we'll end up with at least EPFrame and CPFrame, even if we don't need ECFrame.

On the other hand, could EPFrame and CPFrame be a single parametrized type, i.e. type EPFrame = PFrame HsExpr and type CPFrame = PFrame HsCmd? Then it's fine again. I will have to explore, thank you for this interesting thought. Maybe we'll end up with something small after all.

could you summarise the discussion in the Note at the top of ExpPatFrame.hs?

Of course. I will.

I will have to explore, thank you for this interesting thought. Maybe we'll end up with something small after all.

Thanks.

My instinct is to leave it exactly as-is; that is, parse commands as expressions and use checkCommand to convert. In the example you give

proc x -> do { Con a b <- x } -- 'Con a b' is a pattern
proc x -> do { Con a b }      -- 'Con a b' is a command

the expression will itself go via ExpPatFrame to resolve the pattern/expression ambiguity; but that's all over by the time checkCommand gets to see it.

You describe various ills that arise from parsing patterns as expressions and then converting. I don't think any of those ills apply to parsing commands as expressions, which seems straightforward.

By leaving commands as-is, you can make progress with the expression/pattern thing. Then if you want to get back to commands later, you still can

Thanks for doing this

bgamari requested changes to this revision.Jan 20 2019, 6:39 PM

Bumping out of the review queue while work continues.

This revision now requires changes to proceed.Jan 20 2019, 6:39 PM