Pointer tagging for all families
Needs RevisionPublic

Authored by ggreif on Dec 14 2017, 10:25 AM.

Details

Reviewers
bgamari
simonmar
Trac Issues
#14373
Summary

Formerly big constructor families only got one tag: 1, now we are better

Test Plan

build and run testsuite, how to improve this code?

ggreif created this revision.Dec 14 2017, 10:25 AM

Looks reasonable to me. Have you done any performance characterisation?

compiler/codeGen/StgCmmExpr.hs
694

Can we add a reference to Trac #14373 here?

Self-critique.

compiler/codeGen/StgCmmExpr.hs
613

This looks like map (first succ) branches to me.

636

if you have a look into emitSwitch, you discover that it already has its join_lbl management. A helper that returns that would be handy here.

638

Jumping around the following island looks like a kludge to me. but not sure how to emit nicer code. Anyway I hope that the ASM/LLVM emitter will eliminate these jumps. Any c-- expert around who can give a clue?

642

The lambda looks like first pred to me.

ggreif marked 3 inline comments as done.EditedDec 14 2017, 3:49 PM

Some comments will be resolved when I push next.

ggreif updated this revision to Diff 14945.Dec 15 2017, 4:22 AM
  • first round of review feedback

Let's say we have a big family BigFam with

data BigFam = A | ... | Z

it has lots of constructors! We know that As can be discriminated by the tag bits and we know that we have to look at the info table for discriminating Z s.

IIUC this how you generate code for big families with your patch right now:

case e :: BigFam of x { -- look at `x` tag bits
  A -> ...              -- nice, we only had to look at `x`s tag bit for `A`s!
  ... -> ... 
  _  ->                 -- ok, x must be some constructor which can't be discriminated by the tag bits!
    case x of _ {       -- take a look at `x`s info table here
      ... -> ... 
      Z -> ...          -- ok, `x`s info table told us this must be `Z`
    }
}

But let's say we have a case expression which is only interested in Zs, you would generate something like this:

case e :: BigFam of x { -- again inspect the tag bits here
  _ ->                  -- we are only interested in constructors above our tag bits range
    case x of _ {       -- inspect `x`s info table here
      Z -> ... 
    }
}

So why not directly generate?

case e :: BigFam of x {  -- inspect `x`s info table here
      Z -> ... 
}

@alexbiehl I am working on a slight tweak to emitSwitch, which will automatically take care of the degenerate cases that can arise. Thanks for the input!

This is tricky stuff! I'm on board with it, but please document it well. Is Note [tagging big families] enough? I'd prefer to see some concrete examples.

ggreif updated this revision to Diff 14947.Dec 16 2017, 6:59 AM
  • note tweaks
  • Extra argument for emitting pre-join label code
ggreif marked 2 inline comments as done.Dec 16 2017, 7:04 AM

I have resolved all coding issues. Remaining stuff:

  1. tests
  2. comments
  3. perf measurements

CircleCI and Travis are both green.

Sounds great!

It's hard to review without

  • A careful explanation of the problem you are trying to solve
  • A description of the strategy you use to solve it
  • Examples to illustrate the code you want to generate in various cases

Does performance change? After all, that's the goal isn't it?

Thanks for working on this.

Simon

I like the idea. Can you put an example of the generated code in the test plan so we can check what's being generated?

What are the code size implications? Do you measure any difference with nofib?

We should document the fact that it's best to put the most commonly used constructors at the beginning in a large datatype, I think the User Guide has a section on optimisation tips. (in fact I think this is also true of small datatypes, due to the ordering of comparisons, but it would be good to document it along with this change anyway)

compiler/codeGen/StgCmmUtils.hs
456

This seems really strange. Nothing can fall through to this code because all of the branches jump to the join label, so this must be a self-contained labelled block, right? In which case why not emit it with emitOutOfLine?

@simonmar @simonpj Over the next few days I'll try to write up everything (on a wiki?)

compiler/codeGen/StgCmmUtils.hs
456

What I want is

switch [1..7] (R1 & 7) -- on ptr tag
  1 --> lbl0
  2 --> lbl1
  ...
  7 --> info_fallbackLbl

<pre-join insertion comes here>
info_fallbackLbl:
switch [6..20] (R1->infoTag)
  6 --> lbl6
  7 --> lbl7
  ...
</pre-join insertion>

joinLbl:  -- lbl0 .. lbl19 all finally branch here
  <continuation>

Normally <pre-join insertion> is empty (i.e. pure ()), but with my double-switch scheme I need a cozy place where the inner switch should go. I'll have a look at emitOutOfLine and come back with an opinion whether it serves my purposes well.

ggreif updated this revision to Diff 14955.Dec 18 2017, 6:44 AM
  • Handle the case when ptr tag is telling no story
ggreif added inline comments.Dec 18 2017, 6:48 AM
compiler/codeGen/StgCmmUtils.hs
462

I never trigger this case, I think it can be removed.

@alexbiehl I have added a test case for your question (suggestion)

make test TEST=T14373 and look for Main.lateSwitch_entry. You should see only one switch (on the info tag).

ggreif marked an inline comment as done.Dec 19 2017, 5:39 AM

@alexbiehl I have added a test case for your question (suggestion)

make test TEST=T14373 and look for Main.lateSwitch_entry. You should see only one switch (on the info tag).

Make sure to check for codesize impact on cache.
I made the experience with Trac #14201 that a function can get faster while still slowing down the overall program through (I assume) cache effects.

Also it sounds like this would make matches like case e of { A -> e1; Z -> e2 } marginally more expensive for e = Z.
It would be interesting to know how often we can hit this worst case before directly looking at the info table might become cheaper.

This comment was removed by ggreif.

@simonmar I found an outlier with instruction counts in no fib: spectral/lambda (+0.76%) so I went looking if this has negative effect on ruuntime. It does not, to the contrary:

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
         lambda          +0.4%      0.0%     -2.9%     -3.2%      0.0%
--------------------------------------------------------------------------------
            Min          +0.4%      0.0%     -2.9%     -3.2%      0.0%
            Max          +0.4%      0.0%     -2.9%     -3.2%      0.0%
 Geometric Mean          +0.4%     +0.0%     -2.9%     -3.2%     +0.0%

It brings a nice speedup compared to the baseline (my patch backed out).

Sure enough, spectral/lambda uses an 8-constructor ADT, so my code kicks in.
Is this roughly in line with the result in the pointer tagging paper?

I have added a pretty extensive note how the nested switches work

That is indeed a help. But it sits in a larger context: the invariants about pointer tagging. Are these invariants documented anywhere (except in the paper)? Is there a wiki page? For example, in the note you talk about switch [1..7], which takes for granted that you know that

  • there are 3 bits in the pointer we are using for tagging
  • 000 means "unevaluated" or at least "unknown"
  • 111 means "big tag: look in the data con info table"
  • xxx means "this points to a data constructor with tag xxx"

There's a strange <pre-join insertion comes here> fragment in the Note, which I didn't understand.

Looks good though!

Sure enough, spectral/lambda uses an 8-constructor ADT, so my code kicks in.

It'd be really good to add to ticky-ticky some measurements of how many case-expressions scrutinise a value of a big constructor, *and* find something in tags 1..6. That is, how often this new thing avoids a visit to the info table.

That would give useful insight for a blog post (or even a paper) about this. It's extra complexity... how often do we win?

The elapsed-time perf numbers are moving in the right direction, which is great, but it's hard to be sure exactly why... and elapsed time is a flaky number, hard to rely on.

@simonpj Please do not put too hope much into to the numbers, my laptop was on battery when I made them :-( and this machine is not quiet enough for the measurements. Indeed I could not repeat the same score in the meantime, so I'd say they are overly optimistic.

I have a question though. I see CMM code being output like this:

  c49W: // c49T
      _s45W::P64 = R1;
      _c4a2::P64 = _s45W::P64 & 7;
      switch [1 .. 2] _c4a2::P64 {
          case 1 : goto c4a0;
          case 2 : goto c4a1;
      }
  c4a1: // c49T/c4aa
      _s45X::P64 = P64[_s45W::P64 + 6];
      _s45Y::P64 = P64[_s45W::P64 + 14];
      I64[(young<c4ab> + 8)] = c4ab;
      R1 = _s45Y::P64;
      ...
[snip]


  c4a0: // c49T/c4a7
      //tick src<testsuite/tests/simplStg/should_run/T13861.hs:23:10-16>
      R1 = _s45W::P64 & (-8);
      call (I64[R1])(R1) args: 8, res: 0, upd: 8;
}

This looks really crazy to me

  1. in c49W the perfectly tagged pointer is saved as _s45W
  2. then in c4a0 the saved _s45W gets stripped from its ptr-tag, just to be retagged by a call into its constructor entry code

the latter in turn invokes the continuation. Why not invoke the continuation directly with _s45W?

Which part of the CMM-emitter creates c4a0 ?

(I should note that this code is from a ticket related to this review but not the same one: https://ghc.haskell.org/t/13861, but that has no review yet.)

ggreif added a comment.EditedDec 23 2017, 7:14 AM

@simonpj Never mind, it is emitEnter; now I'll go and see who calls that. If I am on the wrong train, just yell!

UPDATE: there is no ticket yet, but a tentative to fix this by https://github.com/ghc/ghc/compare/878936583905%5E...51a1bf064328

@simonpj Unfortunately this relies on pretty fragile name comparison. How would one make this more robust? Can Ids somehow track that they are bound to a scrutinised (WHNF) non-function?

UPDATE: there is no ticket yet, but a tentative to fix this by https://github.com/ghc/ghc/compare/878936583905%5E...51a1bf064328

Interesting. Yes, please make a ticket! (And transfer the info below into it.)

I think the issue is this. Given

data Colour = Red | Green | Blue
f x = case x of y
           Red -> Green
           DEFAULT -> y

(here y is the case binder) we can just return x rather than entering it in DEFAULT branch, because y will be fully evaluated and its pointer will be correctly tagged.

You absolutely can't check for an OccName of "wild"!! That is neither necessary nor sufficient :-).

Instead, check isEvaldUnfolding (idUnfolding y). See Note [Preserve evaluatedness] in CoreTidy. And be sure to augment that Note if you make the change.

I would expect perf benefits to be small on average, but it's simple to implement.

I expect this is fine but there is much I do not understand.

compiler/codeGen/StgCmmExpr.hs
615

What's going on here? What is ptr and info? Why are we partitioning the branches?

630

I just don't understand this code. Can you write a Note to explain, *with an example*? Thanks!

compiler/codeGen/StgCmmUtils.hs
459

I'm lost. Why say emitSwitch x y a b pj when you could instead say emitSwitch x y a b >> pj. And pj is mostly a no op anyway. Why give it this extra argument?

testsuite/tests/codeGen/should_compile/all.T
41

What is being tested here? Add a note to the source file.

Should there not be some expected output?

ggreif added a comment.Jan 2 2018, 6:55 AM

@simonpj I have created https://ghc.haskell.org/trac/ghc/ticket/14626 for the "unnecessary enter" issue.

ggreif added inline comments.Jan 2 2018, 1:15 PM
compiler/codeGen/StgCmmUtils.hs
459

No. Look at line 477, below. There it is equiv to emit <discrete switch> >> pj >> emitLabel join_ll. I should really upload the note here too. Meanwhile here is the link: https://git.haskell.org/ghc.git/commitdiff/57192187a2c086945897259847e3605bd9fea6b9

ggreif added inline comments.Jan 2 2018, 1:24 PM
compiler/codeGen/StgCmmExpr.hs
615

ptr gets bound to the branches that can be switched by looking at the ptr-tag alone. The rest (namely info) will be the branches for which we have to inspect the info table to get a correct tag to switch on.

630

Sure, I can upload the revisions tomorrow, here is the link in the meantime: https://git.haskell.org/ghc.git/commitdiff/57192187a2c086945897259847e3605bd9fea6b9

ggreif added inline comments.Jan 5 2018, 3:36 PM
testsuite/tests/codeGen/should_compile/all.T
41

I am working on a relevant test. Can we use sed in the test suite?

bgamari added inline comments.Jan 7 2018, 12:13 PM
testsuite/tests/codeGen/should_compile/all.T
41

Can we use sed in the test suite?

Yes.

That is indeed a help. But it sits in a larger context: the invariants about pointer tagging. Are these invariants documented anywhere (except in the paper)? Is there a wiki page?

Yes: https://ghc.haskell.org/trac/ghc/wiki/Commentary/Rts/HaskellExecution/PointerTagging

compiler/codeGen/StgCmmUtils.hs
456

What happened to this? I still don't think it's necessary to pass in this FCode () here given that it's a separate code block with its own label.

ggreif added inline comments.Jan 14 2018, 4:01 AM
compiler/codeGen/StgCmmUtils.hs
456

I learned that there are no fall-throughs in Cmm, but my code makes this assumption. I'll rewrite.

bgamari requested changes to this revision.Jan 26 2018, 12:16 PM

Bumping out of the review queue while @ggreif amends.

Also, it would be great if we could get a summarized nofib comparison in the commit message.

This revision now requires changes to proceed.Jan 26 2018, 12:16 PM

Let me know if you could use any help on this, @ggreif.

@bgamari This is pretty much blocked on some not-correctly-exported unfoldings, which cause problems. (NB: At the top of the related blocking-bug-stack I have an unsolved problem (dilemma) with STATIC_IND closures exported by modules. I have a draft writeup about my odyssey, I'll try to send it out to ghc-devs and add a link to it here ASAP.)

This is pretty much blocked on some not-correctly-exported unfoldings, which cause problems

Interesting. Would you like to open a ticket if you think some unfolding are not being correctly exported?

@simonpj I remembered wrong, the importing of the unfolding is problematic. There is a ticket already (with a patch) : https://ghc.haskell.org/t/14677 ,
however that still has its share of problems. IND_STATIC closures are never tagged, but logically (*not* physically) aliasing fully evaluated (tagged) constructors. This will be a bit thorny to fix.

@bgamari @simonpj I have filed https://ghc.haskell.org/t/15155. Please take that with a (huge) grain of salt, as it was months ago when I last looked into the issue. Maybe I did an invalid analysis anyway...